zheng 2015



IEEE TRANSACTIONS ON SUSTAINABLE ENERGY, VOL. 6, NO. 1, JANUARY 2015 11

Raw Wind Data Preprocessing: A Data-Mining Approach

Le Zheng, Wei Hu, and Yong Min

Abstract—Wind energy integration research generally relies on complex sensors located at remote sites. The procedure for generating high-level synthetic information from databases containing large amounts of low-level data must therefore account for possible sensor failures and imperfect input data. The data input is highly sensitive to data quality. To address this problem, this paper presents an empirical methodology that can efficiently preprocess and filter the raw wind data using only aggregated active power output and the corresponding wind speed values at the wind farm. First, raw wind data properties are analyzed, and all the data are divided into six categories according to their attribute magnitudes from a statistical perspective. Next, the weighted distance, a novel concept of the degree of similarity between the individual objects in the wind database and the local outlier factor (LOF) algorithm, is incorporated to compute the outlier factor of every individual object, and this outlier factor is then used to assess which category an object belongs to. Finally, the methodology was tested successfully on the data collected from a large wind farm in northwest China.

Index Terms—Data mining, data preprocessing, local outlier factor (LOF), unsupervised learning.

NOMENCLATURE

V, V(x)            Wind speed measured from the wind farm meteorological mast, wind speed value of object x.
Vci                Cut-in speed of the wind turbine.
Vr                 Rated speed of the wind turbine.
Vco                Cut-out speed of the wind turbine.
P, P(x)            Wind power value measured from SCADA system, when the speed equals V, V(x).
PA, PT             Approximate and accurate wind power value, when the speed equals V.
d(x, y)            Distance between object x and y, i.e., the concept of the degree of similarity between x and y.
ω                  Weight of the notation of the weighted distance.
T                  Tuning parameter of the weighted distance.
Nk(x)              k-distance neighborhood of object x.
Lrd(x)             Local reachability density of object x.
LOF(x)             Local outlier factor of object x.
Ncubic, Nlinear    Number of valid data points detected by the cubic and linear formula weighted distance, respectively.
Ncommon            Number of valid data points detected by both cubic and linear formula weighted distance.

Manuscript received April 23, 2014; revised July 02, 2014 and August 11, 2014; accepted September 03, 2014. Date of publication September 29, 2014; date of current version December 12, 2014. This work was supported in part by the National High Technology Research and Development Program 2011AA05A112 of China, in part by the National Natural Science Foundation of China under Grant 51190101, in part by the Science and Technology Projects of the State Grid Corporation of China SGHN0000DKJS1300221, in part by Hunan Electric Power Corporation, and in part by Ningxia Electric Power Corporation. Paper no. TSTE-00173-2014.

The authors are with the State Key Lab of Power Systems, Department of Electrical Engineering, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSTE.2014.2355837

I. INTRODUCTION

In recent years, wind energy has become a major energy source. Wind farm power curve monitoring [1]–[3] and wind power prediction [4], [5] constitute the foundation of wind energy integration research. Because precise modeling of the wind source is difficult, and the wind turbine is highly nonlinear, researchers prefer data-mining methods over analytical methods to generate high-level synthetic information (also known as knowledge) from the low-level data collected by real-time data acquisition systems. However, as the acquisition and transmission of wind data rely on the reliability of sensors that are located at remote sites exposed to an open, uncontrolled, and even harsh environment, there is a relatively high probability of the occurrence of incorrect data. On the other hand, unnatural operating states of a wind farm cause unnatural data. For example, wind curtailment because of congestion or load balancing purposes or wind turbine shutdown because of mechanical faults or maintenance will result in unnatural data, which have normal wind speed and abnormal wind power output below the theoretical values corresponding to the wind speed. Both incorrect and unnatural data affect the performance of the data-based research, as data-mining methods are highly sensitive to data quality. Incorrect and unnatural data should be detected and preprocessed before the integration studies.

Different approaches have been proposed to address preprocessing problems for various types of data [6]–[10], including load data, remote terminal unit (RTU) data, geophysical data, fingerprint image data, and photovoltaic data. However, few papers specifically discuss wind data preprocessing. The preprocessing descriptions comprise only a small part of the relevant works.

Schlechtingen and Santos [11] presented a wind data preprocessing method including four steps: 1) validity check; 2) data scaling; 3) missing data processing; and 4) lag removal. The validity check involves a data range check that detects data values exceeding the physical limits. Data scaling normalizes data with the ratings. Missing data processing involves either neglecting or approximating the missing values. Lag removal uses the cross correlation function to identify the lag between input and output, which is useful when dealing with time-series analysis. This method, especially the validity check and missing data processing steps, has been widely used in the published literature [2]–[4].

1949-3029 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

However, the method proposed in [11] did not consider unnatural data. References [12] and [13] discussed unnatural data by classifying the raw data according to the magnitude of the wind speed and the wind power output. A neural network classifier was trained with the wind speed and wind power as input and the classification result as output. Then, the neural network was used to classify more data. As a type of supervised learning algorithm, this method can achieve relatively high accuracy as long as the classification result is accurate, i.e., the correct class is precisely determined for every single data point. Liu et al. [12] classified the data points according to manual judgment, whereas Ding [13] classified the data points based on the wind farm operation state records.

In real-world applications, manual judgment is limited and inconvenient when the database is large, and the wind farm operation state records are often unavailable. Thus, these data classification procedures are infeasible or unreliable, which makes supervised learning algorithms difficult to apply. The alternative is to use unsupervised algorithms. We therefore adopted an unsupervised learning approach based on the local outlier factor (LOF)-identifying algorithm introduced by Breunig et al. [14]. The LOF of every data point is computed using a novel concept of the degree of similarity among the individual data points, and hence invalid data are detected by their abnormal outlier factors.

The contribution of this paper is to develop an empirical methodology for raw wind data preprocessing. The only information required for this methodology is the aggregated wind power output of the wind farm collected from the Supervisory Control And Data Acquisition (SCADA) system, which is available at the dispatch center, and the wind speed magnitude data at the corresponding wind farm site. The availability of wind farm operation state records or wind turbine fault logs (which are not recorded or stored by most wind farm operators) will help improve the accuracy of the methodology. If these data are unavailable, which is often the case, the methodology proposed in this paper has nonetheless been proved to be adequate for the situation. The rest of this paper is organized as follows. Section II studies the properties of raw wind data and notes the possible causes of data errors. Section III proposes the preprocessing methodology with an emphasis on the formulation of the LOF algorithm and the concept of the weighted distance. Section IV discusses the uncertainty management of the proposed algorithm. Section V presents test results and Section VI presents the conclusion.

II. RAW WIND DATA PROPERTIES

All the data used in this paper are collected from a riverside wind farm in the Gansu Province of China, at a temporal resolution of 15 min. The wind farm contains 100 identical direct-driven magnet wind turbines, each rated at 1500 kW. Wind speed represents the value measured at the wind meteorological mast (there is only one meteorological mast at the wind farm), and wind power denotes the aggregated generation from the entire wind farm.

TABLE I. RAW DATA CLASSIFICATION

Fig. 1. Raw scatter plot of wind farm output and wind speed.

TABLE II. SAMPLE OF INVALID WIND DATA

Raw wind data can be divided into six categories according to the wind speed and power values, as shown in Table I. Fig. 1 and Table II show some possible examples of invalid data in scatter plot and numerical format, respectively. The data were collected from the period 11/21/2012 12:00 A.M. to 5/19/2013 8:00 A.M. Invalid data include primarily missing data, constant data, exceeding data, irrational data, and unnatural data. The existence of incorrect data might be due to sensor failures or transmission errors. Unnatural data indicate data with low power output during high wind speed periods because not all turbines are always online. When the wind speed is higher than the cut-out speed, the wind turbine is forced to shut down by the high-speed protection protocol. Another possible reason is that the grid cannot absorb excess wind energy, and thus the wind turbine is shut down by dispatch instructions. In addition, Fig. 1 shows that there is a slender, approximately vertical band around the cut-out speed because of wake effects, i.e., the turbines within a wind farm do not cut out together near the cut-out speed because the wind speeds at each turbine vary from the speed value measured at the mast. The wake effects data should be categorized as valid data because they reflect the natural output fluctuation property of the wind farm around the cut-out speed, which is important to the system operators. It is, therefore, necessary to distinguish them from the unnatural data.

Fig. 2. Distribution of raw wind data: (a) period from 10/1/2010 to 1/31/2011; (b) period from 2/1/2011 to 5/31/2011; and (c) period from 6/1/2011 to 9/30/2011.

The constant data, missing data, and exceeding data can easily be detected by the method proposed in [11]. Among the remaining data [the irrational, unnatural, and valid (IUV) data], the unnatural data and the irrational data should be given more attention. To study the distribution of the different data categories, the raw wind data in a complete year (from 10/1/2010 to 9/30/2011) from the same wind farm have been analyzed. The raw wind data distribution histogram (shown in Fig. 2) indicates that the number of invalid data points (including the irrational and the unnatural data) is much smaller than the number of valid data points, even in winter, when more wind curtailments result in more unnatural data [see Fig. 2(a)]. In addition, Fig. 1 illustrates that the valid data are much more tightly clustered than the unnatural and irrational data. Within the areas of the unnatural and the irrational data, the density is considerably lower than the density in the area of the valid data. Both the unnatural and the irrational data can be considered outliers or noise compared to the valid data.

Fig. 3. Structure of preprocessing system.

TABLE III. CONSTANT DATA PROCESSING ALGORITHM

TABLE IV. MISSING DATA PROCESSING ALGORITHM

Therefore, outlier detection, which tries to identify exceptional cases that deviate substantially from the majority patterns [15], can be used to exclude the unnatural and the irrational data. Furthermore, from the simplicity point of view, as a type of unsupervised learning, outlier detection can learn relationships and structure from the attributes of the data themselves [16], so that the classification step in [12] and [13] is no longer necessary.

III. PREPROCESSING METHODOLOGY

A. Wind Data Preprocessing Method

Fig. 3 shows the structure of the proposed preprocessing method. The constant data processing block, the missing data processing block, and the physical range check block can easily be implemented via several if–then statements, as shown in Tables III–V. Regarding imputation of the invalid data, the major imputation approaches [21] are to fill or predict the missing values based on the nearby observed values. However, because the invalid data often persist for a relatively long time, there are insufficient data to make a smooth imputation, which may only introduce more incorrect data to the database. Moreover, there is a large amount of data available, so we can obtain sufficiently interesting patterns from the remaining data that the effect of the pattern losses with the removal of the invalid data is limited. Therefore, no approximation is performed after removing the invalid data.

TABLE V. EXCEEDING DATA PROCESSING ALGORITHM
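The constant, missing, and exceeding data checks amount to a few if–then rules. A minimal Python sketch follows. The specific rules are assumptions for illustration: a record is treated as missing if either value is None or NaN, as exceeding if it falls outside an assumed physical range, and as constant if the same speed and power pair repeats for `min_run` consecutive samples. The names `basic_filter`, `max_speed`, and `min_run` and their default values are hypothetical, not the conditions of Tables III–V.

```python
import math


def basic_filter(speed, power, rated_power, max_speed=40.0, min_run=4):
    """Return the indices of records that pass the three simple checks."""
    n = len(speed)
    keep = [True] * n
    for i in range(n):
        v, p = speed[i], power[i]
        # Missing data: either attribute is absent or NaN.
        if v is None or p is None or math.isnan(v) or math.isnan(p):
            keep[i] = False
        # Exceeding data: value outside the assumed physical range.
        elif not (0.0 <= v <= max_speed) or not (0.0 <= p <= rated_power):
            keep[i] = False
    # Constant data: a run of min_run or more identical consecutive records.
    start = 0
    for i in range(1, n + 1):
        if i == n or (speed[i], power[i]) != (speed[start], power[start]):
            if i - start >= min_run:
                for j in range(start, i):
                    keep[j] = False
            start = i
    # Invalid records are dropped, not imputed.
    return [i for i in range(n) if keep[i]]
```

In line with the text, flagged records are simply removed and no replacement values are estimated.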

Data scaling is performed by applying the following equation:

$$\bar{x} = \frac{x}{x_r} \tag{1}$$

where x is a variable, xr is the rating, and x̄ is the normalized value of the variable. Similarly, we use the “bar” notation to denote the normalized value of a variable.
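As a small illustration of (1), the sketch below normalizes aggregated power by the farm rating. Taking the farm rating as 100 turbines times 1500 kW follows from the wind farm description; using it as the rating for the aggregated power is an assumption, and `normalize` is a hypothetical helper name.

```python
def normalize(values, rating):
    """Apply eq. (1): divide each value by its rating."""
    return [x / rating for x in values]


# Farm rating from the text: 100 turbines x 1500 kW each (assumed rating
# for the aggregated power output).
farm_rating_kw = 100 * 1500
p_bar = normalize([0.0, 75000.0, 150000.0], farm_rating_kw)
```

Wind speed would be scaled the same way with its own rating.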

For the irrational and the unnatural data processing block, the LOF identifying algorithm is applied.

B. LOF Algorithm

The LOF algorithm was first proposed by Breunig et al. in 2000 [14]. The algorithm tries to assign to each object in the database a degree of being an outlier from the global perspective. The degree is called the LOF of an object. The key difference between the LOF algorithm and others is that LOF considers being an outlier to be a continuous property rather than a binary property. The formal definition of LOF is listed as follows. More detailed discussion can be found in [14].

1) Definition 1: k-distance of an object x. For any positive integer k, the k-distance of object x, denoted as k-distance(x), is defined as the distance d(x, y) between x and an object y ∈ D, such that for at least k objects z ∈ D\{x}, d(x, z) ≤ d(x, y) holds, and for at most k − 1 objects z ∈ D\{x}, d(x, z) < d(x, y) holds.

2) Definition 2: k-distance neighborhood of an object x. Given the k-distance of x, the k-distance neighborhood of x contains every object whose distance from x is not greater than the k-distance, as shown in (2). These objects y are called the k-nearest neighbors of x

$$N_k(x) = \{\, y \in D \setminus \{x\} \mid d(x, y) \le k\text{-distance}(x) \,\}. \tag{2}$$

3) Definition 3: Reachability distance of an object x with respect to object y. The reachability distance of object x with respect to object y is defined, following [14], as

$$\text{reach-dist}_k(x, y) = \max\{\, k\text{-distance}(y),\ d(x, y) \,\}. \tag{3}$$

4) Definition 4: Local reachability density of an object x. The local reachability density of x is defined as

$$\mathrm{Lrd}_{MinPts}(x) = \frac{|N_{MinPts}(x)|}{\sum_{y \in N_{MinPts}(x)} \text{reach-dist}_{MinPts}(x, y)}. \tag{4}$$

5) Definition 5: LOF of an object x. The LOF of x is defined as

$$\mathrm{LOF}_{MinPts}(x) = \frac{\sum_{y \in N_{MinPts}(x)} \dfrac{\mathrm{Lrd}_{MinPts}(y)}{\mathrm{Lrd}_{MinPts}(x)}}{|N_{MinPts}(x)|}. \tag{5}$$

The parameter MinPts is used to define the concept of density, i.e., specifying a minimum number of objects in the neighborhood of an object x. For most objects in a cluster, the outlier factors are approximately equal to 1. For the outliers, the outlier factors are larger than 1. We can generally define an LOF-threshold value, which is determined by trial and error to obtain the best performance. Objects with outlier factors greater than the LOF-threshold value are outliers.
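Definitions 1–5 can be sketched as a brute-force routine in plain Python. This is an illustrative sketch, not the authors' implementation: `lof_scores` is a hypothetical name, the parameter `k` plays the role of MinPts, the reachability distance uses the k-distance of the neighbor as in the original LOF paper [14], and the O(n²) distance table is only practical for small data sets. The distance function is passed in so that any similarity measurement can be plugged in.

```python
import math


def lof_scores(points, k, dist):
    """Return the local outlier factor of every point (brute force)."""
    n = len(points)
    # Pairwise distances d(x, y).
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((d[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]                     # k-distance, Definition 1
        k_dist.append(kd)
        # k-distance neighborhood, eq. (2); ties can give more than k members.
        neigh.append([j for dij, j in others if dij <= kd])
    # Local reachability density, eq. (4).
    lrd = []
    for i in range(n):
        total = sum(max(k_dist[j], d[i][j]) for j in neigh[i])
        # Assumes few exact duplicates; total == 0 would mean infinite density.
        lrd.append(len(neigh[i]) / total if total > 0 else float("inf"))
    # Local outlier factor, eq. (5).
    return [sum(lrd[j] / lrd[i] for j in neigh[i]) / len(neigh[i])
            for i in range(n)]


# Four points on a unit square plus one far-away point (made-up data).
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(pts, k=2, dist=math.dist)
outliers = [i for i, s in enumerate(scores) if s > 1.1]  # LOF-threshold 1.1
```

The four clustered points receive a factor of 1, while the isolated point receives a much larger factor and is the only one above the threshold.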

C. Similarity Measurement

Choosing the similarity/distance measurement or the relationship model to describe data objects is critical in LOF [15]. Because the hypothesis space is two-dimensional (2-D), the most commonly used similarity measurement is the Euclidean distance for the purpose of proper visualization

$$d(x, y) = \sqrt{(\bar{V}(x) - \bar{V}(y))^2 + (\bar{P}(x) - \bar{P}(y))^2} \tag{6}$$

where d(x, y) denotes the distance between object x and y. V̄(x) and P̄(x) represent the normalized value of wind speed and power output of object x, respectively.

However, after applying the LOF algorithm with the Euclidean distance to the data set, the test result indicates that the Euclidean distance measurement may fail to detect certain unnatural data (see Section V) because the Euclidean distance considers the wind speed dimension and the wind power dimension equally. In fact, the wind power dimension has greater impact on the result because, although the unnatural data have correct wind speed values, their wind power values are unnatural (see Fig. 1). Therefore, the wind power dimension is the outlier attribute in the unnatural data detection procedure.

Based on this understanding, the weighted distance is introduced to measure the similarity between objects, where the outlier attribute is assigned a larger weight. To determine a proper form of the weight, prior knowledge about the wind power curve should be considered.

Many studies have reported the three-region theoretical wind turbine power curve [17]–[19], as shown in Fig. 4. The scatter plot of raw wind data in Fig. 1 also shows the same shape characteristics of the power curve. Hence, the points corresponding to the valid data are distributed near the power curve, whereas the points corresponding to the unnatural and the irrational data are far away. Therefore, the weight can be formulated based on the difference between the measured and the true value of wind power, as follows.

Fig. 4. Wind turbine power curve.

1) When Vci ≤ V < Vr,

$$\omega = \begin{cases} 1, & |\bar{P}_T - \bar{P}| \le 0.1 \\ |\bar{P}_T - \bar{P}| / 0.1, & |\bar{P}_T - \bar{P}| > 0.1. \end{cases} \tag{7}$$

2) When V < Vci or Vr ≤ V < Vco,

$$\omega = \begin{cases} 1, & |\bar{P}_T - \bar{P}| \le 0.05 \\ |\bar{P}_T - \bar{P}| / 0.05, & |\bar{P}_T - \bar{P}| > 0.05. \end{cases} \tag{8}$$

3) When V ≥ Vco,

$$\omega = 1 \tag{9}$$

where Vci, Vr, and Vco represent the cut-in, rated, and cut-out speed of the wind turbine. P̄T denotes the normalized true value of wind power.
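The three cases (7)–(9) map directly onto a short function. The sketch below is illustrative only: `weight` is a hypothetical name, `p_ref_bar` stands for the normalized reference power at speed `v`, and the speeds 3, 12, and 25 m/s in the usage lines are made-up turbine parameters, not values from the paper.

```python
def weight(v, p_bar, p_ref_bar, v_ci, v_r, v_co):
    """Weight of one object according to eqs. (7)-(9)."""
    if v >= v_co:
        return 1.0                                # eq. (9)
    # Tolerance band: 0.1 between cut-in and rated speed, 0.05 otherwise.
    tol = 0.1 if v_ci <= v < v_r else 0.05        # eq. (7) vs. eq. (8)
    dev = abs(p_ref_bar - p_bar)
    return 1.0 if dev <= tol else dev / tol


# Hypothetical turbine with cut-in 3, rated 12, and cut-out 25 m/s.
w_near = weight(15.0, 0.98, 1.0, 3.0, 12.0, 25.0)  # inside the band
w_far = weight(8.0, 0.2, 0.5, 3.0, 12.0, 25.0)     # deviation 0.3, band 0.1
```

An object inside the tolerance band keeps weight 1, while the second object deviates by three band widths and so receives a weight of about 3.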

An object that is close to the power curve is defined as being located in the [P̄T − 0.1, P̄T + 0.1] interval when Vci ≤ V < Vr and in the [P̄T − 0.05, P̄T + 0.05] interval when V < Vci or Vr ≤ V < Vco, based on domain experience and past studies. The weights assigned to these objects are 1, identical to the Euclidean distance. Additionally, the weights of the data whose wind speed values are at or above the cut-out speed are also equal to 1, to extract the natural properties from the wake effects data. The weight of the other data is larger than 1, in proportion to the difference between the measured and the accurate value of the wind power. The farther away an object is located, the greater the weight, and the more likely the object is to be detected as an outlier. Thus the weighted distance considers sticking close to the power curve as an auxiliary factor of being valid, which is achieved by applying the following equation:

$$d(x, y) = \sqrt{(\bar{V}(x) - \bar{V}(y))^2 + \omega^{T} \cdot (\bar{P}(x) - \bar{P}(y))^2} \tag{10}$$

where the notations are identical to those in (6). T ≥ 0 is a tuning parameter, to be determined separately.
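A minimal sketch of (10) is given below. The text does not state which object's weight enters d(x, y), so this sketch assumes the larger of the two weights is used, which keeps the distance symmetric; that choice, the tuple layout, and the name `weighted_distance` are all assumptions for illustration.

```python
import math


def weighted_distance(x, y, T):
    """Eq. (10) for objects given as (v_bar, p_bar, omega) tuples."""
    omega = max(x[2], y[2])  # assumption: use the larger of the two weights
    dv = x[0] - y[0]
    dp = x[1] - y[1]
    return math.sqrt(dv * dv + omega ** T * dp * dp)


a = (0.0, 0.0, 1.0)   # made-up object on the power curve
b = (0.3, 0.4, 4.0)   # made-up object with weight 4
d_euclid = weighted_distance(a, b, T=0.0)   # reduces to eq. (6)
d_weighted = weighted_distance(a, b, T=1.0)
```

With T = 0 the factor ω^T equals 1 and the Euclidean distance of (6) is recovered; larger T stretches the power dimension for heavily weighted objects.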

However, the accurate wind power curve is ambiguous and cannot be determined exactly. Therefore, we have to use an approximate curve in place of the accurate power curve, i.e., replace the true value P̄T in (7) and (8) by the approximate value P̄A. According to a review of the literature [1], [2], [17]–[20], there are two main types of approximations. The simplest way is to represent the whole wind farm with a single equivalent wind turbine and the corresponding approximate power curve. In general, the wind turbines of a wind farm are purchased from the same manufacturer and have the same technical parameters. Thus, the approximate power curve of the equivalent wind turbine can be established by multiplying the power curve model of a single wind turbine by the total wind turbine number of the farm. The two most commonly and easily used wind turbine power curve models are given by (11) and (12), called the cubic model and the linear model, respectively. All the parameters are given by the manufacturers

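$$\bar{P}_A = \begin{cases} 0, & 0 < V < V_{ci} \\ \dfrac{V^3 - V_{ci}^3}{V_r^3 - V_{ci}^3}, & V_{ci} \le V < V_r \\ 1, & V_r \le V < V_{co} \\ 0, & V_{co} \le V \end{cases} \tag{11}$$

$$\bar{P}_A = \begin{cases} 0, & 0 < V < V_{ci} \\ \dfrac{V - V_{ci}}{V_r - V_{ci}}, & V_{ci} \le V < V_r \\ 1, & V_r \le V < V_{co} \\ 0, & V_{co} \le V \end{cases} \tag{12}$$

where P̄A denotes the normalized approximate value of wind power. Other notations are identical to those in (8).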
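The cubic and linear models (11) and (12) translate into two small functions. The sketch below is illustrative; the function names are hypothetical and the speeds 3, 12, and 25 m/s are made-up parameters, since the manufacturer values are not given in the text.

```python
def cubic_curve(v, v_ci, v_r, v_co):
    """Normalized approximate power from the cubic model, eq. (11)."""
    if v < v_ci or v >= v_co:
        return 0.0
    if v < v_r:
        return (v ** 3 - v_ci ** 3) / (v_r ** 3 - v_ci ** 3)
    return 1.0


def linear_curve(v, v_ci, v_r, v_co):
    """Normalized approximate power from the linear model, eq. (12)."""
    if v < v_ci or v >= v_co:
        return 0.0
    if v < v_r:
        return (v - v_ci) / (v_r - v_ci)
    return 1.0


# Hypothetical turbine: cut-in 3, rated 12, cut-out 25 m/s.
p_cubic = cubic_curve(7.5, 3.0, 12.0, 25.0)
p_linear = linear_curve(7.5, 3.0, 12.0, 25.0)
```

Between cut-in and rated speed the cubic model lies below the linear one, so the two approximations assign different weights to the same object; this difference is what the variance in Section IV measures.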

More accurate approximations can be found in [1] and [20]. According to the comparative study in [20], these proposed approximations result in error rates of approximately 1%, whereas the error of the cubic or the linear model is less than 8%. However, the approximation methodologies are mainly based on field measurement data or historical data, which are not yet validated or available at the preprocessing stage. In other words, the cubic or the linear model is all we have when dealing with preprocessing problems, especially when considering a brand new wind farm.

IV. UNCERTAINTY MANAGEMENT

A. Bias–Variance Tradeoff

Another significant aspect of data-mining methods is the uncertainty management. When evaluating the performance of the similarity measurements, the difference between the approximate and the true values of the wind power should be considered. By analogy to the uncertainty management of supervised learning, we denote by variance the amount by which the detection result would change if we formulated the weighted distance using a different approximate power curve. Although different formulas of the power curve model will result in different detection results, ideally the result should not vary too much between approximations. If a similarity measurement has high variance, then small changes in the approximation can result in large changes in the detection result, and vice versa. Hence, a similarity measurement with low variance is more reliable, especially when the accurate power curve is unavailable.

We denote by bias the detection failure introduced by the similarity measurement. For example, the Euclidean distance measurement assumes that the wind power and speed dimensions have the same impact on detection results. However, the wind power dimension has a greater impact on unnatural data. Thus, no matter how much data we are given, it will not be possible to produce an accurate detection using the Euclidean distance. The Euclidean distance results in high bias in unnatural data detection.

Generally, as the weight increases, the variance will increase, and the bias will decrease. The tuning parameter serves to control the relative impact of the deviations on the detection result, i.e., trading off between variance and bias. When T = 0, the weight has no effect and the accuracy is limited, but no approximate-true deviation is introduced into the procedure. However, as T → ∞, the impact of the weight increases, so that the detection result is more constrained to the shape of the approximate curve, decreasing the bias between the detection results and the approximate power curve and increasing the variance of the model.
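The role of T can be made concrete with a short worked case based on (10). For T = 0 the weight drops out and the Euclidean distance of (6) is recovered:

$$T = 0:\quad \omega^{0} = 1 \;\Rightarrow\; d(x, y) = \sqrt{(\bar{V}(x) - \bar{V}(y))^2 + (\bar{P}(x) - \bar{P}(y))^2}.$$

For an object with an illustrative weight of ω = 4 (a made-up value), the factor applied to the squared power difference grows with T:

$$\omega^{T}\big|_{\omega = 4} = \begin{cases} 1, & T = 0 \\ 2, & T = 0.5 \\ 4, & T = 1 \\ 16, & T = 2. \end{cases}$$

Objects with ω = 1 are unaffected for every T, so increasing T only pushes objects that deviate from the approximate curve farther from their neighbors.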

B. Parameter Selection

The method for selecting the MinPts and the LOF-threshold can be found in [14]. In this paper, MinPts equals 300 and LOF-threshold is 1.1. Selecting a good value for T is critical. However, unlike supervised learning, there are no outputs by which to supervise the learning; hence, the most common performance evaluation methods (such as cross validation) cannot be used. Because the task is outlier detection in a 2-D space, the simplest way to evaluate the accuracy of the algorithm is by visual inspection. Another way is to choose the T value that results in the lowest bias plus variance value (denoted by bias + variance). As a general rule, as we increase the value of T, the bias tends initially to decrease faster than the variance increases. Consequently, the expected bias + variance declines. However, at some point, increasing the value of T has little impact on the bias but starts to increase the variance significantly. When this happens, the bias + variance increases.

According to the definition of bias in Section IV-A, bias measures the detection performance of the algorithm, so the value of bias is defined as ∞ if the algorithm fails to detect all of the irrational and unnatural data and as 0 otherwise. In the same way, variance is used to assess the differences among various power curve approximations, and the value of variance is computed by comparing the detection results of the different approximations applied. In this paper, we use the two models described in (11) and (12), i.e., the cubic model and the linear model. The variance is low if most of the outliers detected using different approximation formulas coincide, which is computed by

$$\text{Variance} = \frac{(N_{cubic} - N_{common}) + (N_{linear} - N_{common})}{N_{common}} \times 100 \tag{13}$$

where Ncubic and Nlinear denote the number of valid data points detected by the cubic and linear formula-weighted distances, respectively. Ncommon denotes the number of valid data points detected by both cubic and linear formula-weighted distances.

Fig. 5. Filtered scatter plot of wind data with the Euclidean distance.

Fig. 6. Outlier factor distributions.
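Equation (13) and the bias + variance rule of this section can be sketched with Python sets. This is an illustrative sketch: `variance_percent` and `select_T` are hypothetical names, the bias flag is assumed to be supplied by the analyst (for example from visual inspection), and all numbers in the usage lines are made up rather than taken from the test results.

```python
import math


def variance_percent(valid_cubic, valid_linear):
    """Eq. (13): disagreement between the two models, in percent.

    Each argument is the set of record indices kept as valid by one model.
    """
    n_common = len(valid_cubic & valid_linear)
    only_cubic = len(valid_cubic) - n_common
    only_linear = len(valid_linear) - n_common
    return (only_cubic + only_linear) / n_common * 100


def select_T(candidates):
    """Pick the T with the lowest bias + variance.

    candidates: iterable of (T, detects_all_invalid, variance) tuples, where
    bias is 0 if all invalid data are detected and infinity otherwise.
    """
    best_T, best_score = None, math.inf
    for T, detects_all, variance in candidates:
        bias = 0.0 if detects_all else math.inf
        score = bias + variance
        if score < best_score:
            best_T, best_score = T, score
    return best_T


# Made-up example: two models that disagree on two of eleven records.
v = variance_percent(set(range(1, 11)), set(range(2, 12)))
# Made-up candidates: (T, all invalid data detected?, variance in percent).
best = select_T([(0.3, False, 0.5), (0.7, True, 2.0), (1.0, True, 5.0)])
```

Any T that misses invalid data has infinite bias and is never selected, so the rule returns the smallest-variance T among those that detect everything.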

V. TEST RESULTS AND DISCUSSION

A. Test Results

The data described in Section II are used to test the proposed method. The number of objects in the data set is 18 001, with 1902 missing data points, 1694 constant data points, and 594 exceeding data points. The remaining 13 811 data points are the input of the LOF algorithm. Fig. 5 shows the filtered scatter plot of the wind data with the Euclidean distance, and Fig. 6 shows the corresponding outlier factor distribution. Fig. 5 indicates that some unnatural data cannot be detected using the Euclidean distance. In certain seasons, especially in winter, the grid cannot absorb excess wind energy at valley load periods, and wind curtailment occurs so frequently that those unnatural data have high-density neighborhoods and small outlier factors.
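For readers who want to reproduce the Euclidean-distance baseline, the following is a minimal pure-Python sketch of the LOF computation from [14]. It is not the authors' MATLAB code. The function name `lof_scores` and the toy data are assumptions, and two simplifications are made: exactly `min_pts` neighbours are used even when distances tie, and duplicate points are not handled. It stores the full distance matrix, so it only suits small examples.

```python
import math


def lof_scores(points, min_pts):
    """Local outlier factor of every point, using the Euclidean distance.

    Simplifications (assumptions of this sketch): exactly min_pts neighbours
    are used even when distances tie, and duplicate points are not handled.
    """
    n = len(points)
    if not 0 < min_pts < n:
        raise ValueError("min_pts must be between 1 and len(points) - 1")
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbours, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        knn = order[:min_pts]
        neighbours.append(knn)
        k_dist.append(dist[i][knn[-1]])  # distance to the min_pts-th neighbour

    # Local reachability density: inverse of the mean reachability distance,
    # where reach(i, j) = max(k_dist[j], dist[i][j]).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


# Toy data: a regular 5 x 5 grid plus one isolated point (not wind data).
grid = [(float(i), float(j)) for i in range(5) for j in range(5)]
scores = lof_scores(grid + [(20.0, 20.0)], min_pts=3)
flagged = [i for i, s in enumerate(scores) if s > 1.1]  # threshold from the text
```

Points deep inside a uniformly dense region score close to 1, while the isolated point scores far above the threshold. This also illustrates the weakness noted above: a dense group of curtailed points would score close to 1 under the Euclidean distance and escape detection.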


TABLE VI
RESULTS OF VARIOUS TUNING PARAMETERS

Fig. 7. Filtered scatter plot of wind data with the weighted distance: (a) cubic approximation model, T = 0.5; (b) linear approximation model, T = 0.5; (c) cubic approximation model, T = 0.7; and (d) linear approximation model, T = 0.7.

We then test the LOF algorithm using the weighted distance. Table VI shows the detection results for various values of the tuning parameter T. The middle two columns indicate the number of objects filtered by the cubic and the linear approximation models, respectively. The fourth column specifies the common objects filtered by both models. The last column shows the bias + variance values, as defined in Section IV-B. When the value of T is less than 0.7, both the cubic and the linear approximation models fail to detect all of the unnatural data. When the value increases above 0.7, the cubic and the linear approximation models can detect all the unnatural data accurately. As the value increases, the number of common objects decreases. Fig. 7 shows the filtered scatter plots of the experiments applying the weighted distance with various tuning parameter values. Visual inspection can verify the descriptions above.

Good performance of a similarity measurement requires both low variance and low bias. Therefore, we choose 0.7 as the tuning parameter value to ensure detection accuracy and robustness. The value of the tuning parameter may vary with different databases, but the determination procedure will not change much.
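The selection procedure described above can be sketched as follows. The function name `select_tuning_parameter` and the input layout are assumptions, and the numbers in the example are made up for illustration; they are not the values in Table VI. The count of undetected invalid points is assumed to come from an external judgment such as visual inspection.

```python
import math


def select_tuning_parameter(results):
    """Pick the T with the lowest bias + variance.

    results maps each candidate T to (n_nondetected, variance), where
    n_nondetected is the number of invalid points the detector missed
    (judged, e.g., by visual inspection) and variance follows (13).
    """
    def score(item):
        _, (n_nondetected, variance) = item
        # Bias is infinite as soon as any invalid point goes undetected.
        bias = math.inf if n_nondetected > 0 else 0.0
        return bias + variance

    best_t, _ = min(results.items(), key=score)
    if math.isinf(score((best_t, results[best_t]))):
        raise ValueError("every candidate T leaves invalid data undetected")
    return best_t


# Made-up numbers for illustration only; they are NOT the Table VI values.
candidates = {0.5: (12, 1.0), 0.6: (3, 1.5), 0.7: (0, 2.0), 0.8: (0, 3.5)}
print(select_tuning_parameter(candidates))  # 0.7
```

Because the bias is either 0 or ∞, the rule reduces to taking the lowest-variance T among those that detect all invalid data.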

TABLE VII
CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE CUBIC APPROXIMATION MODEL, T = 0.5

TABLE VIII
CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE CUBIC APPROXIMATION MODEL, T = 0.7

TABLE IX
CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE CUBIC APPROXIMATION MODEL, T = 0.8

The approach is run on a PC with an Intel i5 CPU with a 3.19-GHz clock and 2 GB of RAM. The algorithm is programmed in MATLAB. A single outlier factor computation requires approximately 1 min. Because selecting the value of parameter T requires several trials, the total computation time is approximately 10 min.

B. Performance Validation

All the irrational and the unnatural data are labeled as invalid, as opposed to the valid data. Thus, the proposed algorithm behaves much like a binary classifier. Analogous to any binary classifier, the algorithm can make two types of detection errors: 1) the algorithm can incorrectly assign an object that is invalid to the valid category, which is termed a nondetection, or 2) the algorithm can incorrectly assign an object that is valid to the invalid category, which is termed a false alarm. It is often of interest to determine which of these two types of errors is being made. The confusion matrix is a convenient way to display this information. Tables VII–XII show the confusion matrices of the algorithm using the cubic and linear approximation models at various T values.

TABLE X
CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE LINEAR APPROXIMATION MODEL, T = 0.5

TABLE XI
CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE LINEAR APPROXIMATION MODEL, T = 0.7

TABLE XII
CONFUSION MATRIX OF THE WEIGHTED DISTANCE ALGORITHM USING THE LINEAR APPROXIMATION MODEL, T = 0.8

Table VII reveals that when T = 0.5, the weighted distance algorithm using the cubic approximation model flags a total of 3880 data points as invalid. Of these data, 3551 are actually invalid, and 329 are not, i.e., are false alarms. Meanwhile, 224 genuinely invalid data points are not detected by the algorithm, i.e., are nondetection errors. In the same way, we can tell how many false-alarm and nondetection errors are made by each model.
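The four cells of such a confusion matrix can be tallied as in the sketch below. The function name `confusion_counts`, the cell names, and the toy labels are assumptions for illustration. In practice the ground-truth labels would have to come from manual labeling or operation records.

```python
def confusion_counts(actually_invalid, flagged_invalid):
    """Tally the four confusion-matrix cells for an invalid-data detector.

    Both arguments are equal-length sequences of booleans, one per data
    point: the ground truth and the detector's decision (hypothetical inputs).
    """
    if len(actually_invalid) != len(flagged_invalid):
        raise ValueError("label sequences must have the same length")
    counts = {"detected": 0, "false_alarm": 0, "nondetection": 0, "valid_kept": 0}
    for truth, flag in zip(actually_invalid, flagged_invalid):
        if truth and flag:
            counts["detected"] += 1      # invalid and flagged
        elif flag:
            counts["false_alarm"] += 1   # valid but flagged
        elif truth:
            counts["nondetection"] += 1  # invalid but missed
        else:
            counts["valid_kept"] += 1    # valid and kept
    return counts


# Toy labels (not data from the paper).
truth = [True, True, True, False, False, False, False, False]
flagged = [True, True, False, True, False, False, False, False]
print(confusion_counts(truth, flagged))
# {'detected': 2, 'false_alarm': 1, 'nondetection': 1, 'valid_kept': 4}
```

In the terms of Table VII, the 3880 flagged points split into 3551 detected and 329 false alarms, and the 224 missed points are the nondetections.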

As stated in Section III-A, only the valid data detected by the algorithm are to be used in further research. False alarms risk losing wind speed-power patterns because the data flagged as “invalid” are removed, whereas nondetections introduce incorrect information into further research, so nondetection errors are more serious than false alarms. When T > 0.7, the nondetection is 0 in both the cubic and linear approximation models. As T increases, more useful patterns will be lost as the detected valid data decrease. Hence, T = 0.7 is the optimal value, and the accuracy of the algorithm is approximately 95.45%. In addition, most false-alarm errors occur around the boundary of the valid data area, especially in the wake effects data area. This is because valid data located near the boundary are more dispersed and thus have lower density. This is a common limitation of all density-based outlier detection algorithms. Future work will be done to improve the detection accuracy of the boundary data, especially the wake effects data.

The neural network-based method proposed in [12] reports an accuracy of 96.5%. However, as training the neural network is time-consuming, the computation time is much longer than the computation time of the methodology proposed in this paper. Moreover, as stated in Section I, the neural network method is a type of supervised learning algorithm, and the accuracy depends on precise classification of the training set, which is inconvenient and unavailable in most situations.

VI. CONCLUSION

In this paper, raw wind data properties were analyzed. Invalid data can be categorized into five types. A wind data preprocessing methodology has been proposed. Because identifying the unnatural and the irrational data is challenging, this paper treats them as outliers and uses the LOF algorithm to detect and remove these outliers. To incorporate prior knowledge regarding the wind data, a new type of similarity measurement is designed and applied in the algorithm. Numerical experiments have verified the effectiveness of the algorithm and the similarity measurement. The performance evaluation of the algorithm has also been discussed.

One of the greatest advantages of the proposed methodology is that it is a type of unsupervised learning algorithm. Therefore, it can detect and classify the raw data using solely the attributes of the data themselves. It is easier and more convenient to perform in practice, especially when the operation records are not available. However, as there is no universal data-mining algorithm that can handle all problems, this methodology has its limitations. First, the total number of data points should not be too small. An empirical minimum value is approximately 1000. Second, if most of the data are invalid, the accuracy cannot be guaranteed. This situation indicates that either the data acquisition and transmission system has broken down or manual actions are frequent. In short, the wind farm is faulty, and the data acquired from it should not be used for research.

The data preprocessing method proposed in this paper can be used for many purposes, not only wind-related applications. The idea of weighted distance can also be used in other outlier or cluster-detection algorithms to develop individual detection algorithms dedicated to specific applications.

REFERENCES

[1] M. Ali, I. Ilie, J. V. Milanovic, and G. Chicco, “Wind farm model aggregation using probabilistic clustering,” IEEE Trans. Power Syst., vol. 28, no. 1, pp. 309–316, Feb. 2013.

[2] M. Schlechtingen, I. F. Santos, and S. Achiche, “Using data-mining approaches for wind turbine power curve monitoring: A comparative study,” IEEE Trans. Sustain. Energy, vol. 4, no. 3, pp. 671–679, Jul. 2013.

[3] S. Kelouwani and K. Agbossou, “Nonlinear model identification of wind turbine with a neural network,” IEEE Trans. Energy Convers., vol. 19, no. 3, pp. 607–612, Sep. 2004.

[4] A. Kusiak and Z. J. Zhang, “Short-horizon prediction of wind power: A data-driven approach,” IEEE Trans. Energy Convers., vol. 25, no. 4, pp. 1112–1122, Dec. 2010.

[5] A. Kusiak, H. Y. Zheng, and Z. Song, “Short-term prediction of wind farm power: A data mining approach,” IEEE Trans. Energy Convers., vol. 24, no. 1, pp. 125–136, Mar. 2009.


[6] K. N. Filho, A. D. P. Lotufo, and C. R. Minussi, “Preprocessing data for short-term load forecasting with a general regression neural network and a moving average filter,” in Proc. IEEE PowerTech Conf., Trondheim, Norway, 2011, pp. 1–7.

[7] P. Kumar, V. K. Chandna, and M. S. Thomas, “Intelligent algorithm for preprocessing multiple data at RTU,” IEEE Trans. Power Syst., vol. 18, no. 4, pp. 1566–1572, Nov. 2003.

[8] G. Noriega and S. Pasupathy, “Adaptive estimation of noise covariance matrices in real-time preprocessing of geophysical data,” IEEE Trans. Geosci. Remote Sens., vol. 35, no. 5, pp. 1146–1159, Sep. 1997.

[9] J. S. Bartunek, M. Nilsson, B. Sallberg, and I. Claesson, “Adaptive fingerprint image enhancement with emphasis on preprocessing of data,” IEEE Trans. Image Process., vol. 22, no. 2, pp. 644–656, Feb. 2013.

[10] M. Fan, V. Vittal, G. T. Heydt, and R. Ayyanar, “Preprocessing uncertain photovoltaic data,” IEEE Trans. Sustain. Energy, vol. 5, no. 1, pp. 351–352, Jan. 2014.

[11] M. Schlechtingen and I. F. Santos, “Comparative analysis of neural network and regression based condition monitoring approaches for wind turbine fault detection,” Mech. Syst. Signal Process., vol. 25, no. 5, pp. 1849–1875, Jul. 2011.

[12] Z. Q. Liu, W. Z. Gao, Y. H. Wan, and E. Muljadi, “Wind power plant prediction by using neural networks,” in Proc. IEEE Energy Convers. Congr. Expo., 2012, pp. 3154–3160.

[13] Z. Y. Ding, “Study of short-term prediction of wind power,” M.S. thesis, Dept. Elect. Eng., South China Univ. Technol., Guangzhou, China, 2012.

[14] M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying density-based local outliers,” in Proc. Int. Conf. Manage. Data, 2000, pp. 93–104.

[15] J. W. Han, M. Kamber, and J. Pei, “Outlier detection,” in Data Mining: Concepts and Techniques, 3rd ed. San Mateo, CA, USA: Morgan Kaufmann, 2011, pp. 544–549.

[16] G. James, D. Witten, T. Hastie, and R. Tibshirani. (2014, Jan. 21). An Introduction to Statistical Learning with Applications in R (1st ed.) [Online]. Available: http://www.springer.com/series/417

[17] A. Kusiak, H. Y. Zheng, and Z. Song, “On-line monitoring of power curves,” Renew. Energy, vol. 34, no. 6, pp. 1487–1493, Jun. 2009.

[18] M. Lydia, A. I. Selvakumar, S. S. Kumar, and G. E. P. Kumar, “Advanced algorithms for wind turbine power curve modeling,” IEEE Trans. Sustain. Energy, vol. 4, no. 3, pp. 827–835, Jul. 2013.

[19] D. A. Spera, Wind Turbine Technology: Fundamental Concepts of Wind Turbine Engineering. New York, NY, USA: ASME, 1994.

[20] B. P. Hayes, I. S. Ilie, A. Porpodas, S. Z. Djokic, and G. Chicco, “Equivalent power curve model of a wind farm based on field measurement data,” in Proc. IEEE PowerTech Conf., Trondheim, Norway, 2011, pp. 1–7.

[21] B. Efron, “Missing data, imputation, and the bootstrap,” J. Amer. Stat. Assoc., vol. 89, no. 426, pp. 463–475, Jun. 1994.

Le Zheng was born in China, in 1989. He received the B.S. degree in electrical engineering from Tsinghua University, Beijing, China, in 2011. He is currently pursuing the Ph.D. degree in electrical engineering at the same university.

His research interests include power system stability and control and large-scale wind energy integration and control.

Wei Hu was born in China, in 1976. He received the B.S. and Ph.D. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1998 and 2002, respectively.

Currently, he is working as an Associate Professor with the Department of Electrical Engineering, Tsinghua University. His research interests include power system modeling and simulation, security analysis, and smart control.

Yong Min was born in China, in 1963. He received the B.S. and Ph.D. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1984 and 1990, respectively.

He is currently a Professor with the Department of Electrical Engineering, Tsinghua University. His research interests include power system stability and control.

Prof. Min is a Fellow of the IET.