1551-3203 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2019.2919268, IEEE Transactions on Industrial Informatics

Stochastic Configuration Networks Based Adaptive Storage Replica Management for Power Big Data Processing

Changqin Huang, Member, IEEE, Qionghao Huang, Student Member, IEEE, Dianhui Wang*, Senior Member, IEEE

Abstract—In the power industry, processing business big data from geographically distributed locations, such as online line-loss analysis, has emerged as an important application. Achieving highly efficient big data storage that meets the requirements of low-latency processing applications is quite challenging. In this paper, we propose a novel adaptive power storage replica management system, named PARMS, based on stochastic configuration networks (SCNs), in which the network traffic and the data center (DC) geo-distribution are taken into consideration to improve real-time data processing. First, the SCN model, a fast learning model with a low computation burden and sound prediction performance, is employed to estimate the traffic state of power data networks. Then, a series of data replica management algorithms is proposed to lessen the effects of limited bandwidths and a fixed underlying infrastructure. Last, the proposed PARMS is implemented using data-parallel computing frameworks (DCFs) for the power industry. Experiments are carried out at CSG, an electric power corporation serving 230 million users, and the results show that our solution handles power big data storage efficiently and reduces job completion times across geo-distributed DCs by 12.19% on average.

Index Terms—Power big data processing, stochastic configuration networks, geo-distributed, cloud storage, data replica optimization.

I. INTRODUCTION

As a key national infrastructure, the power grid is rapidly moving towards automation and intelligence, and with the widespread use of IoT technologies, power grid data takes on typical big data features, such as large volumes growing exponentially [1]. The processing and utilization of big data play an important role in smart power grid development. Among big data applications, near real-time data processing, such as online power line-loss calculation, is the most frequent and challenging. In addition, the underlying network architecture cannot be casually changed, because the infrastructure is relatively fixed [2]. Therefore, based on the existing data storage infrastructure, meeting the requirements of low-latency data processing becomes increasingly significant for intelligent power data applications. In general, there are two major strategies for solving this issue, which originates from limited bandwidths and data center geo-distribution: task scheduling and data replica management [3]. A large body of research has been devoted to task scheduling; however, the performance of these works varies greatly with application type, so the former is not fully appropriate for the variety of power data applications [4], [5]. In contrast, the data replica strategy is emerging as a promising approach that provides underlying support for a diverse array of applications [6]. When utilizing cloud computing with data replicas, some data storage solutions achieve a certain level of performance [7], [8]; however, they are designed in a centralized fashion, in which all data is pulled to a single cluster for processing, rendering them inefficient in many power data applications due to inadequate bandwidths or long latency [5]. Therefore, under data-parallel computing frameworks (DCFs, e.g., MapReduce, Dryad, Spark), developing adaptive storage replica management over geo-distributed data centers (DCs) is a practical and preferable solution for power big data processing [9].

Traffic state prediction is critical for optimizing data storage over distributed networks. A newly developed randomized learner model, termed stochastic configuration networks (SCNs) [10], can learn complex mappings from sampled data quickly with reasonably good modelling performance, compared with neural networks built using the well-known error back-propagation learning algorithm. Therefore, we adopt SCNs to implement this highly time-sensitive network traffic state prediction in a real-time fashion. Building on the fast learning benefits of SCNs, we develop PARMS by taking power data processing characteristics (e.g., cyclicity and instantaneity) into account. Our proposed PARMS has been successfully applied to a real-world application in the China Southern Power Grid (CSG), one of the two biggest power corporations in China, providing electric power supplies for 230 million people across five provinces covering over one million square kilometers.

Changqin Huang is with the Department of Educational Technology, Zhejiang Normal University, Jinhua, China. Qionghao Huang is with the School of Information Technology in Education, South China Normal University, Guangzhou, China. Dianhui Wang is with the Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia. *Corresponding author ([email protected]).

Overall, this paper makes three main contributions:

1) A practical, near real-time, SCN-based network traffic prediction method is presented for time-sensitive scheduling situations in data-parallel computing.

2) Adaptive replica management approaches integrating traffic prediction are proposed to mitigate the effects of limited bandwidth and a fixed underlying infrastructure on power big data processing.

3) A systematic, efficient cloud storage framework is designed and implemented to support power big data applications across geo-distributed DCs using DCFs in the power industry.

The remainder of this paper is organized as follows: Section II reviews the related work, Section III presents the motivation for and an overview of PARMS, Section IV details the proposed algorithms, Section V describes the system implementation, and Section VI reports the experimental results, followed by the conclusion in Section VII.

II. RELATED WORK

With high performance and lower cost, cloud storage is widely used in various fields, including the power industry [2], [7], [8]. Using core technologies in cloud computing, researchers have proposed a storage architecture for monitoring data management and a free-table storage model in the NoSQL-based LaUD-MS system [11]. Jun Zhu et al. [12] proposed a framework-based software approach to address key utility big data storage and processing issues. Yongheng Li et al. [7] designed a Hadoop- and HDFS-based big data system. In addition, cloud storage technologies have been applied to several specific scenarios in the power industry [1], [13], [14].

Most of the research in this area is designed for centralized computing systems, which perform poorly when tasks of DCFs [15] are dispatched across geo-distributed DCs [5], a common scenario in the power industry. To mitigate the consequences of limited bandwidth and to achieve low-latency data access in such scenarios, several job scheduling algorithms have been proposed, such as WANalytics [16], Pixida [17], DESRP [18], Gaia [4], and Flutter [5]. Pixida [17] employs a generalized min-k-cut algorithm to cut the DAG into several parts for execution, with each part executed in one data center. Gaia [4] adopts a new synchronization model to avoid unnecessary global synchronization and thus reduces the traffic across geo-distributed data centers.

In addition to job scheduling, optimizing replica management in distributed DCs is an important solution [3] that is widely used in various domains, such as mobile networks [19], energy-saving management [20], video services [21], social networks [22], and virtual machine management [23]. To address the important placement problem in replica technology, Xili Dai et al. [21] proposed a scheduling mechanism based on a tripartite graph and a k-list algorithm, which optimizes access latencies while maintaining the benefit of low storage cost. Based on a topology-aware heuristic algorithm, and conducting analytical studies and experiments that identify the performance issues of MapReduce in data centers, Incheon Paik et al. [24] presented an optimal replica data placement that minimizes global data access costs. Yi Wang et al. [25] proposed a K-SVD sparse representation technique to improve smart meter data compression ratios and classification accuracy in the electricity industry. For another important problem in replica technology, Nihat Altiparmak et al. [26] presented an optimal replica selection algorithm based on multithreaded, integrated maximum flow to handle heterogeneous storage architectures; compared with a maximum-flow algorithm applied in a black-box manner, this approach avoids a massive amount of unnecessary flow calculations, achieving lower response latency. To reduce data availability time and data access time, Nagarajan et al. [27] developed a replication algorithm that makes decisions on the selection and placement of replicas using multiple criteria, considering parameters such as storage capacity, bandwidth, and the communication cost of geo-distributed sites.

However, most of these works focus on optimization for general domains, with only a few addressing storage scheduling optimization for geo-distributed power big data systems; furthermore, the characteristics of the power industry have received little attention in this research.

III. PARMS: MOTIVATION AND SOLUTION

In this section, we introduce the motivation for PARMS and then give an overview of the proposed solution.

A. Characteristics of the Power Industry

Taking CSG as an example, we describe here some distinct characteristics of big data processing in the power industry, which serve as the motivation for the proposed algorithms and developed applications.

Fig. 1. A brief view of the hierarchical division of CSG: the headquarters, province branches, and city branches.

a) Hierarchicality of Data Centers: As Fig. 1 shows, CSG (http://www.csg.cn) has strict hierarchical administrative and geographic divisions, consisting of the headquarters, province branches, and city branches.

Thus, as Fig. 2 shows, CSG establishes relatively fixed geo-dispersed DCs with the same hierarchy as the geographic division, whose network topology cannot be casually altered due to security or other concerns.

Fig. 2. The geo-distributed DCs (the headquarters, province branches such as Guangdong and Hainan, and city branches such as Shenzhen and Sanya) are connected with private commercial high-speed data links; the topology and bandwidth of the network between DCs are relatively stable.

b) Instantaneity of Data Processing: Data processing tasks are CPU-intensive, memory-intensive, or both. Many power big data applications require real-time operation with strict demands on low latency across geo-distributed DCs, and limited bandwidths are likely to become a bottleneck for such applications. Thus, given the strictly stable structure of the DCs, developing a practical solution that mitigates the impact of limited bandwidths and geo-distributed DCs is an urgent and challenging problem.

c) Cyclicity or Stability of Data Processing: The locations of sensors, the volume of data collected, and the times at which the collected data are transferred are relatively stable. Taking the city of Zhongshan in Guangdong Province as an example, smart meters and other sensors transfer collected data to DCs every 15 minutes, and about 2-3 terabytes of textual data are collected every month. There is thus an obvious cyclicity in the times at which data are collected from sensors. With such potential cyclicity or patterns in data processing, it is possible to develop a solution based on machine learning or artificial intelligence to minimize data motion across geo-distributed DCs.

d) Heterogeneity of Computation and Storage Resources: Due to the high computation and storage demands of power big data processing, many devices with different computation or storage capacities (e.g., CPU speed, IOPS) are continuously deployed to the centers, resulting in significant heterogeneity in the performance of computing and storage servers; the inter-datacenter bandwidths are also heterogeneous. Such heterogeneity causes scalability and extendibility problems for the systems.

B. Overview of PARMS

To achieve low-latency processing of power big data across the relatively fixed geo-distributed DCs, we design a traffic prediction-based adaptive replica management system, PARMS, for power big data processing.

Fig. 3 illustrates the four main components of the PARMS architecture. The tracing daemons and threads in the clusters monitor the system and collect running information for GaExUnit; GaExUnit reprocesses the logs and forwards them to the Intelligent Analysis System for analysis. With the output from the Intelligent Analysis System, the replica management component in GaExUnit runs the algorithms to optimize the placement and selection of replicas, and the Optimizer Daemon executes the resulting instructions.

Fig. 3. The architecture of PARMS: a network traffic prediction-based adaptive replica management system for a geo-distributed cloud storage system in the power industry. (Figure labels include power big data sets: intelligent meter data, line-loss data, and facility data; logs; database; the Intelligent Analysis System; migration; and the replica optimizer.)

A summary of our solution, PARMS, is as follows:

1) We adapt a fuzzy C-means-based device clustering algorithm to deal with the scalability and extendibility problems brought about by the heterogeneity of the computing and storage servers in the power industry (Section IV-A).

2) We propose a practical, near real-time, SCN-based network traffic prediction mechanism that deeply considers characteristics of the power industry (such as hierarchicality, cyclicity, and stability) (Section IV-B).

3) We design power big data processing-oriented replica placement and selection algorithms to accelerate data access across geo-distributed DCs and satisfy the low-latency demands of data processing in the power industry (Sections IV-D and IV-E).

IV. ALGORITHMS FOR PARMS

We introduce clustering algorithms based on fuzzy C-means to classify the capacity of the underlying devices, which provides valuable indexes of the underlying devices for replica management and system management. Based on a learning model and the application-level information of the computing tasks, we employ a network traffic load prediction framework that provides the likely network traffic in the near future for replica management.

In the following, we detail the related algorithms for PARMS.

A. Classifying Underlying Devices

To deal with the scalability and extendibility problems caused by the heterogeneous performance of computing and storage servers, we categorize the computing and storage servers into different logical groups by applying a clustering algorithm based on fuzzy C-means [28].

The power big data system structure can be simply described as a directed graph $G = (V, E)$, where the set of vertices $V = CN \cup SN$, $CN = \{cn_1, \ldots, cn_i, \ldots, cn_{nc}\}$ denotes the computing nodes, $SN = \{sn_1, \ldots, sn_j, \ldots, sn_{ns}\}$ denotes the data storage nodes (also called data nodes), and $E$ represents the transmission network links among the nodes. Assume that there are $n$ computing or data nodes in a system, with each node having $np$ attributes that determine the performance of the node (e.g., CPU speed, IOPS). We denote $pf_{i,k}$ $(1 \le k \le np,\ k \in \mathbb{N})$ as the $k$th attribute of the $i$th node. Thus, all attributes of the $i$th node can be represented as a vector:

$$PF_i = (\lambda_1 pf_{i,1}, \ldots, \lambda_k pf_{i,k}, \ldots, \lambda_{np} pf_{i,np}), \quad PF_i \in \mathbb{R}^{np}, \qquad (1)$$

where $\lambda_k$ is a coefficient that normalizes the range of the $k$th attribute values into $[0, 1]$.

Stacking the attributes of all $n$ computing or storage nodes, we obtain the matrix

$$PF = (\lambda_k pf_{i,k})_{n \times np}, \quad PF \in \mathbb{R}^{n \times np}, \qquad (2)$$

which acts as the input of the clustering algorithm. Its output is

$$LST = \{LST_l \mid l \in [1, c]\}, \quad LCT = \{LCT_l \mid l \in [1, c]\}, \qquad (3)$$

where a subscript of $LCT$ or $LST$ identifies a node cluster and also indicates its capacity to deal with data.
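
Since fuzzy C-means is a standard algorithm, a minimal NumPy sketch is given below of how the normalized attribute matrix PF from Eq. (2) could be partitioned into c logical groups; the cluster count, fuzzifier m, and tolerance are illustrative values rather than the settings used in PARMS.

```python
import numpy as np

def fuzzy_cmeans(PF, c=3, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Group the normalized node-attribute matrix PF (n x np) into c fuzzy
    clusters; returns the cluster centers and the membership matrix U (n x c)."""
    rng = np.random.default_rng(seed)
    n = PF.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # memberships of a node sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ PF) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(PF[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)            # guard against zero distances
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U

# A node's logical group LST_l can then be taken as its highest-membership cluster:
# labels = fuzzy_cmeans(PF, c=3)[1].argmax(axis=1)
```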

B. SCN-Based Network Traffic Prediction

Recently, some researchers have used application-level data access patterns to predict traffic flow in data-parallel computing-based big data processing systems and achieved better flow-size prediction performance than traditional prediction algorithms [15], [29]. As discussed in Section III-A, compared with other big data processing platforms, there are more constraints on the architecture of the DCs, the data collection methods, and the execution of power big data processing tasks, such as hierarchicality, cyclicity, and stability. These characteristics enable us to develop a mechanism that learns a priori knowledge of task execution times when a DAG is provisioned. Therefore, we propose a practical mechanism to predict network traffic over a period of time in the future.

As Fig. 4 shows, the mechanism consists of three parts:

1) An operator execution-time fitting model based on stochastic configuration networks [10].

2) A mechanism that extracts DAG information from data-parallel computing applications and calculates the flow size information out of each stage.

Fig. 4. The mechanism of network traffic prediction. (Figure labels include the DAG extractor, flow calculator, S-pattern, traffic aggregator, stages, flow size, task descriptors (TD) with execution_time and establish_time, logs, query, training, and traffic prediction.)

3) A time series analysis of job execution logs to find certain patterns in the job execution order.

In the following, we detail our framework and provide definitions and the relevant algorithms.

Definition 1. TD (Task Descriptor). A descriptor of a computing task executed by a worker node, denoted as:

$$TD = \langle IS, DT, Pri, WCID, JCID, CPU, Mem, OP \rangle, \qquad (4)$$

where IS is an integer indicating the input size of a computing task, DT is an integer indicating the type of collected data as categorized by CSG, Pri is an integer used for scheduling, CPU and Mem are the computing resources assigned by the scheduler, WCID is the cluster number from the clustering algorithm, JCID is an integer indicating whether a computing task is processor-intensive, memory-intensive, or I/O-intensive, and OP is a code number representing the operator (e.g., map(), union()) applied to the data.

Definition 2. Task Event. A descriptor of an operator and the volume of data to be processed, denoted as $en_k$. The set of task events with respect to the operators provided by DCFs forms an event set $\varepsilon = \{en_k\}$, $k = 1, \ldots, ne$.

A task event $en_k$ can be represented by the tuple $(OP, IS)$, consisting of an operator and the size of its input data. The state of a task event running in a worker at one time is denoted as $RTE(EN)$, where $EN$ consists of the task event and its duration, $EN = \{(en_k, t_k)\}$. In particular, the value $t_k$ is zero when the task event is complete or null. The state of worker node $cn_i$ at time $t$ can thus be denoted as:

$$S_{i,t} = RTE(i, (en_k, t_{k,i}), \ldots, (en_{ne}, t_{ne,i})). \qquad (5)$$

The fitting model accepts the changes in these states that cause the fluctuation in task execution time. The change of the variable values in Eqn. (5) over a time interval is calculated as:

$$\Delta S_{i,\Delta t} = ((S_{i,t} \ominus S_{i,t'}),\ t) = (RTE(i, (en_1, \Delta t_{i,1}), \ldots, (en_{ne}, \Delta t_{i,ne})),\ t), \qquad (6)$$

where $\Delta t$ denotes the time left to finish a task event.

With the above definitions and formulation, we now introduce our fitting model based on SCNs.

Compared with traditional machine learning methods (e.g., BP and RBF neural networks, SVM) [30] or other deep learning models (e.g., LSTM [31]), an SCN introduces little overhead to the system while achieving reliable prediction results, which we demonstrate in the experiments. The input and output of the fitting model $f_i$ are simplified as:

$$X_{i,t} = (t, TE, \Delta S_{i,\Delta t}), \quad Y_{i,t} = t_{TE}. \qquad (7)$$
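
For readers unfamiliar with SCNs, the following is a hedged NumPy sketch (not the authors' implementation) of how such a fitting model can be built incrementally, following the idea of [10]: random candidate hidden nodes are accepted only if they satisfy the stochastic configuration inequality, and the output weights are then recomputed by global least squares.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_scn(X, Y, L_max=100, T_max=50, lambdas=(0.5, 1, 5, 10, 50), r=0.99,
              tol=1e-3, seed=0):
    """Incrementally build an SCN-style fitting model on data X (n x d), Y (n x q)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Y = Y.reshape(n, -1)
    H = np.empty((n, 0))                      # outputs of accepted hidden nodes
    W, B = [], []                             # accepted input weights / biases
    beta = np.zeros((0, Y.shape[1]))
    e = Y.copy()                              # current residual error
    for _ in range(L_max):
        best, best_xi = None, 0.0
        for lam in lambdas:                   # enlarge the random scope if needed
            for _ in range(T_max):
                w = rng.uniform(-lam, lam, size=d)
                b = rng.uniform(-lam, lam)
                h = sigmoid(X @ w + b)
                # stochastic configuration inequality (summed over outputs)
                xi = np.sum((e.T @ h) ** 2) / (h @ h) - (1.0 - r) * np.sum(e * e)
                if xi > best_xi:
                    best_xi, best = xi, (w, b, h)
            if best is not None:
                break                         # an admissible node was found
        if best is None:
            break                             # no admissible candidate anywhere
        w, b, h = best
        W.append(w); B.append(b)
        H = np.column_stack([H, h])
        beta, *_ = np.linalg.lstsq(H, Y, rcond=None)   # global least squares
        e = Y - H @ beta
        if np.sqrt(np.mean(e ** 2)) < tol:
            break
    return np.array(W), np.array(B), beta

def predict_scn(X, W, B, beta):
    return sigmoid(X @ W.T + B) @ beta
```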

We use a mechanism similar to that in [15], which serves as the second part of our prediction framework. In addition to providing operator logs for the fitting model, it outputs a tri-tuple (source, destination, flow size). With this information, we use the prediction results from the SCN model to predict the time at which each flow is dumped into the real-world network.

To mine the certain or potential cyclicity of the power industry discussed in Section III-A, we employ simple but efficient sequence pattern mining algorithms, which serve as the third part of our network traffic prediction framework.
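
The paper leaves the concrete mining algorithm unspecified; purely as an illustration, a dominant collection cycle (e.g., the 15-minute meter uploads noted in Section III-A) could be estimated from job-arrival timestamps as follows, where the bin width is an assumed parameter.

```python
import numpy as np

def dominant_cycle(timestamps, bin_width=60.0):
    """Return the most common inter-arrival gap (in seconds), rounded to
    bin_width, as a crude estimate of the job-submission cycle."""
    gaps = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    binned = np.round(gaps / bin_width) * bin_width
    values, counts = np.unique(binned, return_counts=True)
    return values[np.argmax(counts)]

# e.g., meter-upload jobs arriving roughly every 900 s would yield ~900.0
```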

C. Replica Management

As an essential part of software systems for cloud storage, replica technologies play a decisive role in improving concurrent access and the reliability and availability of data [32].

Replica management consists of replica generation, replica deletion, replica placement, and replica selection. Several important factors exert significant influence on the performance of replica management, such as the capacity of the replica hosts, the network state of the replica locations, and the characteristics of the data-related applications (such as data queries and data analysis).

In this part, we give a brief introduction to the main components of replica management in PARMS. In this paper, we detail the relevant algorithms and mechanisms for replica placement (Section IV-D) and replica selection (Section IV-E), which are specially designed for the power industry. For replica generation and replica deletion, we make some alterations to our previous work [33] for use in this work.

a) Replica Generation: Power cloud storage systems are highly dynamic computing environments, and the data blocks of different data vary greatly in access patterns and hotness. Based on the replica generation mechanism in our earlier work [33], we update the definitions of ideal access popularity and actual access popularity as well as their computation methods. After this alteration, the component provides the others with an API to generate appropriate data blocks for newly arriving or existing data in power cloud storage systems.

b) Replica Deletion: After the system runs for a certain amount of time, some outdated replicas may exist that should be freed to optimize data storage and network transmission. There are several ways to delete outdated replicas, such as delayed deletion, offline deletion, and marked deletion [34]. Before deletion, the system needs a mechanism to decide which kinds of replicas are outdated; hence we slightly modified our previous work in [33] to achieve replica deletion in PARMS.

c) Replica Placement: Based on the results of the SCN-based network traffic prediction (Section IV-B), the data access popularity calculations, and other important factors, we design a dynamic replica placement algorithm to optimize data storage and network transmission in the power industry. The relevant definitions and algorithms are detailed in Section IV-D.

d) Replica Selection: Running computing tasks on power big data processing platforms is a typical multi-task scenario over geo-distributed DCs, and the preferences of power big data processing tasks differ greatly from each other. We develop a practical framework that transforms this into a multi-objective optimization problem. The relevant definitions and algorithms are presented in Section IV-E.

D. Replica Placement

As mentioned in Section III-A, there are potential patterns or a certain cyclicity in data-parallel computing across the relatively fixed geo-distributed DCs of a power big data system. The frequency of data access differs greatly among applications and results in degrees of hotness or coldness for the data blocks. Therefore, we need to make optimal decisions on where to place replicas and how to select replicas, by considering replica factors and storage locations.

Data blocks with the same access frequency may have different popularities, which vary according to the computing tasks. Each data block and its replicas are associated with a queue of timestamps that record the values of their access popularity. Specifically, we calculate the popularity as:

$$heat_0(b_i) = 0, \qquad rf = k \times R_t + f \times F_t,$$
$$heat_{t+1}(b_i) = \frac{heat_t(b_i) \times \big(\log_2(e^{\lambda(T_{t+1}-T_t)})\big)^{-2} + rf}{Z}, \qquad (8)$$

where $heat_{t+1}(b_i)$ is the updated value of the access popularity of data block $b_i$ at time $t+1$, the attenuation function $\big(\log_2(e^{\lambda(T_{t+1}-T_t)})\big)^{-2}$ deteriorates the replica's access popularity over time with a cooling coefficient $\lambda$, $k$ and $f$ are consonant coefficients with $k \in (0, 1)$ and $f \in (-1, 1)$, $R_t$ is the number of accesses at time $t$, $F_t$ is the number of accesses likely to occur in the forecasting sequence of I/O events over a period from SEQS, and $Z$ is a normalization factor.
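
As a concrete reading of Eq. (8), the sketch below performs one popularity update; the coefficient values are illustrative placeholders rather than the ones used in PARMS, and the timestamps are assumed strictly increasing so the attenuation term is well defined.

```python
import math

def update_heat(heat_prev, T_prev, T_next, R_t, F_t,
                lam=0.05, k=0.5, f=0.3, Z=1.0):
    """One step of the Eq. (8) popularity update (illustrative coefficients)."""
    rf = k * R_t + f * F_t                               # recent + forecast accesses
    # attenuation term (log2(e^{lambda * dT}))^{-2}; requires T_next > T_prev
    decay = math.log2(math.exp(lam * (T_next - T_prev))) ** -2
    return (heat_prev * decay + rf) / Z

# Example: a block accessed 12 times in the last interval, with 8 forecast
# accesses, updated over a 15-minute gap (expressed in minutes).
heat = update_heat(heat_prev=0.0, T_prev=0, T_next=15, R_t=12, F_t=8)
```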

A replication factor is assigned to each data block in the history logs of SEQS by using maximum likelihood estimation. After the access popularity (also called hotness) of the data blocks is calculated, their relationship is given as follows:

$$Rep(b_i) = heat_T(b_i \mid \hat{\theta}), \qquad (9)$$

where $Rep(b_i)$ is the replication factor of $b_i$, and $heat_T(b_i)$ is the access popularity of $b_i$ at time $T$.

We use maximum likelihood to estimate the parameter $\theta \in \Theta$:

$$\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \ln\big(heat_T(b_i \mid \theta)\big). \qquad (10)$$

Therefore, the replication factor of each data block in a big data cloud storage system can be specified in a granular way.

Algorithm 1 Dynamic Replica Placement
Input: LST, SN /*a set of n data nodes*/, B /*a set of m data blocks*/, CN /*a set of g computing nodes*/
Output: RP[n][m] /*the matrix representing the layout of m data blocks on n data nodes*/
1: for l from 1 to c do
2:   LST_l = {sn^l_j | j in [1, Size(LST_l)]} <- SBI(LST_l);
3:   S(LST_l) <- SumFreeSpace(all sn^l_j in LST_l);
4: end for // Sort the data nodes within the same cluster and sum their free space.
5: for i from 1 to m do
6:   DS_pi = {b^pi_i | pi in [1, p]} <- SortByTaskPref(b_i);
7: end for // Categorize data blocks by the preference of geo-distributed computing tasks.
8: for pi from 1 to p do
9:   H_pi = {Heat(b_i) | b_i in DS_pi, i in [1, Size(DS_pi)]} <- GetHotness(DS_pi);
10:  DS_pi = {b_i | i in [1, Size(DS_pi)]} <- SBH(H_pi);
11: end for // Sort the data blocks within the same DS_pi by their hotness.
12: RF = {Rep(b_i) | i in [1, m]} <- (H, B);
13: L = {L_tau,t | t in [1, T]} <- DoForecast(f_tau); // Forecast the network traffic load.
14: for pi from 1 to p do
15:   for i from 1 to Size(DS_pi) do
16:     CanN <- SelectNode(SN, LST); // Obtain the set of candidate data nodes.
17:     FNode <- EstimateNetLoad(CanN, L_tau,t, Failover);
18:     Update(RP[n][m], FNode);
19:   end for
20: end for
21: return RP[n][m]

Thus, we have described our prediction model, the logical groups of storage servers, the calculation of access popularity, and the replication factors. We now present a dynamic replica placement algorithm that increases the system throughput and data transfer rate by optimizing the usage of network transmission across geo-distributed DCs; the procedure is given in Algorithm 1.
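
A simplified Python rendering of the placement loop of Algorithm 1 is sketched below; the helper names (hotness, rep_factor, forecast_load, free_space) are hypothetical stand-ins for the components described above, and failover handling and exact tie-breaking are omitted.

```python
def place_replicas(blocks, data_nodes, clusters, hotness, rep_factor,
                   forecast_load, free_space):
    """Simplified sketch of Algorithm 1: hotter blocks with higher replication
    factors are placed first, each replica going to the candidate node with the
    lowest forecast network load that still has free space."""
    placement = {b: [] for b in blocks}
    # Hotter blocks first, mirroring the SBH sort in Algorithm 1.
    for b in sorted(blocks, key=lambda blk: hotness[blk], reverse=True):
        candidates = [n for n in data_nodes if free_space[n] > 0]
        # Prefer better-capacity logical clusters (smaller LST index), then
        # lower forecast network load.
        candidates.sort(key=lambda n: (clusters[n], forecast_load(n)))
        for n in candidates[:rep_factor[b]]:
            placement[b].append(n)
            free_space[n] -= 1          # coarse accounting of one block slot
    return placement
```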

E. Replica Selection

After replica placement, selecting the best replica to satisfy the instantaneity demands of data processing discussed in Section III-A is a challenging problem in a multi-task scenario. To measure the serviceability of a replica, we select three important metrics: response time, network traffic load, and reliability. They are weighted according to the different QoS requirements of data access for a given computing task, that is, $w = (w_1, w_2, w_3)$, where $w_1 + w_2 + w_3 = 1$ and $0 < w_i < 1$, $i = 1, 2, 3$.

a) Matrix of Selection: a possible replica selection, denoted as $PM$. Assume the computing nodes of a given computing task that request the same replica form a set $RC = \{rc_1, rc_2, \ldots, rc_{nrc}\}$ and the data nodes that keep the replica form a set $RS = \{rs_1, rs_2, \ldots, rs_{nrs}\}$. The $PM$ of the computing nodes and the replicas of the data block can be described as follows:

$$PM = RC^{T} RS = (pm_{i,j})_{nrc \times nrs}, \qquad (11)$$

where $pm_{i,j} = 1$ means that computing node $rc_i$ requests replica $rs_j$ through data node $j$, and $pm_{i,j} = 0$ $(1 \le i \le nrc,\ 1 \le j \le nrs)$ otherwise. Since only one replica serves a request at one time, we have $\sum_{j=1}^{nrs} pm_{i,j} = 1$ for $1 \le i \le nrc$.

transmission between nodes is mainly determined by thecapacity of transmission network between them. vi,j iscalculated by the historical network throughout NT, thestate of running network NV, and the IPOS of storageservers L:

vi,j =1

α′ × NT + β

′ × NV + γ′ × L

, (12)

where coefficients α′, β′, and γ

′are obtained by measur-

ing the correlations between NT, NV, and L from historylogs and current running information. Thus, the metricmatrix of QoS1 can be written as:

QoS1 = (vi,j)nrc×nrs . (13)

c) QoS2 of network traffic load: The network traffic load between nodes across geo-distributed DCs is also an important factor in selecting a replica. $nl_{i,j}$ is an indicator of the network traffic load determined by the current traffic load $KNL$ and the load $FNL$ predicted by the fitting model $f_\tau$:

$$nl_{i,j} = \frac{1}{\mu \times KNL + (1-\mu) \times FNL}, \qquad (14)$$

where $\mu$ $(0 \le \mu \le 1)$ is set by examining history logs. Therefore, the metric matrix QoS2 is as follows:

$$QoS_2 = (nl_{i,j})_{nrc \times nrs}. \qquad (15)$$

d) QoS3 of reliability: The high reliability of data access is essential to successfully completing computing tasks, especially those with high priority and high frequency. $rs^r_j$ measures the reliability, which may involve the quantitative stability of the hosting node, denoted as $HP$, the service duration $t_{srv}$, the creation time $t_{gen}$ of the replica, and the current system time $t_{now}$. It is calculated as:

$$rs^r_j = \rho \times HP + w \times \frac{t_{srv}}{t_{now} - t_{gen}}. \qquad (16)$$

Therefore, the QoS3 of reliability can be formed as a vector:

$$QoS_3 = (rs^r_1, rs^r_2, \ldots, rs^r_{nrs}). \qquad (17)$$

e) Objective functions F1, F2, and F3: The different values of $PM_{nrc \times nrs}$ can be regarded as different possible replica selections. The values of QoS1, QoS2, and QoS3 for each $PM_{nrc \times nrs}$ are represented by

$$F_1(PM) = e\,(PM \odot QoS_1)\,e^{T},$$
$$F_2(PM) = e\,(PM \odot QoS_2)\,e^{T},$$
$$F_3(PM) = e\,(PM \odot QoS_3)\,e^{T}, \qquad (18)$$

where $e$ is a vector of all ones, $PM$ is shorthand for $PM_{nrc \times nrs}$, $F_1(PM)$ is the value of QoS1 for $PM_{nrc \times nrs}$, and similarly for $F_2(PM)$ and $F_3(PM)$. The operator $\odot$ denotes the Hadamard product.

Algorithm 2 Replica Selection for Power Big Data Cloud Storage
Input: W /*the QoS weights*/
Output: PM_optimal
1: (w1, w2, w3) <- W;
2: PM <- RC^T RS; // Initialize PM randomly.
3: QoS1 <- (v_{i,j}); // The configuration of QoS1.
4: QoS2 <- (nl_{i,j}); // The configuration of QoS2.
5: QoS3 <- (rs^r_j); // The configuration of QoS3.
6: F(PM) <- (F1, F2, F3) ⊙ W(PM);
7: PM_optimal <- Solve(max F(PM));
8: return PM_optimal

It is desirable to select replicas so that each objective function reaches its maximum for a certain value of $PM$. Further, different applications can set different preferences for $QoS_i$ $(i = 1, 2, 3)$ by specifying $w_i$. Thus, the overall objective function for replica selection is obtained as:

$$F(PM) = F_1(w_1 \odot PM) + F_2(w_2 \odot PM) + F_3(w_3 \odot PM), \qquad (19)$$

where $W = (w_{i,j})_{nrc \times 3}$ and $w_j = (w_{i,j})_{nrc \times 1}$ $(j = 1, 2, 3)$.

Maximizing $F(PM)$ finds the best selection matrix $PM_{optimal}$ for the objective function. Thus, the solution to replica selection is to find an approximate optimal solution to $\max F(PM)$; we implement a parallel genetic algorithm (kGA) on the CUDA architecture described in Petr's work [35] to solve the problem. The whole procedure is given in Algorithm 2.
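
To make Eqs. (18) and (19) concrete, the following NumPy sketch scores a candidate selection matrix; the QoS matrices and weights are illustrative inputs, and a random search stands in for the CUDA-based genetic algorithm of [35].

```python
import numpy as np

def score(PM, QoS1, QoS2, QoS3, w):
    """Weighted objective of Eqs. (18)-(19) for a 0/1 selection matrix PM
    (one chosen replica per requesting node, i.e. one 1 per row)."""
    F1 = np.sum(PM * QoS1)                    # e (PM ⊙ QoS1) e^T
    F2 = np.sum(PM * QoS2)                    # e (PM ⊙ QoS2) e^T
    F3 = np.sum(PM * QoS3[None, :])           # per-replica reliability, broadcast over rows
    return w[0] * F1 + w[1] * F2 + w[2] * F3

def random_search(QoS1, QoS2, QoS3, w, iters=1000, seed=0):
    """Placeholder for the GA solver: sample row-wise one-hot PMs, keep the best."""
    rng = np.random.default_rng(seed)
    nrc, nrs = QoS1.shape
    best_pm, best_val = None, -np.inf
    for _ in range(iters):
        PM = np.zeros((nrc, nrs))
        PM[np.arange(nrc), rng.integers(0, nrs, size=nrc)] = 1.0
        val = score(PM, QoS1, QoS2, QoS3, w)
        if val > best_val:
            best_val, best_pm = val, PM
    return best_pm, best_val
```

In this simplified per-entry form the rows of PM are independent, so a row-wise argmax would already be optimal; the random search is kept only to mirror the search-based formulation that the paper solves with a genetic algorithm.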

V. SYSTEM IMPLEMENTATION

We now describe the implementation of the main modules of PARMS. We employ Ganglia [36] for system monitoring, which can run customized scripts with very low per-node overhead and high concurrency, and integrate it with the Spark and Hadoop monitoring APIs. We also develop customized scripts to collect monitoring data as inputs to our algorithms, such as I/O event logs and the QoS preferences of computing tasks. The collected information is managed by PostgreSQL (9.4).

The implementation of the replica management policies mainly comprises two parts: replica placement and replica selection. We incorporate our replica placement and selection inside the Hadoop Distributed File System (HDFS 2.6.0). Table I shows the main APIs of the replica management policies; the Java version is 1.7.0_55. The APIs also work in later versions such as 2.7.4 and 2.8.0 of Hadoop HDFS and the Kudu 1.4.0 client. A Python-based client scanner API will be implemented soon.

TABLE I
REPLICA MANAGEMENT POLICY APIS

Method                                                                        | Caller or Setter
public class DynamicBlockPlacementPolicy extends BlockPlacementPolicyDefault  | dfs.block.replicator.classname
public static final ReplicaSelection QOS_PREF                                 | client.scanner

For the convenience of operators, we adopt both the C/S and B/S manners to develop a GUI-based client (Microsoft Windows platform, Java, Web) and a command-line based client (mainly used in Linux, shell). The GUI-based client provides operators with a friendly graphical interface to manage PARMS. Fig. 5 shows a screenshot of the PARMS management client in Windows 7.

Fig. 5. Screenshots of PARMS: (a) the C/S client and (b) the B/S client.

VI. PERFORMANCE EVALUATION

This section describes the experimental platform and the configuration of our experiments, and reports the experimental results.

A. Experiment Platform

We conduct the experiments on the geo-distributed power big data processing system of CSG. For the implementation and specification of the power big data processing platform in CSG, readers can refer to our published paper [37]; a common latency-aware task scheduling strategy is employed to dispatch data-parallel computing tasks across geo-distributed DCs [3]. Table II shows the available bandwidths between nodes across the geographically dispersed DCs in the experiments.

The computing tasks in the experiments are routine tasks or data mining programs in the power big data systems of CSG. Examples are real-time line-loss calculation, analysis of the power consumption behavior of consumers, and abnormality monitoring and alarming in power consumption.

TABLE II
GEO-DISTRIBUTED DATA CENTERS OF CSG FOR PARMS EVALUATION. L1 REPRESENTS THE HEADQUARTERS, L2 PROVINCE BRANCHES, AND L3 CITY BRANCHES.

                      HQ (L1)    GD Province (L2)   Zhongshan City (L3)   Foshan City (L3)
HQ (L1)               1.5 Gbps   136 Mbps           -                     -
GD (L2)               -          1.2 Gbps           -                     -
Zhongshan City (L3)   -          85 Mbps            1.01 Gbps             43.5 Mbps
Foshan City (L3)      -          78 Mbps            -                     1.1 Gbps

TABLE III
CONFIGURATION OF GEO-DISTRIBUTED TASKS

Item    Application                  Configuration
Task A  Line-loss Calculation        Spark, 102 GB
Task B  Power Consumption Analysis   Hadoop, 482 GB
Task C  WordCount                    Spark, 40 GB
Task D  GraphX                       Spark, 4,847,571 nodes, 68,993,773 edges

The volume of data for the computing tasks on the experimental platform is about 550 GB, drawn from different geo-distributed cloud systems in CSG, and some open datasets (WordCount, GraphX) are also employed to test the overhead that PARMS introduces to the systems (as shown in Table III).

TABLE IV
TRAINING PARAMETERS OF LSTM IN KERAS

Para.      Conf.        Para.       Conf.
model      Sequential   add1        LSTM
add2       Dense(1)     dropout     0.2
batchsize  86           epochs      50
loss       mae          optimizer   adam

TABLE V
TRAINING PARAMETERS OF SCN

Para.      Conf.   Para.     Conf.
L_max      250     tol       0.001
T_max      600     Lambdas   [0.5, 1, ..., 250]
batchsize  1       r         [0.9, 0.99, ..., 0.999999]
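
For reference, the Table V settings would map onto the earlier train_scn sketch roughly as follows; train_scn and predict_scn are the hypothetical helpers sketched in Section IV-B, X_train, Y_train, and X_test are placeholders for the task-state features and execution times, and the list of r values is collapsed to its largest entry because the sketch uses a single fixed r.

```python
import numpy as np

# Illustrative mapping of the Table V parameters onto the train_scn sketch.
W, B, beta = train_scn(
    X_train, Y_train,
    L_max=250,                             # maximum number of hidden nodes
    T_max=600,                             # candidates drawn per scope lambda
    lambdas=np.arange(0.5, 250.5, 0.5),    # [0.5, 1, ..., 250]
    r=0.999999,                            # largest value of the r schedule
    tol=0.001,
)
Y_pred = predict_scn(X_test, W, B, beta)
```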

B. Evaluation of the SCN-Based Fitting Model

A practical traffic prediction algorithm is critical for implementing the replica management policies. As discussed in Section IV-B, the key to predicting traffic in DCFs is information about the execution times of tasks. An experiment is conducted to evaluate the performance of the proposed SCN-based fitting model against a competitive model based on LSTM. The configuration of the training parameters is shown in Tables IV and V. As shown in Fig. 6, the predictions from the SCN-based fitting model are better than those of the LSTM-based one under the same computing resources and time limit. Even if the fitting model malfunctions due to significant changes in the fitted mappings, it costs little to rebuild a functional one.

C. Evaluation of Read Latency by Comparing Different Policies

This experiment evaluates the read latency over geo-distributed DCs reduced by the proposed replica management policies. Fig. 7 plots the reading times for different sizes of data under three kinds of replica management policies.

Fig. 6. The performance of the prediction of task execution times by different fitting models: (a) the loss of the LSTM-based fitting model during training; (b) the execution times predicted by the LSTM-based model; (c) the training RMSE of the SCN-based fitting model; (d) the (normalized) execution times predicted by the SCN-based model against the test targets.

From the measurement data of latency among the nodes, both the default HDFS policy and our proposed policy outperform the system without any policy. Our replica management policies achieve better performance, though the reading time still increases at a linear rate. This is, however, inevitable for data transfer, as it is limited by the network bandwidth and disk transfer rate. Overall, our algorithm is more effective for data access in the geo-distributed DCs and is therefore more suitable for power big data processing across geo-distributed DCs.

Fig. 7. The reading time (seconds) for data of different sizes (MB) under three kinds of replica management policies (lower is better). PARMS-none denotes the system without any replica management policy, PARMS-hdfs the default HDFS policy, and PARMS our approach.

D. Performance Comparisons of Dynamic Replica Placement Strategies

This experiment evaluates the enhancement in the execution of geo-distributed computing tasks achieved by using the replica placement strategy and the classification of data nodes.

Fig. 8. The performance of the replica placement strategies; our approach is denoted as dynamic-Strategy and the default as default-Strategy: (a) the average execution time of jobs versus the number of jobs; (b) job completion time on DCFs for the overhead test (pure HDFS versus HDFS with PARMS, for Task C WordCount and Task D GraphX); (c) the IOPS ratio of the LSTs over different time periods; (d) the throughput ratio of the LSTs.

The performance is measured by the average execution time of jobs. In the experiment, the data nodes are categorized into three logical storage areas, LST1, LST2, and LST3 (the smaller the subscript, the better the performance of the associated nodes). As Fig. 8a shows, the dynamic-Strategy yields a lower average running time for executing jobs than the default-Strategy. From Fig. 8b, we find that the extra time introduced by PARMS is negligible.

We also collected the IOPS and I/O throughput information of the logical storage areas; Fig. 8c and Fig. 8d show the statistical results. LST1 accounts for approximately 62%, LST2 for 30%, and LST3 for only 8% of both IOPS and I/O throughput on average.

E. Performance Comparisons of Replica Selection Strategies

This experiment validates whether the replica selection policy can satisfy the diverse data access demands of data processing across geo-distributed DCs. The related QoS-based replica selection in [38] is employed for comparison: the performance of our algorithm is compared in terms of JPET and ENS [38]. From Fig. 9a, QoS-RS requires about three quarters of the execution time of default-RS and achieves an obvious improvement in ENS over default-RS, a better result than mr-QoS. Furthermore, we also report the fluctuation of ENS for the algorithms within a certain period of the system runtime, as shown in Fig. 9b. QoS-RS clearly utilizes the network better than df-RS and mr-QoS, though it may exhibit obvious jitter.

Fig. 9. The performance of the replica selection strategies; the default policy is denoted as df-RS, the related QoS-based policy as mr-QoS, and our proposed QoS-based policy as QoS-RS: (a) the ratio between selection policies for JPET and ENS; (b) the ENS (%) of the selection policies during the system runtime.

F. Job Completion Time Improvement Using PARMS

With the predicted information on the network traffic, PARMS can optimize replica placement and selection to improve the efficiency of data access and reduce job completion time. Fig. 10 plots the completion times of jobs across geo-distributed DCs. Fig. 11 plots the completion time of Task C for different data sizes across geo-distributed DCs, which shows that the proposed replica management system is more capable of dealing with power big data than the others.

VII. CONCLUSION

With the rapid development of the smart power grid, the real-time processing of power big data is becoming increasingly important.

Fig. 10. The average completion times of three kinds of tasks (Task A, 12 GB; Task B, 18 GB; Task C, 15 GB) within the system under the original and optimized settings of the replica policies. The job completion times are reduced by 11.82% to 12.56% after conducting an optimization using PARMS.

Fig. 11. The percentage of reduced average completion time of Task C for different data sizes within the system under the original and optimized settings of the replica policies. The job completion times are still reduced by a reasonable percentage as the data size of the tasks grows after conducting an optimization using PARMS.

To achieve low-latency processing under the conditions of limited bandwidths and a relatively fixed underlying infrastructure, we designed and implemented an adaptive replica management system, PARMS, for geo-distributed power big data storage. Based on a near real-time SCN-based mechanism that predicts network traffic, efficient replica management approaches are designed to optimize replica placement and selection. A series of experiments was conducted on the platform of an electric power corporation, CSG. The experimental results show that our replica management policies relieve the network transmission bottleneck to a certain degree and increase the computing throughput of geo-distributed power big data systems. The job completion times across geo-distributed DCs are reduced by 12.19% on average when using PARMS.

Our future work will develop adaptive replica generation and deletion mechanisms for PARMS, and further integrate the replica management policies with geo-distributed task scheduling.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (No. 61877020), the Science and Technology Projects of Guangdong Province, China (No. 2014B010117007, 2018B010109002), and the Science and Technology Project of Guangzhou Municipality, China (No. 201904010393).

REFERENCES

[1] D. He, N. Kumar, S. Zeadally, and H. Wang, "Certificateless provable data possession scheme for cloud-based smart grid data management systems," IEEE Transactions on Industrial Informatics, vol. 14, no. 3, pp. 1232–1241, Mar. 2018.

[2] L. Jiang, L. D. Xu, H. Cai, Z. Jiang, F. Bu, and B. Xu, "An IoT-oriented data storage framework in cloud computing platform," IEEE Transactions on Industrial Informatics, vol. 10, no. 2, pp. 1443–1451, Feb. 2014.

[3] Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, P. Bahl, and I. Stoica, "Low latency geo-distributed data analytics," in Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication. ACM, Aug. 2015, pp. 421–434.

[4] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O. Mutlu, "Gaia: Geo-distributed machine learning approaching LAN speeds," in Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation, Mar. 2017, pp. 629–647.

[5] Z. Hu, B. Li, and J. Luo, "Time- and cost-efficient task scheduling across geo-distributed data centers," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 3, pp. 705–718, Mar. 2018.

[6] S. Sun, W. Yao, and X. Li, "DARS: A dynamic adaptive replica strategy under high load Cloud-P2P," Future Generation Computer Systems, vol. 78, pp. 31–40, Jan. 2018.

[7] Y. Li, Y. Wang, and L. Jin, "Design of electric power data management system based on Hadoop," in Proceedings of the 4th International Conference on Machinery, Materials and Information Technology Applications. Atlantis Press, Jan. 2017, pp. 1090–1093.

[8] K. Jia, Y. Chen, T. Bi, Y. Lin, D. Thomas, and M. Sumner, "Historical-data-based energy management in a microgrid with a hybrid energy storage system," IEEE Transactions on Industrial Informatics, vol. 13, no. 5, pp. 2597–2605, Oct. 2017.

[9] W. Xiao, W. Bao, X. Zhu, and L. Liu, "Cost-aware big data processing across geo-distributed datacenters," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 11, pp. 3114–3127, Nov. 2017.

[10] D. Wang and M. Li, "Stochastic configuration networks: Fundamentals and algorithms," IEEE Transactions on Cybernetics, vol. 47, no. 10, pp. 3466–3479, Oct. 2017.

[11] Y. Zhong, X. D. Huang, and D. Liu, "NoSQL storage solution for massive equipment monitoring data management," Computer Integrated Manufacturing Systems, vol. 12, pp. 3008–3016, Nov. 2013.

[12] J. Zhu, E. Zhuang, J. Fu, J. Baranowski, A. Ford, and J. Shen, "A framework-based approach to utility big data analytics," IEEE Transactions on Power Systems, vol. 31, no. 3, pp. 2455–2462, Aug. 2016.

[13] J. Chen, N. Liu, Y. Chen, and W. Qiu, "Power big data platform based on Hadoop technology," in Proceedings of the 6th International Conference on Machinery, Materials, Environment, Biotechnology and Computer. Atlantis Press, Jan. 2016, pp. 571–576.

[14] M. H. Yaghmaee, M. Moghaddassian, and A. Leon-Garcia, "Autonomous two-tier cloud-based demand side management approach with microgrid," IEEE Transactions on Industrial Informatics, vol. 13, no. 3, pp. 1109–1120, Oct. 2017.

[15] H. Wang, L. Chen, K. Chen, Z. Li, Y. Zhang, H. Guan, Z. Qi, D. Li, and Y. Geng, "FLOWPROPHET: Generic and accurate traffic prediction for data-parallel cluster computing," in Proceedings of the IEEE 35th International Conference on Distributed Computing Systems, Jun. 2015, pp. 349–358.

[16] A. Vulimiri, C. Curino, P. B. Godfrey, T. Jungblut, K. Karanasos, J. Padhye, and G. Varghese, "WANalytics: Geo-distributed analytics for a data intensive world," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, May 2015, pp. 1087–1092.

[17] K. Kloudas, M. Mamede, N. M. Preguica, and R. Rodrigues, "Pixida: Optimizing data parallel jobs in wide-area data analytics," Very Large Data Bases, vol. 9, no. 2, pp. 72–83, Oct. 2015.

[18] Y. Liu, W. Wei, and R. Zhang, "DESRP: An efficient differential evolution algorithm for stochastic demand-oriented resource placement in heterogeneous clouds," Future Generation Computer Systems, vol. 88, pp. 234–242, May 2018.

[19] K. Muralidhar and N. P. Bharathi, "An optimal replica relocation scheme for improving service availability in mobile ad hoc networks," in Proceedings of 2014 IEEE International Conference on

Page 11: Stochastic Configuration Networks Based Adaptive Storage … · 2020. 4. 22. · data applications, near real-time data processing, such as online power line-loss calculation, is

1551-3203 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2019.2919268, IEEETransactions on Industrial Informatics

11

Computational Intelligence and Computing Research. IEEE, Dec.2014, pp. 1–4.

[20] S. Long, Y. Zhao, and W. Chen, “A three-phase energy-saving s-trategy for cloud storage systems,” Journal of Systems and Software,vol. 87, pp. 38–47, Jan. 2014.

[21] X. Dai, X. Wang, and N. Liu, “Optimal scheduling of dataintensive applications in cloud based video distribution services,”IEEE Transactions on Circuits & Systems for Video Technology, vol. 27,no. 99, pp. 73–83, May 2017.

[22] A. De Salve, B. Guidi, P. Mori, L. Ricci, and V. Ambriola, “Privacyand temporal aware allocation of data in decentralized onlinesocial networks,” in Proceedings of 12th International Conference onGreen, Pervasive, and Cloud Computing. Springer, May 2017, pp.237–251.

[23] C. Guerrero, I. Lera, B. Bermejo, and C. Juiz, “Multi-objective op-timization for virtual machine allocation and replica placement invirtualized Hadoop,” IEEE Transactions on Parallel and DistributedSystems, vol. 29, no. 11, pp. 2568–2581, Nov. 2018.

[24] W. Chen, I. Paik, and Z. Li, “Tology-aware optimal data placementalgorithm for network traffic optimization,” IEEE Transactions onComputers, vol. 65, no. 8, pp. 2603–2617, Aug. 2016.

[25] Y. Wang, Q. Chen, C. Kang, Q. Xia, and M. Luo, “Sparse and re-dundant representation-based smart meter data compression andpattern extraction,” IEEE Transactions on Power Systems, vol. 32,no. 3, pp. 2142–2151, Aug. 2017.

[26] N. Altiparmak and A. Tosun, “Multithreaded maximum flowbased optimal replica selection algorithm for heterogeneous stor-age architectures,” IEEE Transactions on Computers, vol. 65, no. 5,pp. 1543–1557, May 2016.

[27] V. Nagarajan and M. A. M. Mohamed, “A prediction-based dy-namic replication strategy for data-intensive applications,” Com-puters & Electrical Engineering, vol. 57, pp. 281–293, Jan. 2017.

[28] M. C. Hung and D. L. Yang, “An efficient fuzzy c-means clusteringalgorithm,” in Proceedings of 2001 IEEE International Conference onData Mining, Nov. 2001, pp. 225–232.

[29] Y. Peng, K. Chen, G. Wang, W. Bai, Y. Zhao, H. Wang, Y. Geng,Z. Ma, and L. Gu, “Towards comprehensive traffic forecasting incloud computing: Design and application,” IEEE/ACM Transac-tions on Networking, vol. 24, no. 4, pp. 2210–2222, Aug. 2016.

[30] T. P. Oliveira, J. S. Barbar, and A. S. Soares, “Computer networktraffic prediction: A comparison between traditional and deeplearning neural networks,” International Journal of Big Data Intelli-gence, vol. 3, no. 1, pp. 28–37, Sep. 2016.

[31] Z. Zhao, W. Chen, X. Wu, P. C. Chen, and J. M. Liu, “LSTM net-work: A deep learning approach for short-term traffic forecast,”IET Intelligent Transport Systems, vol. 11, pp. 68–75, Mar. 2017.

[32] S. Zaman and D. Grosu, “A distributed algorithm for the replicaplacement problem,” IEEE Transactions on Parallel and DistributedSystems, vol. 22, no. 9, pp. 1455–1468, Sep. 2011.

[33] C. Q. Huang, Y. Li, Y. Tang, and R. H. Huang, “A researchon replica management of cloud storage system for educationalresources,” Journal of Beijing University of Posts and Telecommuni-cations, vol. 2, pp. 93–97, Feb. 2013.

[34] T. T. Liu, C. Li, Q. C. Hu, and G. G. Zhang, “Multiple-replicasmanagement in the cloud environment,” Journal of ComputerResearch & Development, vol. 48, no. S3, pp. 254–260, Sep. 2011.

[35] P. Pospichal, J. Jaros, and J. Schwarz, “Parallel genetic algorithmon the CUDA architecture,” in Proceedings of European conference onthe applications of evolutionary computation, Apr. 2010, pp. 442–451.

[36] M. L Massie, B. N Chun, and D. Culler, “The Ganglia distribut-ed monitoring system: Design, implementation and experience,”Parallel Computing, vol. 30, pp. 817–840, Jul. 2004.

[37] Q. Huang, J. Huang, X. Wang, C. Huang, and X. Heng, “Bigdata based service platform and its typical applications of electricpower industries,” in Proceedings of 2nd International Conference onEnergy, Power and Electrical Engineering, Nov. 2017, pp. 184–191.

[38] A. Jaradat, A. H. Muhamad Amin, M. Zakaria, and K. Golden,“An enhanced grid performance data replica selection schemesatisfying user preferences quality of service,” European Journalof Scientific Research, vol. 73, pp. 527–538, Mar. 2012.

Changqin Huang received his Ph.D. degree in computer science and technology from Zhejiang University, Hangzhou, in 2005. Currently, he is a Professor at Zhejiang Normal University, Jinhua, China. He was a Visiting Scientist at Zhejiang University, and an Honorary Visiting Professor at La Trobe University. His research interests include service computing, big data, and semantic information retrieval. He is an awardee of "Pearl River Scholar".

Dr. Huang is a Senior Member of the China Computer Federation, and a member of ACM and IEEE. He has served as a Reviewer for several conferences and journals.

Qionghao Huang received his master's degree in Software Engineering from South China Normal University in 2018. He is currently a student in the School of Information Technology in Education, South China Normal University, Guangzhou, China. His main research interests include cloud computing and deep learning methods. He is a student member of CCF and IEEE.

Dianhui Wang (M’03-SM’05) was awarded aPh.D. from Northeastern University, Shenyang,China, in 1995. From 1995 to 2001, he workedas a Postdoctoral Fellow with Nanyang Techno-logical University, Singapore, and a Researcherwith The Hong Kong Polytechnic University,Hong Kong, China. He joined La Trobe Uni-versity in July 2001 and is currently a Readerand Associate Professor with the Department ofComputer Science and Information Technology,La Trobe University, Australia. Since 2010, he

has been a visiting Professor at The State Key Laboratory of SyntheticalAutomation of Process Industries, Northeastern University, China.His current research interests include industrial artificial intelligence,industrial big data oriented machine learning theory (deep stochasticconfiguration networks, http://deepscn.com) and its applications inprocess industries, intelligent sensing technology and power systems.

Dr. Wang is a Senior Member of IEEE and serves as an Associate Editor for IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Cybernetics, Information Sciences, and WIREs Data Mining and Knowledge Discovery.