

Fuzzy Sets and Systems

    www.elsevier.com/locate/fss

    Parallel sampling from big data with uncertainty distribution

Qing He a, Haocheng Wang a,b,*, Fuzhen Zhuang a, Tianfeng Shang a,b, Zhongzhi Shi a

a Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China

b University of Chinese Academy of Sciences, Beijing 100049, China

    Abstract

Data are inherently uncertain in most applications. Uncertainty is encountered whenever an experiment such as sampling is about to proceed: its result is not known to us in advance and may lead to a variety of potential outcomes. With the rapid development of data collection and distributed storage technologies, big data have become a bigger-than-ever problem, and dealing with big data with uncertainty distribution is one of the most important issues of big data research. In this paper, we propose a Parallel Sampling method based on Hyper Surface for big data with uncertainty distribution, namely PSHS, which adopts the universal concept of the Minimal Consistent Subset (MCS) of Hyper Surface Classification (HSC). Our treatment of uncertainty in sampling from big data rests on three observations: (1) the inherent structure of the original sample set is uncertain to us, (2) the boundary set formed by all possible separating hyper surfaces is a fuzzy set, and (3) the elements of the MCS are themselves uncertain. PSHS is implemented on the MapReduce framework, a current and powerful parallel programming technique used in many fields. Experiments have been carried out on several data sets, including real-world data from the UCI repository and synthetic data. The results show that our algorithm shrinks data sets while maintaining an identical distribution, which is useful for obtaining the inherent structure of the data sets. Furthermore, the evaluation criteria of speedup, scaleup and sizeup validate its efficiency.

© 2014 Elsevier B.V. All rights reserved.

    Keywords: Fuzzy boundary set; Uncertainty; Minimal consistent subset; Sampling; MapReduce

    1. Introduction

In many applications, data contain inherent uncertainty. The uncertainty phenomenon emerges owing to a lack of knowledge about the occurrence of some event. It is encountered when an experiment (sampling, classification, etc.) is to proceed and its result is not known to us; it may also refer to a variety of potential outcomes, ways of solution, etc. [1]. Uncertainty can also arise in categorical data; for example, the inherent structure of a given sample set is uncertain to us. Moreover, the role of each sample in the inherent structure of the sample set is uncertain.

* Corresponding author at: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China.

E-mail addresses: [email protected] (Q. He), [email protected] (H. Wang), [email protected] (F. Zhuang), [email protected] (T. Shang), [email protected] (Z. Shi).

http://dx.doi.org/10.1016/j.fss.2014.01.016

0165-0114/© 2014 Elsevier B.V. All rights reserved.


Fuzzy set theory, developed by Zadeh [2], is a suitable theory that has proved its ability to work in many real applications. It is worth noticing that fuzzy sets are a reasonable mathematical tool for handling uncertainty in data [3].

With the rapid development of data collection and distributed storage technologies, big data have become a bigger-than-ever problem nowadays. Furthermore, there is rapid growth in hybrid studies that connect uncertainty and big data. Dealing with big data with uncertainty distribution is one of the most important issues of big data research. Uncertainty in big data brings an interesting challenge as well as an opportunity. Many state-of-the-art methods can only handle small-scale data sets; therefore, processing big data with uncertainty distribution in parallel is very important.

Sampling techniques, which play a very important role in all classification methods, have attracted a large amount of research in the areas of machine learning and data mining. Furthermore, parallel sampling from big data with uncertainty distribution has become one of the most important tasks in the presence of the enormous amount of uncertain data produced these days.

Hyper Surface Classification (HSC), which is a general classification method based on the Jordan Curve Theorem, was put forward by He et al. [4]. In this method, a model of the hyper surface is obtained by adaptively dividing the sample space in the training process, and the separating hyper surface is then directly used to classify large databases. The data are classified according to whether the number of intersections with a radial is odd or even. It is a novel approach that needs neither a mapping from lower-dimensional space to higher-dimensional space nor a kernel function. HSC can efficiently and accurately classify two- and three-dimensional data. Furthermore, it can be extended to deal with high dimensional data through dimension reduction [5] or ensemble techniques [6].

In order to enhance HSC performance and analyze its generalization ability, the notion of Minimal Consistent Subset (MCS) was applied to the HSC method [7]. The MCS is defined as a consistent subset with a minimum number of elements. For the HSC method, the samples with the same category that fall into the same unit, which covers at most samples from the same category, form an equivalent class. The MCS of HSC is a sample subset obtained by selecting one and only one representative sample from each unit included in the hyper surface. As a result, some samples in the MCS are replaceable while others are not, leading to the uncertainty of elements in the MCS. Different MCSs of the same sample set contain the same number of elements, but the elements may be different samples. One of the most important features of the MCS is that it yields the same classification model as the entire sample set and can almost reflect its classification ability. For a given data set, this feature is useful for obtaining the inherent structure, which is uncertain to us. The MCS corresponds to many real-world situations, such as classroom teaching: the teacher explains at length some examples that form the Minimal Consistent Subset of various types of exercises, and the students, having been inspired, are then able to solve the related exercises. However, the existing serial algorithm can only be performed on a single computer, and it is difficult for this algorithm to handle big data with uncertainty distribution. In this paper, we propose a Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution to get the MCS of the original sample set whose inherent structure is uncertain to us. Experimental results in Section 4 show that PSHS can deal with large scale data sets effectively and efficiently.

Traditional sampling methods on huge amounts of data consume too much time or even cannot be applied to big data due to memory limitations. MapReduce was developed by Google as a software framework for parallel computing in a distributed environment [8,9]. It is used to process large amounts of raw data, such as documents crawled from the web, in parallel. In recent years, many classical data preprocessing, classification and clustering algorithms have been developed on the MapReduce framework. The MapReduce framework is provided with dynamic flexibility support and fault tolerance by Google and Hadoop. In addition, Hadoop can be easily deployed on commodity hardware.

The remainder of the paper is organized as follows. In Section 2, preliminary knowledge is described, including the HSC method, the MCS and MapReduce. Section 3 implements the PSHS algorithm on the MapReduce framework. In Section 4, we show our experimental results and evaluate our parallel algorithm in terms of effectiveness and efficiency. Finally, our conclusions are stated in Section 5.

    2. Preliminaries

    In this section we describe the preliminary knowledge, on which PSHS is based.


    2.1. Hyper surface classification

Hyper Surface Classification (HSC) is a general classification method based on the Jordan Curve Theorem in topology.

Theorem 1 (Jordan Curve Theorem). Let $X$ be a closed set in the $n$-dimensional space $R^n$. If $X$ is homeomorphic to the sphere $S^{n-1}$, then its complement $R^n \setminus X$ has two connected components, one called inside, the other called outside.

According to the Jordan Curve Theorem, a surface can be formed in an n-dimensional space and used as the separating hyper surface. For any given point, the following classification theorem can be used to determine whether the point is inside or outside the separating hyper surface.

Theorem 2 (Classification Theorem). For any given point $x \in R^n \setminus X$, $x$ is inside $X$ $\Leftrightarrow$ the winding number, i.e. the intersecting number between any radial from $x$ and $X$, is odd; and $x$ is outside $X$ $\Leftrightarrow$ the intersecting number between any radial from $x$ and $X$ is even.

The separating hyper surface is directly used to classify the data according to whether the number of intersections with the radial is odd or even [4]. This classification method is direct and convenient. From the two theorems above, $X$ is regarded as the classifier, which divides the space into two parts, and the classification process amounts simply to counting the intersecting number between a radial from the sample point and the classifier $X$. It is a novel approach that has no need of mapping from lower-dimensional space to higher-dimensional space, and HSC has no need of a kernel function. Furthermore, it can directly solve non-linear classification problems via the hyper surface.
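For illustration only, the following short Python sketch (ours, not part of the paper) makes Theorem 2 concrete in two dimensions: a closed polygon plays the role of the separating hyper surface X, and a query point is classified by counting how many times a rightward ray from the point crosses the polygon's edges; an odd count means inside, an even count means outside.

# Illustrative sketch of Theorem 2 in 2D: count ray/boundary crossings.
def crossings(point, polygon):
    """Count intersections between a rightward ray from `point` and the polygon."""
    x, y = point
    count = 0
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:          # crossing lies on the rightward ray
                count += 1
    return count

def classify(point, polygon):
    return "inside" if crossings(point, polygon) % 2 == 1 else "outside"

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(classify((2, 2), square))   # inside  (1 crossing)
print(classify((5, 2), square))   # outside (0 crossings)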

    2.2. Minimal consistent subset

To handle the high computational demands of the nearest neighbor (NN) rule, many efforts have been made to select a representative subset of the original training data, such as the condensed nearest neighbor rule (CNN) presented by Hart [10]. For a sample set, a consistent subset is a subset which, when used as a stored reference set for the NN rule, correctly classifies all of the remaining points in the sample set. The Minimal Consistent Subset (MCS) is defined as a consistent subset with a minimum number of elements. Hart's method indeed ensures consistency, but the condensed subset is not minimal, and it is sensitive to the randomly picked initial selection and to the order of consideration of the input samples. Since then, a lot of work has been done to reduce the size of the condensed subset [11–16]. The MCS of HSC is defined as follows.

For a finite sample set $S$, suppose $\tilde{C}$ is the collection of all subsets of $S$, and $C \subseteq \tilde{C}$ is a disjoint cover set for $S$, such that each element in $S$ belongs to one and only one member of $C$. The MCS is a sample subset formed by choosing one sample and only one sample from each member of the disjoint cover set $C$. For HSC, we call samples $a$ and $b$ equivalent if they belong to the same category and fall into the same unit which covers at most samples from the same category. The points falling into the same unit form an equivalent class. The cover set $C$ is the union of all equivalent classes in the hyper surface $H$. More specifically, let $\mathring{H}$ be the interior of $H$ and let $u$ be a unit in $\mathring{H}$. The MCS of HSC, denoted by $S_{\min}|_H$, is a sample subset formed by selecting one and only one representative sample from each unit included in the hyper surface, i.e.

$$S_{\min}|_H = \bigcup_{u \in \mathring{H}} \{\text{choose one and only one } s \in u\} \qquad (1)$$

The computation method for the MCS of a given sample set is described as follows:

1) Input the samples, containing $k$ categories and $d$ dimensions. Let the samples be distributed within a rectangular region.

2) Divide the rectangular region into $\underbrace{10 \times 10 \times \cdots \times 10}_{d}$ small regions called units.

3) If there are some units containing samples from two or more different categories, then divide them into smaller units repeatedly until each unit covers at most samples from the same category.


    Fig. 1. Fuzzy boundary set.

4) Label each unit with 1, 2, ..., k, according to the category of the samples inside, and unite adjacent units with the same label into a bigger unit.

5) For each sample in the set, locate its position in the model, i.e. figure out which unit it is located in.

6) Combine the samples located in the same unit into one equivalent class; a number of equivalent classes in different layers are then obtained.

7) Pick one sample and only one sample from each equivalent class to form the MCS of HSC.

The algorithm above is not sensitive to the randomly picked initial selection or to the order of consideration of the input samples. Some samples in the MCS are replaceable, while others are not. Some close samples within the same category that fall into the same unit are equivalent to each other in the building of the classifier, and each of them can be picked randomly for the MCS. On the contrary, sometimes there may be only one sample in a unit, and this sample plays a unique role in forming the hyper surface. Hence the outcome of the MCS is uncertain to us.

Note that different division granularities lead to different separating hyper surfaces and inherent structures. As seen in Fig. 1, each boundary denoted by a dotted line (l1, l2, l3, etc.) may be used in the division process, and all the possible separating hyper surfaces form a fuzzy boundary set. The samples in the fuzzy boundary set have different memberships for the separating hyper surface used in the division process. Specifically, the samples lying on dotted line l2 have the maximum membership, i.e. 1, for the separating hyper surface, while the samples lying on dotted lines l1 and l3 have uncertain memberships larger than 0.

For a specific sample set, the MCS almost reflects its classification ability. Any addition to the MCS will not improve the classification ability, while every single deletion from the MCS will lead to a loss in testing accuracy. This feature is useful for obtaining the inherent structure, which is uncertain to us. However, all of these operations must be executed in memory; when dealing with large scale data sets, the existing serial algorithm will encounter the problem of insufficient memory.

    2.3. MapReduce framework

MapReduce, as the framework shown in Fig. 2, is a simplified programming model and computation platform for processing distributed large scale data sets. It specifies the computation in terms of a map and a reduce function. The underlying runtime system automatically parallelizes the computation across large scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

    As its name shows, map and reduce are two basic operations in the model. Users specify a map function that

    processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all

    intermediate values associated with the same intermediate key.

All data processed by MapReduce are in the form of key-value pairs. The execution happens in two phases. In the first phase, the map function is called once for each input record. For each call, it may produce any number of intermediate key-value pairs. A map function takes a single key-value pair and outputs a list of new key-value pairs. The types of the output key and value can be different from those of the input key and value. It can be formalized as:


    Fig. 2. Illustration of the MapReduce framework: the map is applied to all input records, which generates intermediate results that are aggregated

    by the reduce.

map :: (key1, value1) → list(key2, value2)    (2)

In the second phase, these intermediate pairs are sorted and grouped by key2, and the reduce function is called once for each key. Finally, the reduce function is given all associated values for the key and outputs a new list of values. Mathematically, this can be represented as:

reduce :: (key2, list(value2)) → (key3, value3)    (3)

The MapReduce model provides sufficient high-level parallelization. Since the map function only takes a single record, all map operations are independent of each other and fully parallelizable. The reduce function can be executed in parallel on each set of intermediate pairs with the same key.
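As a minimal illustration of this execution model (our sketch, not the paper's implementation, and deliberately library-free rather than using Hadoop), the Python snippet below runs a map phase, groups the intermediate pairs by key, and applies a reduce phase; word count stands in for a real job.

from collections import defaultdict

# Toy single-process imitation of the two-phase MapReduce execution model.
def run_mapreduce(records, map_fn, reduce_fn):
    # Phase 1: map each input record to a list of (key2, value2) pairs.
    grouped = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in map_fn(key1, value1):
            grouped[key2].append(value2)          # shuffle: group by key2
    # Phase 2: reduce each (key2, list(value2)) to a (key3, value3) pair.
    return [reduce_fn(key2, values) for key2, values in grouped.items()]

# Word count as a stand-in job.
def wc_map(offset, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return (word, sum(counts))

records = [(0, "big data with uncertainty"), (25, "big data sampling")]
print(run_mapreduce(records, wc_map, wc_reduce))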

    3. Parallel sampling method based on hyper surface

In this section, the Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution is summarized. First, we give the representation of the hyper surface, inspired by decision trees. Second, we analyze the conversion from the serial parts to the parallel parts of the algorithm. Then we explain in detail how the necessary computations can be formalized as map and reduce operations under the MapReduce framework.

    3.1. Hyper surface representation

In fact, it is difficult to represent a hyper surface of $R^n$ space exactly in the computer. Inspired by decision trees, we can use labeled regions to approximate a hyper surface. All N input features except the class attribute can be considered to be real numbers in the range [0, 1). There is no loss of generality in this step: all physical quantities must have some upper and lower bounds on their range, so suitable linear or non-linear transformations to the interval [0, 1) can always be found. The inputs being in the range [0, 1) means that these real numbers can be expressed as decimal fractions. This is convenient because each successive digit position corresponds to a successive part of the feature space.
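One simple linear transformation of the kind meant here is min-max scaling; the small sketch below (ours, with illustrative bounds of our own choosing) maps a feature column into [0, 1) so that every value can be read as a decimal fraction.

# Min-max scaling into [0, 1); the bounds are illustrative, not from the paper.
def normalize(column, lo, hi, eps=1e-9):
    span = (hi - lo) * (1 + eps)      # keep results strictly below 1.0
    return [(v - lo) / span for v in column]

print(normalize([12.0, 30.0, 47.9], lo=0.0, hi=48.0))  # ~[0.25, 0.625, 0.998]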

Sampling is performed by simultaneously examining the most significant digit (MSD) of each of the N inputs. This either yields the equivalent class directly (a leaf of the tree), or indicates that we must examine the next most significant digit (descend down a branch of the tree) to determine the equivalent class. The next decimal digit then either yields the equivalent class or tells us to examine the following digit, and so on. Thus sampling is equivalent to finding the region (a leaf node of the tree) representing an equivalent class, and picking one sample and only one sample from each such region to form the MCS of HSC. As sampling processes one decimal digit at a time, even numbers with very long expansions, such as 0.873562, are handled with ease, because the minimal number of digits required for successful sampling is usually very small. Before data can be sampled, the decision tree must be constructed as follows:


Table 1
9 samples of 4-dimensional data.

Attribute 1   Attribute 2   Attribute 3   Attribute 4   Category
0.431         0.725         0.614         0.592         1
0.492         0.726         0.653         0.527         2
0.457         0.781         0.644         0.568         1
0.625         0.243         0.672         0.817         2
0.641         0.272         0.635         0.843         2
0.672         0.251         0.623         0.836         2
0.847         0.534         0.278         0.452         1
0.873         0.528         0.294         0.439         2
0.875         0.523         0.295         0.435         2

Table 2
The most significant digits of the 9 samples.

Sample   MSD    Category
s1       4765   1
s2       4765   2
s3       4765   1
s4       6268   2
s5       6268   2
s6       6268   2
s7       8524   1
s8       8524   2
s9       8524   2

1) Input all sample data, and normalize each dimension into [0, 1). The entire feature space is mapped to the inside of a unit hyper-cube, referred to as the root region.

2) Divide the region into sub regions by taking the most significant digit of each of the N inputs. Each arrangement of N decimal digits can be viewed as a sub region.

3) For each sub region, if the samples in it belong to the same class, label it with that class and attach a flag P, which means this region is pure, and construct a leaf node. Otherwise turn to step 4).

4) Label this region with the majority class and attach a flag N, denoting impurity. Then go to step 2) to get the next most significant digits of the input features, until all sub regions become pure.

From the above steps, we get a decision tree that describes the inherent structure of the data set. Every node of this decision tree can be regarded as a rule for classifying unseen data. For example, consider the 4-dimensional sample set shown in Table 1.

As all the samples are already normalized in [0, 1), we can skip the first step. Then, we get the most significant digits of every sample, as shown in Table 2.

The samples falling into region (6268) all belong to category 2, which means region (6268) is pure. So we label (6268) with category 2 and attach a flag P, and a rule (6268,2:P) is generated. Region (4765) has 2 samples of category 1 and 1 sample of category 2. So we label it with category 1 and attach a flag N, leading to a new rule (4765,1:N), and we must further divide it into sub regions. Similarly, for region (8524) we get a rule (8524,2:N) and should likewise divide it in the next iteration.

Table 3 shows the result of getting the next most significant digits of the samples falling in the impure regions. All the sub regions of regions (4765) and (8524) become pure, so we have rules (4375,1:P) and (7293,2:P) for the parent region (8524), and rules (3219,1:P), (9252,2:P) and (5846,1:P) for the parent region (4765). The decision tree can be constructed iteratively in this way. The decision tree that is functionally equivalent to the generated rules is shown in Fig. 3. Note that there is no need to construct the decision tree in memory; the rules can be generated straightforwardly (see the sketch following Fig. 3), which is what we exploit to design the Parallel Sampling method based on Hyper Surface.


Table 3
The next most significant digits.

Sample   MSD    Category
s1       3219   1
s2       9252   2
s3       5846   1
s7       4375   1
s8       7293   2
s9       7293   2

    Fig. 3. An equivalent decision tree of the generated rules.
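To make the construction concrete, the following self-contained Python sketch (ours, not the authors' code) replays the worked example of Tables 1–3: it forms region keys from the i-th decimal digits of the attributes, checks purity, and emits (region, label:flag) rules, recursing on impure regions.

from collections import defaultdict

# Samples from Table 1: four attributes in [0, 1) plus a class label.
samples = [
    ((0.431, 0.725, 0.614, 0.592), 1), ((0.492, 0.726, 0.653, 0.527), 2),
    ((0.457, 0.781, 0.644, 0.568), 1), ((0.625, 0.243, 0.672, 0.817), 2),
    ((0.641, 0.272, 0.635, 0.843), 2), ((0.672, 0.251, 0.623, 0.836), 2),
    ((0.847, 0.534, 0.278, 0.452), 1), ((0.873, 0.528, 0.294, 0.439), 2),
    ((0.875, 0.523, 0.295, 0.435), 2),
]

def digit(value, i):
    """i-th decimal digit of a number in [0, 1) (floating-point edge cases ignored)."""
    return int(value * 10 ** i) % 10

def build_rules(samples, layer=1):
    """Group samples by the digits of the current layer and emit (region, rule) pairs."""
    regions = defaultdict(list)
    for attrs, label in samples:
        key = "".join(str(digit(a, layer)) for a in attrs)
        regions[key].append((attrs, label))
    rules = []
    for key, members in regions.items():
        labels = [label for _, label in members]
        majority = max(set(labels), key=labels.count)
        if len(set(labels)) == 1:                       # pure region -> leaf
            rules.append((key, f"{majority}:P"))
        else:                                           # impure -> divide further
            rules.append((key, f"{majority}:N"))
            rules.extend(build_rules(members, layer + 1))
    return rules

print(build_rules(samples))
# includes ('6268', '2:P'), ('4765', '1:N'), ('3219', '1:P'), ('8524', '2:N'),
# ('7293', '2:P'), ... matching the worked example above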

    3.2. The analysis of MCS from serial to parallel

In the existing serial algorithm, the most common operation is to divide a region containing more than one class into smaller regions and then determine whether each sub region is pure or not. If a sub region is pure, the samples that fall into it will not provide any useful information for constructing other sub regions, so they can be removed from the samples. Determining whether the sub regions sharing the same parent region are pure or not can therefore be executed in parallel. From Section 2 we know that the process of computing the MCS is to construct a multi-branched tree whose function is similar to a decision tree. Therefore, we can construct the tree one layer at a time, from top to bottom, until each leaf node, which represents a region, is pure.

    3.3. The sampling process of PSHS

Following the analysis above, the PSHS algorithm needs three kinds of MapReduce jobs, run in iteration. In the first job, according to the value of each dimension, the map function assigns each sample to the region it belongs to, while the reduce function determines whether a region is pure or not and outputs a string representing the region and its purity attribute. After this job, a layer of the decision tree has been constructed, and we must remove the unnecessary samples that are not useful for constructing the next layer of the decision tree, which is the task of the second job. In the third job, i.e. the sampling job, the task of the map function is to assign each sample to the pure region it belongs to, according to the rules representing pure regions. Since samples in the same pure region are equivalent to each other in the building of the classifier, the reduce function can randomly pick one of them for the MCS. First, we present the details of the first job.

Map Step: The input data set is stored on HDFS (the Hadoop distributed file system) as a sequence file of <key, value> pairs, each of which represents a record in the data set. The key is the byte offset of the record from the start of the data file, and the value is a string containing the content of a sample and its class. The data set is split and globally broadcast to all mappers. The pseudo code of the map function is shown in Algorithm 1. We can pass some parameters to the job before the map function is invoked. For simplicity, we use dim to represent the dimension of the input features except the class attribute, and layer to represent the level of the tree to be constructed.

In Algorithm 1, the main goal is to get the corresponding region a sample belongs to, which is accomplished in steps 3 to 9. A character ':' is appended after getting the digits of each dimension to indicate that a layer is finished.


Algorithm 1 TreeMapper(key, value)
Input: (key: offset in bytes; value: text of a record)
Output: (key: a string representing a region; value: the class label of the input sample)
1. Parse the string value into an array data of size dim and its class label category;
2. Set string outkey as a null string;
3. for i = 1 to layer do
4.   for j = 0 to dim − 1 do
5.     append outkey with getNum(data[j], i)
6.   end for
7.   if i < layer then append outkey with ':'
8.   end if
9. end for
10. output(outkey, category)

Algorithm 2 getNum(num, n)
Input: (num: a value in [0, 1); n: the digit position to extract)
Output: the n-th decimal digit of num, as a character
1. while n > 0 do
2.   num ← num × 10
3.   n ← n − 1
4. end while
5. get the integer part of num and assign it to a variable ret
6. ret ← ret % 10
7. return the corresponding character of ret

We invoke a procedure getNum(num, n) in this process; its function is to get the n-th decimal digit of num, as described in Algorithm 2.
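For reference, here is a direct Python transcription (our sketch, not the authors' code) of getNum together with the way the mapper assembles the layered region key, with ':' separating layers; dim is implicit in the length of the data array and layer mirrors the job parameter described above.

# Sketch of getNum (Algorithm 2) and the region-key construction of Algorithm 1.
def get_num(num, n):
    """Return the n-th decimal digit of num (a value in [0, 1)) as a character."""
    while n > 0:
        num = num * 10
        n -= 1
    return str(int(num) % 10)

def tree_map_key(data, layer):
    """Build the region string for one sample: one digit group per layer, ':'-separated."""
    parts = []
    for i in range(1, layer + 1):
        parts.append("".join(get_num(x, i) for x in data))
    return ":".join(parts)

print(tree_map_key([0.431, 0.725, 0.614, 0.592], layer=2))  # "4765:3219"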

Reduce Step: The input of the reduce function is the data obtained from the map functions of all hosts. In the reduce function, we count the number of samples of each class. If the class labels of all samples in a region are identical, the region is pure; if a region is impure, we label it with the majority category. We first pass all the class labels, named categories, to the job as parameters, to be used in the reduce function. The pseudo code for the reduce function is shown in Algorithm 3. Fig. 4 shows the complete job procedure.
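The purity decision of this reduce step can be sketched in a few lines of Python (our illustration of the logic that Algorithm 3 spells out): count the class labels received for one region key and emit the majority class together with a P/N flag.

from collections import Counter

# Sketch of the purity check: one region key, the list of class labels that fell into it.
def tree_reduce(region_key, class_labels):
    counts = Counter(class_labels)
    majority, max_count = counts.most_common(1)[0]
    purity = "P" if max_count == len(class_labels) else "N"
    return region_key, f"{majority}:{purity}"

print(tree_reduce("4765", [1, 2, 1]))   # ('4765', '1:N')
print(tree_reduce("6268", [2, 2, 2]))   # ('6268', '2:P')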

When the first job has finished, we have a set of regions that cover all the samples. If a region is impure, we must divide it into sub regions until the sub regions are all pure. Hence, if a region is pure, the samples that fall in it are no longer needed and can be removed. The second job can therefore be viewed as a filter whose function is to remove the unnecessary samples that are not useful for constructing the next layer of the decision tree. We read the impure regions into memory before deciding whether a sample should be removed or not, using a variable set to store the impure regions. The second job's mapper is described in Algorithm 4. Hadoop provides a default reduce implementation which simply outputs the result of the mapper, and this is what we adopt in the second job. The complete job procedure can be seen in Fig. 5.

The first and second jobs run iteratively until all the samples have been removed, in other words until all the rules have been generated. We obtain several rule sets, each of which represents a layer of the decision tree. In the sampling job, according to the rules representing pure regions, the map function assigns each sample to the pure region it belongs to. The rules representing pure regions are read into memory before sampling, and a list variable rules is used to store all of them. The pseudo code for the map function of the sampling job is shown in Algorithm 5.

In the reduce function of the sampling job, we randomly pick one sample from each pure region for the MCS. The pseudo code of the reduce function is described in Algorithm 6. Fig. 6 shows the complete job procedure.


Algorithm 3 TreeReducer
Input: (key: a string representing a region; values: the list of class labels of all samples falling in this region)
Output: (key: identical to key; value: the class label of this region plus its purity attribute)
1. Initialize an array count to 0, with size equal to the number of class labels;
2. Initialize a counter totalnum to 0 to record the number of samples in this region;
3. while values.hasNext() do
4.   get a class label c from values.next()
5.   count[c] ← count[c] + 1
6.   totalnum ← totalnum + 1
7. end while
8. find the majority class max from count and its corresponding index i
9. if all samples belong to max, i.e. totalnum = count[i], then
10.   purity ← P
11. else
12.   purity ← N
13. end if
14. construct value as the combination of max and purity
15. output(key, value)

    Fig. 4. Generating a layer of the decision tree.

Algorithm 4 FilterMapper
Input: (key: offset in bytes; value: text of a record)
Output: (key: identical to value if this sample falls in an impure region; value: a null string)
1. if this sample matches a rule in set then
2.   output(value, "")
3. end if

    4. Experiments

In this section, we demonstrate the performance of our proposed algorithm with respect to effectiveness and efficiency by dealing with big data with uncertainty distribution, including real world data from the UCI machine learning repository and synthetic data. Performance experiments were run on a cluster of ten computers: six of them each have four 2.8 GHz cores and 4 GB memory, and the remaining four each have two 2.8 GHz cores and 4 GB memory. Hadoop version 0.20.0 and Java 1.6.0_22 were used as the MapReduce system for all experiments.


    Fig. 5. Filter.

Algorithm 5 SamplingMapper
Input: (key: offset in bytes; value: text of a record)
Output: (key: a string representing a pure region; value: identical to value)
1. Set string pureRegion as a null string;
2. for i = 0 to (rules.length − 1) do
3.   if this sample matches rules[i] then
4.     pureRegion ← the string representing the region of rules[i]
5.     output(pureRegion, value)
6.   end if
7. end for

Algorithm 6 SamplingReducer
Input: (key: a string representing a pure region; values: the list of all samples falling in this region)
Output: (key: one random sample of each pure region; value: a null string)
1. Set string samp as a null string;
2. if values.hasNext() then
3.   samp ← values.next()
4.   output(samp, "")
5. end if

    Fig. 6. Sampling.
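Putting the sampling job into code form, the sketch below (ours; region_of and the pure-region set are illustrative stand-ins for the rule matching of Algorithms 5 and 6) assigns each sample to the pure region it matches and keeps one randomly chosen representative per region, which together form the PMCS.

import random
from collections import defaultdict

# Sketch of the sampling job: map samples to their pure regions, then keep one per region.
def sample_mcs(samples, region_of, pure_regions):
    buckets = defaultdict(list)
    for s in samples:
        region = region_of(s)                 # map side: assign sample to a region
        if region in pure_regions:
            buckets[region].append(s)
    # reduce side: one (randomly chosen) representative per pure region
    return [random.choice(members) for members in buckets.values()]

pure = {"6268"}  # pure-region rule strings (illustrative)
data = [(0.625, 0.243, 0.672, 0.817), (0.641, 0.272, 0.635, 0.843)]
region_of = lambda s: "".join(str(int(v * 10) % 10) for v in s)
print(sample_mcs(data, region_of, pure))   # one representative of region 6268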


    4.1. Effectiveness

First of all, to illustrate the effectiveness of PSHS more vividly and clearly, the following figures are presented. We use two data sets from the UCI repository. The Waveform data set has 21 attributes, 3 categories and 5000 samples. The Poker Hand data set contains 25,010 samples from 10 categories in a ten-dimensional space. Both data sets are transformed into three dimensions by using the method in [5].

The serial MCS computation method mentioned in [7] is executed to obtain the MCS of the Poker Hand data set, which is then trained by HSC. The trained model of the hyper surface is shown in Fig. 7. Furthermore, we adopt the PSHS algorithm to obtain the MCS of this data set. For comparison, the MCS of a given sample set obtained by PSHS is denoted by PMCS, while the MCS obtained by the serial MCS computation method is denoted by MCS (the same below). The PMCS is also used for training, and its hyper surface structure is shown in Fig. 8.

From the two figures above, we can see that the hyper surface structures obtained from the MCS and the PMCS are exactly the same. Both have only one sample in each unit. No matter which we choose for training, MCS or PMCS, we get the same hyper surface maintaining an identical distribution. The same holds for the Waveform data set: the identical hyper surface structures obtained from its MCS and PMCS are shown in Fig. 9.

For a specific sample set, the Minimal Consistent Subset almost reflects its classification ability. Table 4 shows the classification ability of MCS and PMCS. All the data sets used here come from the UCI repository. From this table, we can see that the testing accuracy obtained from the PMCS is the same as that obtained from the MCS, which means that the PSHS algorithm is fully consistent with the serial MCS computation method.

One notable feature of PSHS, the ability to deal with big data with uncertainty distribution, is shown in Table 5. We obtain synthetic three-dimensional data by following the approach used in [4], and carry out the actual numerical sampling and classification. The sampling time of PSHS is much better than that of the serial MCS computation method, while achieving the same testing accuracy.

    4.2. Efficiency

We evaluate the efficiency of our proposed algorithm in terms of speedup, scaleup and sizeup [17] when dealing with big data with uncertainty distribution. We use the Breast Cancer Wisconsin data set from the UCI repository, which contains 699 samples from two different categories. The data set is first transformed into three dimensions by using the method in Ref. [5], and then replicated to obtain 3 million, 6 million, 12 million, and 24 million samples respectively.

Speedup: In order to measure the speedup, we keep the data set constant and increase the number of cores in the system. More specifically, we first apply the PSHS algorithm on a system consisting of 4 cores, and then gradually increase the number of cores. The number of cores varies from 4 to 32 and the size of the data set increases from 3 million to 24 million. The speedup given by a larger system with m cores is measured as:

Speedup(m) = (run-time on 1 core) / (run-time on m cores)    (4)

The perfect parallel algorithm demonstrates linear speedup: a system with m times the number of cores yields a speedup of m. In practice, linear speedup is difficult to achieve because of the communication cost and the skew of the slaves: the slowest slave determines the total time needed, so if not every slave needs the same time we have a skew problem.

We have performed the speedup evaluation on data sets of different sizes. Fig. 10 demonstrates the results. As the size of the data set increases, the speedup of PSHS becomes approximately linear, especially when the data set is large, such as 12 million or 24 million samples. We also notice that when the data set is small, such as 3 million samples, the performance of the 32-core system is not significantly better than that of the 16-core system, which does not accord with our intuition. The reason is that the time for processing the 3-million-sample data set is not much larger than the communication time among the nodes plus the time spent on fault tolerance. However, as the data set grows, the processing time dominates, leading to a good speedup performance.

Scaleup: Scaleup measures the ability to grow both the system and the data set size. It is defined as the ability of an m-times larger system to perform an m-times larger job in the same run-time as the original system. The scaleup metric is:

Scaleup(data, m) = (run-time for processing data on 1 core) / (run-time for processing m·data on m cores)    (5)


    Fig. 7. Poker Hand data set and hyper surface structure obtained by its MCS.


    Fig. 8. PMCS and hyper surface structure obtained by PMCS of Poker Hand data set.


    Fig. 9. The hyper surface structures obtained by MCS and PMCS of Waveform data set.

Table 4
Comparison of classification ability.

Data set                      Sample No.   MCS sample No.   PMCS sample No.   MCS accuracy   PMCS accuracy   Sampling ratio
Iris                          150          80               80                100%           100%            53.33%
Wine                          178          129              129               100%           100%            72.47%
Sonar                         208          186              186               100%           100%            89.42%
Wdbc                          569          268              268               100%           100%            47.10%
Pima                          768          506              506               99.21%         99.21%          65.89%
Contraceptive Method Choice   1473         1219             1219              100%           100%            82.76%
Waveform                      5000         4525             4525              99.84%         99.84%          90.50%
Breast Cancer Wisconsin       9002         1243             1243              99.85%         99.85%          13.81%
Poker Hand                    25,010       22,904           22,904            98.29%         98.29%          91.58%
Letter Recognition            20,000       13,668           13,668            90.47%         90.47%          68.34%
Ten Spiral                    33,750       7285             7285              100%           100%            21.59%

Table 5
Performance comparison on synthetic data.

Sample No.   Testing sample No.   MCS sample No.   PMCS sample No.   MCS sampling time   PMCS sampling time   MCS testing accuracy   PMCS testing accuracy
1,250,000    5,400,002            875,924          875,924           14 m 21 s           1 m 49 s             100%                   100%
5,400,002    10,500,000           1,412,358        1,412,358         58 m 47 s           6 m 52 s             100%                   100%
10,500,000   22,800,002           6,582,439        6,582,439         1 h 30 m 51 s       12 m 8 s             100%                   100%
22,800,002   54,000,000           12,359,545       12,359,545        3 h 15 m 37 s       25 m 16 s            100%                   100%
54,000,000   67,500,000           36,582,427       36,582,427        7 h 41 m 35 s       48 m 27 s            100%                   100%

To demonstrate how well PSHS deals with big data with uncertainty distribution when more cores are available, we have performed scalability experiments where we increase the size of the data set in proportion to the number of cores. The data sets of 3 million, 6 million, 12 million and 24 million samples are processed on 4, 8, 16 and 32 cores respectively. Fig. 11 shows the performance results on these data sets.


    Fig. 10. Speedup performance.

    Fig. 11. Scaleup performance.

As the data set becomes larger, the scalability of PSHS drops slowly. It always maintains a scaleup value higher than 84%. Obviously, the PSHS algorithm scales very well.

Sizeup: Sizeup analysis holds the number of cores in the system constant and grows the size of the data set. Sizeup measures how much longer it takes on a given system when the data set is m times larger than the original data set. The sizeup metric is defined as follows:

Sizeup(data, m) = (run-time for processing m·data) / (run-time for processing data)    (6)
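For concreteness, the three metrics of Eqs. (4)–(6) can be computed from measured run-times as below; the timings in the example are placeholders of our own, not measurements from the paper.

# Speedup, scaleup and sizeup as ratios of run-times; inputs are illustrative.
def speedup(t_1core, t_mcores):
    return t_1core / t_mcores

def scaleup(t_data_1core, t_mdata_mcores):
    return t_data_1core / t_mdata_mcores

def sizeup(t_data, t_mdata):
    return t_mdata / t_data

print(speedup(t_1core=3600, t_mcores=480))               # 7.5 on m = 8 cores
print(scaleup(t_data_1core=3600, t_mdata_mcores=4000))   # 0.9
print(sizeup(t_data=480, t_mdata=3200))                  # ~6.7 for m-times data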

To measure the sizeup performance, we fix the number of cores to 4, 8, 16 and 32 respectively. Fig. 12 shows the sizeup results on different numbers of cores. When the number of cores is small, such as 4 or 8, the sizeup performances differ little. However, as more cores become available, the value of sizeup on 16 or 32 cores decreases significantly compared to that of 4 or 8 cores on the same data sets. The graph demonstrates that PSHS has a very good sizeup performance.


    Fig. 12. Sizeup performance.

    5. Conclusion

With the advent of the big data era, the demand for processing big data with uncertainty distribution is increasing. In this paper, we present a Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution to obtain the Minimal Consistent Subset (MCS) of the original sample set, whose inherent structure is uncertain. Our experimental evaluation on both real and synthetic data sets showed that our approach not only obtains hyper surface structures and testing accuracy consistent with the serial algorithm, but also performs efficiently in terms of speedup, scaleup and sizeup. Besides, our algorithm can process big data with uncertainty distribution on commodity hardware efficiently. It should be noted that PSHS is a universal algorithm, but its features may be very different with different classification methods. We will conduct further experiments and refine the parallel algorithm to improve the usage efficiency of computing resources in the future.

    Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 61035003, 61175052, 61203297) and the National High-tech R&D Program of China (863 Program) (Nos. 2012AA011003, 2013AA01A606, 2014AA012205).

    References

[1] V. Novák, Are fuzzy sets a reasonable tool for modeling vague phenomena?, Fuzzy Sets Syst. 156 (2005) 341–348.
[2] L.A. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[3] D. Dubois, H. Prade, Gradualness, uncertainty and bipolarity: Making sense of fuzzy sets, Fuzzy Sets Syst. 192 (2012) 3–24.
[4] Q. He, Z. Shi, L. Ren, E. Lee, A novel classification method based on hypersurface, Math. Comput. Model. 38 (2003) 395–407.
[5] Q. He, X. Zhao, Z. Shi, Classification based on dimension transposition for high dimension data, Soft Comput. 11 (2007) 329–334.
[6] X. Zhao, Q. He, Z. Shi, Hypersurface classifiers ensemble for high dimensional data sets, in: Advances in Neural Networks – ISNN 2006, Springer, 2006, pp. 1299–1304.
[7] Q. He, X. Zhao, Z. Shi, Minimal consistent subset for hyper surface classification method, Int. J. Pattern Recognit. Artif. Intell. 22 (2008) 95–108.
[8] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (2008) 107–113.
[9] R. Lämmel, Google's MapReduce programming model – revisited, Sci. Comput. Program. 70 (2008) 1–30.
[10] P. Hart, The condensed nearest neighbor rule, IEEE Trans. Inf. Theory 14 (1968) 515–516.
[11] V. Cerverón, A. Fuertes, Parallel random search and tabu search for the minimal consistent subset selection problem, in: Randomization and Approximation Techniques in Computer Science, Springer, 1998, pp. 248–259.


[12] B.V. Dasarathy, Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design, IEEE Trans. Syst. Man Cybern. 24 (1994) 511–517.
[13] P.A. Devijver, J. Kittler, On the edited nearest neighbor rule, in: Proc. 5th Int. Conf. on Pattern Recognition, 1980, pp. 72–80.
[14] L.I. Kuncheva, Fitness functions in editing k-NN reference set by genetic algorithms, Pattern Recognit. 30 (1997) 1041–1049.
[15] C. Swonger, Sample set condensation for a condensed nearest neighbor decision rule for pattern recognition, in: Frontiers of Pattern Recognition, 1972, pp. 511–519.
[16] H. Zhang, G. Sun, Optimal reference subset selection for nearest neighbor classification by tabu search, Pattern Recognit. 35 (2002) 1481–1490.
[17] X. Xu, J. Jäger, H.-P. Kriegel, A fast parallel clustering algorithm for large spatial databases, in: High Performance Data Mining, Springer, 2002, pp. 263–290.
