parallel sampling
TRANSCRIPT
8/12/2019 parallel sampling
JID:FSS AID:6484 /FLA [m3SC+; v 1.188; Prn:5/03/2014; 8:28] P.1 (1-17)
Available online at www.sciencedirect.com
ScienceDirect
Fuzzy Sets and Systems ()
www.elsevier.com/locate/fss
Parallel sampling from big data with uncertainty distribution
Qing He a, Haocheng Wang a,b,∗, Fuzhen Zhuang a, Tianfeng Shang a,b, Zhongzhi Shi a
a Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
b University of Chinese Academy of Sciences, Beijing 100049, China
Abstract
Data are inherently uncertain in most applications. Uncertainty is encountered when an experiment such as sampling is to
proceed, the result of which is not known to us and may lead to a variety of potential outcomes. With the rapid development of
data collection and distributed storage technologies, big data have become a bigger-than-ever problem, and dealing with big data
with uncertainty distribution is one of the most important issues of big data research. In this paper, we propose a Parallel Sampling
method based on Hyper Surface for big data with uncertainty distribution, namely PSHS, which adopts the universal concept of
the Minimal Consistent Subset (MCS) of Hyper Surface Classification (HSC). Our inspiration for handling uncertainties in sampling
from big data rests on the facts that (1) the inherent structure of the original sample set is uncertain for us, (2) the boundary set formed of all
the possible separating hyper surfaces is a fuzzy set, and (3) the elements in the MCS are uncertain. PSHS is implemented on the
MapReduce framework, which is a current and powerful parallel programming technique used in many fields. Experiments have
been carried out on several data sets, including real world data from the UCI repository and synthetic data. The results show that our algorithm shrinks data sets while maintaining an identical distribution, which is useful for obtaining the inherent structure of the data
sets. Furthermore, the evaluation criteria of speedup, scaleup and sizeup validate its efficiency.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Fuzzy boundary set; Uncertainty; Minimal consistent subset; Sampling; MapReduce
1. Introduction
In many applications, data contain inherent uncertainty. The uncertainty phenomenon emerges owing to the lack
of knowledge about the occurrence of some event. It is encountered when an experiment (sampling, classification,
etc.) is to proceed, the result of which is not known to us; it may also refer to a variety of potential outcomes, ways of
solution, etc. [1]. Uncertainty can also arise in categorical data; for example, the inherent structure of a given sample
set is uncertain for us. Moreover, the role of each sample in the inherent structure of the sample set is uncertain.
* Corresponding author at: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing
Technology, CAS, Beijing, 100190, China.
E-mail addresses: [email protected] (Q. He), [email protected] (H. Wang), [email protected] (F. Zhuang), [email protected]
(T. Shang), [email protected] (Z. Shi).
http://dx.doi.org/10.1016/j.fss.2014.01.016
0165-0114/© 2014 Elsevier B.V. All rights reserved.
Fuzzy set theory, developed by Zadeh [2], has proved its ability to work in many real applications.
It is worth noticing that fuzzy sets are a reasonable mathematical tool for handling the uncertainty in data [3].
With the rapid development of data collection and distributed storage technologies, big data have become a
bigger-than-ever problem nowadays. Furthermore, there is rapid growth in hybrid studies which connect
uncertainty and big data together, and dealing with big data with uncertainty distribution is one of the most important
issues of big data research. Uncertainty in big data brings an interesting challenge as well as an opportunity. Many state-of-the-art methods can only handle small scale data sets; therefore, processing big data with uncertainty
distribution in parallel is very important.
Sampling techniques, which play a very important role in all classification methods, have attracted a large amount of
research in the areas of machine learning and data mining. Furthermore, parallel sampling from big data with uncer-
tainty distribution has become one of the most important tasks in the presence of the enormous amount of uncertain data
produced these days.
Hyper Surface Classification (HSC), which is a general classification method based on the Jordan Curve Theorem, was
put forward by He et al. [4]. In this method, a model of the hyper surface is obtained by adaptively dividing the sample
space in the training process, and then the separating hyper surface is directly used to classify large databases. The data
are classified according to whether the number of intersections with a radial is odd or even. It is a novel approach
which needs neither a mapping from lower-dimensional space to higher-dimensional space nor a kernel function. HSC can efficiently and accurately classify two and three dimensional data. Furthermore, it can be extended
to deal with high dimensional data via dimension reduction [5] or ensemble techniques [6].
In order to enhance HSC performance and analyze its generalization ability, the notion of the Minimal Consistent
Subset (MCS) is applied to the HSC method [7]. The MCS is defined as a consistent subset with a minimum number of
elements. For the HSC method, the samples with the same category and falling into the same unit, which covers at most
samples from the same category, make an equivalent class. The MCS of HSC is a sample subset formed by selecting
one and only one representative sample from each unit included in the hyper surface. As a result, some samples in the
MCS are replaceable, while others are not, leading to the uncertainty of elements in the MCS. Different MCSs include the same
number of elements, but the elements may be different samples. One of the most important features of the MCS is that it
has the same classification model as the entire sample set, and can almost reflect its classification ability. For a given
data set, this feature is useful for obtaining the inherent structure, which is uncertain for us. The MCS corresponds to many real world problems, like classroom teaching. Specifically, the teacher explains at length some examples which form the Minimal
Consistent Subset of various types of exercises to his students; then the students, having been inspired, will be
able to solve the related exercises. However, the existing serial algorithm can only be performed on a single computer,
and it is difficult for this algorithm to handle big data with uncertainty distribution. In this paper, we propose a Parallel
Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution to get the MCS of the
original sample set, whose inherent structure is uncertain for us. Experimental results in Section 4 show that PSHS can
deal with large scale data sets effectively and efficiently.
Traditional sampling methods on huge amounts of data consume too much time or even cannot be applied to big
data due to memory limitations. MapReduce was developed by Google as a software framework for parallel comput-
ing in a distributed environment [8,9]. It is used to process large amounts of raw data, such as documents crawled
from the web, in parallel. In recent years, many classical data preprocessing, classification and clustering algorithms have been developed on the MapReduce framework. The MapReduce framework is provided with dynamic flexibility
support and fault tolerance by Google and Hadoop. In addition, Hadoop can be easily deployed on commodity hard-
ware.
The remainder of the paper is organized as follows. In Section 2, preliminary knowledge is described, including
the HSC method, MCS and MapReduce. Section 3 implements the PSHS algorithm based on the MapReduce framework.
In Section 4, we show our experimental results and evaluate our parallel algorithm in terms of effectiveness and
efficiency. Finally, our conclusions are stated in Section 5.
2. Preliminaries
In this section we describe the preliminary knowledge, on which PSHS is based.
2.1. Hyper surface classification
Hyper Surface Classification (HSC) is a general classification method based on Jordan Curve Theorem in Topology.
Theorem 1 (Jordan Curve Theorem). Let X be a closed set in n-dimensional space R^n. If X is homeomorphic to a
sphere S^(n−1), then its complement R^n \ X has two connected components, one called inside, the other called outside.
According to the Jordan Curve Theorem, a surface can be formed in an n-dimensional space and used as the
separating hyper surface. For any given point, the following classification theory can be used to determine whether
the point is inside or outside the separating hyper surface.
Theorem 2 (Classification Theorem). For any given point x ∈ R^n \ X, x is inside of X if and only if the winding number, i.e. the
intersecting number between any radial from x and X, is odd; and x is outside of X if and only if the intersecting number
between any radial from x and X is even.
The separating hyper surface is directly used to classify the data according to whether the number of intersections
with a radial is odd or even [4]. This classification method is direct and convenient. From the two
theorems above, X is regarded as the classifier, which divides the space into two parts, and the classification process
is very easy: just count the intersecting number between a radial from the sample point and the classifier X. It
is a novel approach that needs no mapping from lower-dimensional space to higher-dimensional space.
HSC has no need of a kernel function. Furthermore, it can directly solve non-linear classification problems via the
hyper surface.
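The parity test of Theorem 2 can be illustrated in two dimensions, where the separating hyper surface degenerates to a closed polygon. The sketch below is a minimal illustration, not the paper's implementation; the polygon, function names and the horizontal-ray choice are assumptions made for the example.

```python
# Illustration of the Classification Theorem: a point is inside the closed
# curve X iff a radial from it intersects X an odd number of times.
def crossings(point, polygon):
    """Count intersections of a rightward horizontal ray from `point`
    with the polygon's edges (the polygon stands in for X)."""
    x, y = point
    count = 0
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        # Only edges that straddle the ray's y-coordinate can intersect it.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the ray's line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                count += 1
    return count

def classify(point, polygon):
    # Odd number of intersections -> inside; even -> outside.
    return "inside" if crossings(point, polygon) % 2 == 1 else "outside"

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(classify((2, 2), square))   # inside
print(classify((5, 2), square))   # outside
```

No mapping to a higher-dimensional space and no kernel function appear anywhere in this test, which is the point the paragraph above makes.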
2.2. Minimal consistent subset
To handle the high computational demands of the nearest neighbor (NN) rule, many efforts have been made
to select a representative subset of the original training data, like the condensed nearest neighbor rule (CNN)
presented by Hart [10]. For a sample set, a consistent subset is a subset which, when used as a stored reference set
for the NN rule, correctly classifies all of the remaining points in the sample set. And the Minimal Consistent Subset (MCS) is defined as a consistent subset with a minimum number of elements. Hart's method indeed ensures consistency,
but the condensed subset is not minimal, and it is sensitive to the randomly picked initial selection and to the order of
consideration of the input samples. After that, a lot of work has been done to reduce the size of the condensed subset
[11–16]. The MCS of HSC is defined as follows.
For a finite sample set S, suppose C is the collection of all subsets of S, and C̄ ⊆ C is a disjoint cover set for S, such
that each element in S belongs to one and only one member of C̄. The MCS is a sample subset formed by choosing
one and only one sample from each member of the disjoint cover set C̄. For HSC, we call samples a and b
equivalent if they belong to the same category and fall into the same unit, which covers at most samples from the same
category. And the points falling into the same unit form an equivalent class. The cover set C̄ is the union of all
equivalent classes in the hyper surface H. More specifically, let H° be the interior of H and u be a unit in H°. The MCS
of HSC, denoted by S_min|H, is a sample subset formed by selecting one and only one representative sample from each unit included in the hyper surface, i.e.

    S_min|H = ⋃_{u ∈ H°} {one and only one s ∈ u}    (1)
The computation method for the MCS of a given sample set is described as follows:
1) Input the samples, containing k categories and d dimensions. Let the samples be distributed within a rectangular
region.
2) Divide the rectangular region into 10 × 10 × ⋯ × 10 (d times) small regions, called units.
3) If there are some units containing samples from two or more different categories, then divide them into smaller
units repeatedly until each unit covers at most samples from the same category.
Fig. 1. Fuzzy boundary set.
4) Label each unit with 1, 2, . . . , k, according to the category of the samples inside, and unite the adjacent units with
the same label into a bigger unit.
5) For each sample in the set, locate its position in the model, which means to figure out which unit it is located in.
6) Combine samples that are located in the same unit into one equivalent class; then a number of equivalent classes
in different layers are obtained.
7) Pick up one sample and only one sample from each equivalent class to form the MCS of HSC.
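The steps above can be sketched in a few lines. The sketch below is a toy, in-memory illustration under stated assumptions: a 2-dimensional sample set, units identified by the leading decimal digits of each coordinate, and refinement of impure units by one more digit per round; the function names and data are invented for the example.

```python
# Toy sketch of the MCS computation: partition samples into 10 x 10 units,
# refine impure units, then keep one representative per pure unit.
from collections import defaultdict

def cell_key(point, depth):
    # The first `depth` decimal digits of each coordinate identify a unit.
    return tuple(int(c * 10**depth) for c in point)

def minimal_consistent_subset(samples, max_depth=6):
    """samples: list of ((x, y), label). Returns one sample per pure unit."""
    mcs, pending, depth = [], samples, 1
    while pending and depth <= max_depth:
        units = defaultdict(list)
        for point, label in pending:
            units[cell_key(point, depth)].append((point, label))
        pending = []
        for members in units.values():
            labels = {label for _, label in members}
            if len(labels) == 1:          # pure unit: an equivalent class
                mcs.append(members[0])    # any member is a valid representative
            else:                         # impure unit: subdivide further
                pending.extend(members)
        depth += 1
    return mcs

data = [((0.12, 0.34), "A"), ((0.13, 0.35), "A"), ((0.87, 0.90), "B")]
print(minimal_consistent_subset(data))
```

Here the two nearby "A" samples fall into the same pure unit and only one of them survives, which mirrors the replaceability of some MCS elements discussed below.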
The algorithm above is not sensitive to the randomly picked initial selection and to the order of consideration of
the input samples. And some samples in the MCS are replaceable, while others are not. Some close samples within
the same category and falling into the same unit are equivalent to each other in the building of the classifier, and each
of them can be picked randomly for the MCS. On the contrary, sometimes there can be only one sample in a unit, and
this sample plays a unique role in forming the hyper surface. Hence the outcome of MCS is uncertain for us.
Note that different division granularities lead to different separating hyper surfaces and inherent structures. As
seen in Fig. 1, each boundary denoted by a dotted line (l1, l2, l3, etc.) may be used in the division process, and all the possible separating hyper surfaces form a fuzzy boundary set. The samples in the fuzzy boundary set have different
memberships for the separating hyper surface used in the division process. Specifically, the samples lying on dotted line
l2 have the maximum membership, i.e. 1, for the separating hyper surface, while the samples lying on dotted lines l1 and l3 have uncertain memberships larger than 0.
For a specific sample set, the MCS almost reflects its classification ability. Any addition to the MCS will not
improve the classification ability, while any single deletion from the MCS will lead to a drop in testing accuracy. This
feature is useful for obtaining the inherent structure, which is uncertain for us. However, all of the operations must be
executed in memory; when dealing with large scale data sets, the existing serial algorithm will encounter the problem
of insufficient memory.
2.3. MapReduce framework
MapReduce, as the framework shown in Fig. 2, is a simplified programming model and computation platform for
processing distributed large scale data sets. It specifies the computation in terms of a map and a reduce function. The
underlying runtime system automatically parallelizes the computation across large scale cluster of machines, handles
machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
As its name shows, map and reduce are two basic operations in the model. Users specify a map function that
processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all
intermediate values associated with the same intermediate key.
All data processed by MapReduce are in the form of key-value pairs. The execution happens in two phases. In
the first phase, the map function is called once for each input record. For each call, it may produce any number of
intermediate key-value pairs. A map function is used to take a single key-value pair and output a list of new key-value
pairs. The type of output key and value can be different from input key and value. It could be formalized as:
Fig. 2. Illustration of the MapReduce framework: the map is applied to all input records, which generates intermediate results that are aggregated
by the reduce.
map :: (key1, value1) → list(key2, value2)    (2)
In the second phase, these intermediate pairs are sorted and grouped by the key2, and the reduce function is called
once for each key. Finally, the reduce function is given all associated values for the key and outputs a new list of
values. Mathematically, this could be represented as:
reduce :: (key2, list(value2)) → (key3, value3)    (3)
The MapReduce model provides sufficient high-level parallelization. Since the map function only takes a single
record, all map operations are independent of each other and fully parallelizable. The reduce function can be executed in
parallel on each set of intermediate pairs with the same key.
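The contract of Eqs. (2) and (3) can be simulated in memory: map emits (key2, value2) pairs, the pairs are grouped by key2, and reduce folds each group. The driver below is a minimal sketch for exposition, not a distributed runtime; the function names and the word-count example are illustrative assumptions.

```python
# Minimal in-memory simulation of the map/reduce contract.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    intermediate = defaultdict(list)
    # Phase 1: the mapper is called once per input record and may emit
    # any number of intermediate (key2, value2) pairs.
    for record in records:
        for key2, value2 in mapper(record):
            intermediate[key2].append(value2)
    # Phase 2: the reducer receives each key with all its associated values.
    return {key: reducer(key, values) for key, values in intermediate.items()}

# Word count, the canonical MapReduce example.
docs = ["big data", "big uncertainty"]
counts = map_reduce(
    docs,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'big': 2, 'data': 1, 'uncertainty': 1}
```

Because each mapper call sees a single record and each reducer call sees a single key group, both phases parallelize trivially, which is exactly the property the paragraph above relies on.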
3. Parallel sampling method based on hyper surface
In this section, the Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distri-
bution will be summarized. Firstly, we give the representation of the hyper surface, inspired by decision trees. Secondly, we
analyze the conversion from the serial parts to the parallel parts of the algorithm. Then we explain in detail how the necessary
computations can be formalized as map and reduce operations under the MapReduce framework.
3.1. Hyper surface representation
In fact, it is difficult to exactly represent a hyper surface of R^n space in the computer. Inspired by decision trees,
we can use some labeled regions to approximate a hyper surface. All N input features except the class attribute can
be considered to be real numbers in the range [0, 1). There is no loss of generality in this step: all physical quantities
must have some upper and lower bounds on their range, so suitable linear or non-linear transformations to the interval [0, 1) can always be found. The inputs being in the range [0, 1) means that these real numbers can be expressed as
decimal fractions. This is convenient because each successive digit position corresponds to a successive part of the
feature space.
Sampling is performed by simultaneously examining the most significant digit (MSD) of each of the N inputs.
This either yields the equivalent class directly (a leaf of the tree), or indicates that we must examine the next most
significant digit (descend down a branch of the tree) to determine the equivalent class. The next decimal digit then
either yields the equivalent class, or tells us to examine the following digit further, and so on. Thus sampling is
equivalent to finding the region (a leaf node of the tree) representing an equivalent class, and picking up one sample and
only one sample from each region to form the MCS of HSC. As sampling occurs one decimal digit at a time, even numbers
with many digits, such as 0.873562, are handled with ease, because the number of digits required for successful sampling
is usually very small. Before data can be sampled, the decision tree must be constructed as follows:
Table 1
9 samples of 4-dimensional data.
Attribute 1 Attribute 2 Attribute 3 Attribute 4 Category
0.431 0.725 0.614 0.592 1
0.492 0.726 0.653 0.527 2
0.457 0.781 0.644 0.568 1
0.625 0.243 0.672 0.817 2
0.641 0.272 0.635 0.843 2
0.672 0.251 0.623 0.836 2
0.847 0.534 0.278 0.452 1
0.873 0.528 0.294 0.439 2
0.875 0.523 0.295 0.435 2
Table 2
The most significant digits of 9 samples.
Sample MSD Category
s1 4765 1
s2 4765 2
s3 4765 1
s4 6268 2
s5 6268 2
s6 6268 2
s7 8524 1
s8 8524 2
s9 8524 2
1) Input all sample data, and normalize each dimension of them to [0, 1). The entire feature space is mapped
to the inside of a unit hyper-cube, referred to as the root region.
2) Divide the region into sub regions by getting the most significant digit of each of the N inputs. Each arrangement
of the N decimal digits can be viewed as a sub region.
3) For each sub region, if the samples in it belong to the same class, then label it with the samples' class and attach
a flag P, which means this region is pure and we can construct a leaf node. Else turn to step 4).
4) Label this region with the majority class and attach a flag N, on behalf of impurity. Then go to step 2) to get the
next most significant digits of the input features, until all the sub regions become pure.
From the above steps, we can get a decision tree that describes the inherent structure of the data set. Every node
of this decision tree can be regarded as a rule to classify unseen data. For example, consider the 4-dimensional sample set
shown in Table 1.
As all the samples have been normalized in [0, 1), we can skip the first step. Then, we get the most significant digits of every sample, as shown in Table 2.
The samples falling into the region (6268) belong to the same category 2, which means region (6268) is pure. So
we label (6268) with category 2 and attach a flag P, then a rule (6268,2:P) is generated. Region (4765) has 2 samples
of category 1 and 1 sample of category 2. So we label it with category 1 and attach a flag N, leading to a new rule
(4765,1:N). And we must further divide it into sub regions. Similarly, for region (8524) we can get a rule (8524,2:N)
and should also divide it in the next iteration.
Table 3 shows the result of getting the next most significant digits of the samples falling in the impure regions. All
the sub regions of regions (4765) and (8524) become pure, so we have rules (4375,1:P) and (7293,2:P) for the parent
region (8524), and rules (3219,1:P), (9252,2:P) and (5846,1:P) for the parent region (4765). The decision tree can be
constructed iteratively. The decision tree having the equivalent function of the generated rules is shown in Fig. 3. We
notice that there is no need to construct the decision tree in memory; the rules can be generated straightforwardly,
which can be used to design the Parallel Sampling method based on Hyper Surface.
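The rule generation walked through above can be reproduced in a few lines: a region key is the concatenation of the n-th most significant digits of each attribute, and a region is pure (flag P) when all its samples share one category, else it is labeled with the majority class (flag N). The sketch below is illustrative only; the function names are invented, and the data are the nine samples of Table 1.

```python
# Sketch of one level of rule generation from the running example.
from collections import Counter, defaultdict

samples = {  # the nine 4-dimensional samples of Table 1
    "s1": ([0.431, 0.725, 0.614, 0.592], 1),
    "s2": ([0.492, 0.726, 0.653, 0.527], 2),
    "s3": ([0.457, 0.781, 0.644, 0.568], 1),
    "s4": ([0.625, 0.243, 0.672, 0.817], 2),
    "s5": ([0.641, 0.272, 0.635, 0.843], 2),
    "s6": ([0.672, 0.251, 0.623, 0.836], 2),
    "s7": ([0.847, 0.534, 0.278, 0.452], 1),
    "s8": ([0.873, 0.528, 0.294, 0.439], 2),
    "s9": ([0.875, 0.523, 0.295, 0.435], 2),
}

def msd(value, n):
    # n-th most significant decimal digit of a value in [0, 1)
    return int(value * 10**n) % 10

def rules_at_level(names, n):
    regions = defaultdict(list)
    for name in names:
        attrs, cat = samples[name]
        key = "".join(str(msd(a, n)) for a in attrs)
        regions[key].append(cat)
    # Majority class, plus P (pure) or N (impure) flag per region.
    return {key: (Counter(cats).most_common(1)[0][0],
                  "P" if len(set(cats)) == 1 else "N")
            for key, cats in regions.items()}

print(rules_at_level(samples, 1))
# {'4765': (1, 'N'), '6268': (2, 'P'), '8524': (2, 'N')}
```

This reproduces the three level-1 rules of the example: (6268,2:P), (4765,1:N) and (8524,2:N); running the same function at level 2 on the impure regions yields the rules of Table 3.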
Table 3
The next most significant digits.
Sample MSD Category
s1 3219 1
s2 9252 2
s3 5846 1
s7 4375 1
s8 7293 2
s9 7293 2
Fig. 3. An equivalent decision tree of the generated rules.
3.2. The analysis of MCS from serial to parallel
In the existing serial algorithm, the most common operation is to divide a region containing more than one class
into smaller regions and then determine whether a sub region is pure or not. If a sub region is pure, all the samples
that fall into it will not provide any useful information for constructing other sub regions, thus they can be removed from
the samples. So determining whether the sub regions having the same parent region are pure or not can be executed in parallel. From Section 2 we know that the process of computing the MCS is to construct a multi-branched tree whose function is
similar to a decision tree. Therefore, we can construct one layer of the tree iteratively, from top to bottom, until each
leaf node that represents a region is pure.
3.3. The sampling process of PSHS
Following the analysis above, the PSHS algorithm needs three kinds of MapReduce jobs run in iteration. In the first job, according
to the value of each dimension, the map function performs the procedure of assigning each sample to the region it
belongs to, while the reduce function performs the procedure of determining whether a region is pure or not, and
outputs a string representing the region and its purity attribute. After this job, a layer of the decision tree has been
constructed, and we must remove the unnecessary samples that are not useful for constructing the next layer of the decision tree, which is the task of the second job. In the third job, i.e. the sampling job, the task of the map function is to assign each
sample to the pure region it belongs to according to the rules representing pure regions. Since samples in the same pure
region are equivalent to each other in the building of the classifier, the reduce function can randomly pick one of them
for the MCS. Firstly, we present the details of the first job.
Map Step: The input data set is stored on HDFS, the file system of Hadoop, as a sequence file of <key, value>
pairs, each of which represents a record in the data set. The key is the offset in bytes of this record from the start point
of the data file, and the value is a string of the content of a sample and its class. The data set is split and globally
broadcast to all mappers. The pseudo code of the map function is shown in Algorithm 1. We can pass some parameters
to the job before the map function is invoked. For simplicity, we use dim to represent the dimension of the input features
except the class attribute, and layer to represent the level of the tree to be constructed.
In Algorithm 1, the main goal is to get the corresponding region a sample belongs to, which is accomplished
from step 3 to step 9. A character ':' is appended after getting the digits of each dimension to indicate that a layer is
Algorithm 1 TreeMapper(key, value)
Input: (key: offset in bytes; value: text of a record)
Output: (key: a string representing a region; value: the class label of the input sample)
1. Parse the string value to an array, named data, of size dim, and its class label, named category;
2. Set string outkey as a null string;
3. for i = 1 to layer do
4.   for j = 0 to dim − 1 do
5.     append outkey with getNum(data[j], i)
6.   end for
7.   append outkey with ':'
8. end for
9. output(outkey, category)

Algorithm 2 getNum(num, n)
1. i ← n
2. while i > 0 do
3.   num ← num × 10
4.   i ← i − 1
5. end while
6. get the integer part of num and assign it to a variable ret
7. ret ← ret % 10
8. return the corresponding character of ret
finished. We invoked a procedure getNum(num, n) in the process. Its function is getting the n-th digit of num, which
is described in Algorithm 2.
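The getNum procedure can be transcribed directly, under the paper's assumption that inputs lie in [0, 1): shift the number left by n decimal places and read off the last digit of the integer part. The snake-case name below is our own; the logic follows Algorithm 2.

```python
# Transcription of the getNum procedure (Algorithm 2).
def get_num(num, n):
    """Return the n-th most significant decimal digit of num in [0, 1)."""
    for _ in range(n):
        num *= 10          # shift one decimal place left per iteration
    return int(num) % 10   # last digit of the integer part

# For 0.873562: the 1st digit is 8, the 3rd is 3.
print(get_num(0.873562, 1))  # 8
print(get_num(0.873562, 3))  # 3
```

The mapper concatenates these digits across all dim attributes and layer levels (with ':' between levels) to form the region key emitted for each sample.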
Reduce Step: The input of the reduce function is the data obtained from the map function of each host. In the reduce
function, we can count the number of samples for each class. If the class labels of all samples in a region are identical,
this region is pure. If a region is impure, we will label it with the majority category. First we pass all the class labels,
named categories, to the job as parameters, which will be used in the reduce function. The pseudo code for the reduce
function is shown in Algorithm 3. Fig. 4 shows the complete job procedure.
When the first job is finished, we can get a set of regions that cover all the samples. If a region is impure, we must
divide it into sub regions until the sub regions are all pure. Hence, if a region is pure, the samples that fall in it are
not needed and can be removed. Therefore, the second job can be referred to as a filter, whose function is to remove
the unnecessary samples that are not useful for constructing the next layer of the decision tree. We should read the impure
regions into memory before we can decide whether a sample should be removed or not. We use a variable set to store
the impure regions. Then the second job's mapper can be described in Algorithm 4. Hadoop provides a default reduce
implementation which outputs the result of the mapper, and it is what we adopt in the second job. The complete job
procedure can be seen in Fig. 5.
The first job and the second job run iteratively until the samples are all removed, in other words until all the rules have been
generated. We can get several rule sets, each of which represents a layer of the decision tree. In the sampling job,
according to the rules representing pure regions, the map function performs the procedure of assigning each sample
to the pure region it belongs to. The rules representing pure regions should be read into memory before sampling. A list
variable rules is used to store all these rules. Then the pseudo code for the map function of the sampling job is shown in
Algorithm 5.
In the reduce function of the sampling job, we can randomly pick one sample from each pure region for the MCS. The
pseudo code of the reduce function is described in Algorithm 6. Fig. 6 shows the complete job procedure.
Algorithm 3 TreeReducer
Input: (key: a string representing a region; values: the list of class labels of all samples falling in this region)
Output: (key: identical to key; value: the class label of this region plus its purity attribute)
1. Initialize an array count to 0, with size equal to the number of all the class labels;
2. Initialize a counter totalnum to 0 to record the number of samples in this region;
3. while values.hasNext() do
4.   get a class label c from values.next()
5.   count[c]++
6.   totalnum++
7. end while
8. find the majority class max from count and its corresponding index i
9. if all samples belong to max, i.e. totalnum = count[i] then
10.   purity ← P
11. else
12.   purity ← N
13. end if
14. construct value as the combination of max and purity
15. output(key, value)
Fig. 4. Generating a layer of the decision tree.
Algorithm 4 FilterMapper
Input: (key: offset in bytes; value: text of a record)
Output: (key: identical to value if this sample falls in an impure region; value: a null string)
1. if this sample matches a rule of set then
2.   output(value, "")
3. end if
4. Experiments
In this section, we demonstrate the performance of our proposed algorithm with respect to effectiveness and
efficiency by dealing with big data with uncertainty distribution, including real world data from the UCI machine learning
repository and synthetic data. Performance experiments were run on a cluster of ten computers: six of them each have
four 2.8 GHz cores and 4 GB memory, and the remaining four each have two 2.8 GHz cores and 4 GB memory. Hadoop version
0.20.0 and Java 1.6.0_22 are used as the MapReduce system for all experiments.
Fig. 5. Filter.
Algorithm 5 SamplingMapper
Input: (key: offset in bytes; value: text of a record)
Output: (key: a string representing a pure region; value: identical to value)
1. Set string pureRegion as a null string;
2. for i = 0 to (rules.length − 1) do
3.   if this sample matches rules[i] then
4.     pureRegion ← the string representing this region of rules[i]
5.     output(pureRegion, value)
6.   end if
7. end for
Algorithm 6 SamplingReducer
Input: (key: a string representing a pure region; values: the list of all samples falling in this region)
Output: (key: one random sample of each pure region; value: a null string)
1. Set string samp as a null string;
2. if values.hasNext() then
3.   samp ← values.next()
4.   output(samp, "")
5. end if
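The sampling job of Algorithms 5 and 6 can be sketched together in memory. The sketch below is a simplified stand-in, not the Hadoop implementation: it assumes the region key of each sample has already been computed, uses the pure-region rules of the running example, and all names are illustrative.

```python
# Toy sketch of the sampling job: the mapper emits samples under their
# pure-region key, and the reducer keeps one sample per region, since
# samples in a pure region are equivalent for the classifier.
from collections import defaultdict

pure_rules = {"6268", "3219", "9252", "5846", "4375", "7293"}  # running example

def sampling_mapper(region_key, sample):
    # Emit the sample only if its region matches a pure-region rule.
    if region_key in pure_rules:
        yield region_key, sample

def sampling_reducer(region_key, values):
    # Any member of a pure region is a valid MCS representative.
    return next(iter(values))

records = [("6268", "s4"), ("6268", "s5"), ("6268", "s6"), ("4375", "s7")]
grouped = defaultdict(list)
for key, sample in records:
    for k, v in sampling_mapper(key, sample):
        grouped[k].append(v)
mcs = [sampling_reducer(k, vs) for k, vs in grouped.items()]
print(sorted(mcs))  # ['s4', 's7']
```

One representative survives per pure region, so the output size equals the number of units in the hyper surface, which is exactly the MCS cardinality property discussed in Section 2.2.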
Fig. 6. Sampling.
4.1. Effectiveness
First of all, to illustrate the effectiveness of PSHS more vividly and clearly, the following figures are listed. We use
two data sets from the UCI repository. The Waveform data set has 21 attributes, 3 categories and 5000 samples.
The Poker Hand data set contains 25,010 samples from 10 categories in ten-dimensional space. Both data sets are
transformed into three dimensions by using the method in [5].
The serial MCS computation method mentioned in [7] is executed to obtain the MCS of the Poker Hand data set, which is
then trained by HSC. The trained model of the hyper surface is shown in Fig. 7. Furthermore, we adopt the PSHS algorithm
to obtain the MCS of this data set. For comparison, the MCS of a given sample set obtained by PSHS is denoted by
PMCS, while the MCS obtained by the serial MCS computation method is denoted by MCS (the
same as follows). The PMCS is also used for training, whose hyper surface structure is shown in Fig. 8.
From the two figures above, we can see that the hyper surface structures obtained from MCS and PMCS are exactly the same. Both have only one sample in each unit. No matter which we choose for training, MCS or PMCS, we get the same hyper surface maintaining an identical distribution. The same holds for the Waveform data set; the identical hyper surface structures obtained by its MCS and PMCS are shown in Fig. 9.
For a specific sample set, the Minimal Consistent Subset largely reflects its classification ability. Table 4 shows the classification ability of MCS and PMCS. All the data sets used here are taken from the UCI repository. From this table, we can see that the testing accuracy obtained from PMCS is the same as that obtained from MCS, which means that the PSHS algorithm is fully consistent with the serial MCS computation method.
One notable feature of PSHS, its ability to deal with big data with uncertainty distribution, is shown in Table 5. We obtain the synthetic three-dimensional data by following the approach used in [4], and carry out the actual numerical sampling and classification. The sampling time of PSHS is much shorter than that of the serial MCS computation method, while achieving the same testing accuracy.
4.2. Efficiency
We evaluate the efficiency of our proposed algorithm in terms of speedup, scaleup and sizeup [17] when dealing with big data with uncertainty distribution. We use the Breast Cancer Wisconsin data set from the UCI repository, which contains 699 samples from two different categories. The data set is first transformed into three dimensions using the method in [5], and then replicated to obtain 3 million, 6 million, 12 million, and 24 million samples respectively.
Speedup: In order to measure the speedup, we keep the data set constant and increase the number of cores in the system. More specifically, we first apply the PSHS algorithm on a system of 4 cores, and then gradually increase the number of cores. The number of cores varies from 4 to 32 and the size of the data set increases from 3 million to 24 million. The speedup given by the larger system with m cores is measured as:

Speedup(m) = (run-time on 1 core) / (run-time on m cores)    (4)
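As a quick sanity check on Eq. (4), the ratio can be computed directly; this tiny Python sketch is illustrative only and not part of the paper's Hadoop implementation:

```python
def speedup(t_1core, t_mcores):
    """Eq. (4): run-time on 1 core divided by run-time on m cores."""
    return t_1core / t_mcores

# Linear speedup: if m = 4 cores cut a 100 s job to 25 s, speedup is 4.
assert speedup(100.0, 25.0) == 4.0
```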
The perfect parallel algorithm demonstrates linear speedup: a system with m times the number of cores yields a speedup of m. In practice, linear speedup is very difficult to achieve because of the communication cost and the skew of the slaves: the slowest slave determines the total time needed, so if the slaves do not all finish in the same time, we have this skew problem.
We have performed the speedup evaluation on data sets of different sizes. Fig. 10 demonstrates the results. As the
size of the data set increases, the speedup of PSHS becomes approximately linear, especially when the data set is large, such as 12 million or 24 million samples. We also notice that when the data set is small, such as 3 million samples, the performance of the 32-core system is not significantly better than that of the 16-core system, which does not accord with our intuition. The reason is that the time for processing the 3 million data set is not much greater than the communication time among the nodes and the time occupied by fault tolerance. However, as the data set grows, the processing time occupies the main part, leading to a good speedup performance.
Scaleup: Scaleup measures the ability to grow both the system and the data set size. It is defined as the ability of an m-times larger system to perform an m-times larger job in the same run-time as the original system. The scaleup metric is:

Scaleup(data, m) = (run-time for processing data on 1 core) / (run-time for processing m·data on m cores)    (5)
Fig. 7. Poker Hand data set and hyper surface structure obtained by its MCS.
Fig. 8. PMCS and hyper surface structure obtained by PMCS of Poker Hand data set.
Fig. 9. The hyper surface structures obtained by MCS and PMCS of Waveform data set.
Table 4
Comparison of classification ability.

Data set                      Sample No.   MCS sample No.   PMCS sample No.   MCS accuracy   PMCS accuracy   Sampling ratio
Iris                             150            80               80              100%           100%           53.33%
Wine                             178           129              129              100%           100%           72.47%
Sonar                            208           186              186              100%           100%           89.42%
Wdbc                             569           268              268              100%           100%           47.10%
Pima                             768           506              506             99.21%         99.21%          65.89%
Contraceptive Method Choice     1473          1219             1219              100%           100%           82.76%
Waveform                        5000          4525             4525             99.84%         99.84%          90.50%
Breast Cancer Wisconsin         9002          1243             1243             99.85%         99.85%          13.81%
Poker Hand                    25,010        22,904           22,904             98.29%         98.29%          91.58%
Letter Recognition            20,000        13,668           13,668             90.47%         90.47%          68.34%
Ten Spiral                    33,750          7285             7285              100%           100%           21.59%
Table 5
Performance comparison on synthetic data.

Sample No.    Testing sample No.   MCS sample No.   PMCS sample No.   MCS sampling time   PMCS sampling time   MCS testing accuracy   PMCS testing accuracy
1,250,000         5,400,002            875,924          875,924          14 m 21 s            1 m 49 s               100%                   100%
5,400,002        10,500,000          1,412,358        1,412,358          58 m 47 s            6 m 52 s               100%                   100%
10,500,000       22,800,002          6,582,439        6,582,439        1 h 30 m 51 s          12 m 8 s               100%                   100%
22,800,002       54,000,000         12,359,545       12,359,545        3 h 15 m 37 s         25 m 16 s               100%                   100%
54,000,000       67,500,000         36,582,427       36,582,427        7 h 41 m 35 s         48 m 27 s               100%                   100%
To demonstrate how well PSHS deals with big data with uncertainty distribution when more cores are available, we have performed scalability experiments in which we increase the size of the data set in proportion to the number of cores. The data sets of size 3 million, 6 million, 12 million and 24 million are processed on 4, 8, 16 and 32 cores respectively. Fig. 11 shows the performance results on these data sets. As the data set becomes larger,
Fig. 10. Speedup performance.
Fig. 11. Scaleup performance.
the scalability of PSHS drops slowly. It always maintains a scaleup value higher than 84%. Obviously, the PSHS algorithm scales very well.
Sizeup: Sizeup analysis holds the number of cores in the system constant and grows the size of the data set. Sizeup measures how much longer a given system takes when the data set is m times larger than the original data set. The sizeup metric is defined as follows:

Sizeup(data, m) = (run-time for processing m·data) / (run-time for processing data)    (6)
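Eqs. (5) and (6) are likewise plain run-time ratios; this illustrative Python sketch (not part of the paper's implementation) makes the ideal values explicit:

```python
def scaleup(t_1core_1x, t_mcores_mx):
    """Eq. (5): run-time for the base job on 1 core over the run-time
    for an m-times job on m cores. The ideal value is 1."""
    return t_1core_1x / t_mcores_mx

def sizeup(t_mx_data, t_1x_data):
    """Eq. (6): how much longer an m-times larger data set takes on a
    fixed number of cores. Linear behavior gives sizeup(m) = m."""
    return t_mx_data / t_1x_data

assert scaleup(100.0, 100.0) == 1.0   # perfect scaleup
assert sizeup(300.0, 100.0) == 3.0    # 3x data took 3x time
```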
To measure the sizeup performance, we fix the number of cores to 4, 8, 16 and 32 respectively. Fig. 12 shows the sizeup results on different numbers of cores. When the number of cores is small, such as 4 or 8, the sizeup performances differ little. However, as more cores become available, the sizeup value on 16 or 32 cores decreases significantly compared to that on 4 or 8 cores for the same data sets. The graph demonstrates that PSHS has a very good sizeup performance.
Fig. 12. Sizeup performance.
5. Conclusion
With the advent of the big data era, the demand for processing big data with uncertainty distribution is increasing. In this paper, we present a Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution, which obtains the Minimal Consistent Subset (MCS) of an original sample set whose inherent structure is uncertain. Our experimental evaluation on both real and synthetic data sets showed that our approach not only obtains a hyper surface structure and testing accuracy consistent with the serial algorithm, but also performs efficiently in terms of speedup, scaleup and sizeup. Moreover, our algorithm can process big data with uncertainty distribution efficiently on commodity hardware. It should be noted that PSHS is a universal algorithm, but its behavior may differ considerably under different classification methods. In the future, we will conduct further experiments and refine the parallel algorithm to improve its use of computing resources.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 61035003, 61175052,
61203297), National High-tech R&D Program of China (863 Program) (Nos. 2012AA011003, 2013AA01A606,
2014AA012205).
References
[1] V. Novák, Are fuzzy sets a reasonable tool for modeling vague phenomena?, Fuzzy Sets Syst. 156 (2005) 341–348.
[2] L.A. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[3] D. Dubois, H. Prade, Gradualness, uncertainty and bipolarity: Making sense of fuzzy sets, Fuzzy Sets Syst. 192 (2012) 3–24.
[4] Q. He, Z. Shi, L. Ren, E. Lee, A novel classification method based on hypersurface, Math. Comput. Model. 38 (2003) 395–407.
[5] Q. He, X. Zhao, Z. Shi, Classification based on dimension transposition for high dimension data, Soft Comput. 11 (2007) 329–334.
[6] X. Zhao, Q. He, Z. Shi, Hypersurface classifiers ensemble for high dimensional data sets, in: Advances in Neural Networks, ISNN 2006, Springer, 2006, pp. 1299–1304.
[7] Q. He, X. Zhao, Z. Shi, Minimal consistent subset for hyper surface classification method, Int. J. Pattern Recognit. Artif. Intell. 22 (2008) 95–108.
[8] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (2008) 107–113.
[9] R. Lämmel, Google's MapReduce programming model revisited, Sci. Comput. Program. 70 (2008) 1–30.
[10] P. Hart, The condensed nearest neighbor rule, IEEE Trans. Inf. Theory 14 (1968) 515–516.
[11] V. Cerverón, A. Fuertes, Parallel random search and tabu search for the minimal consistent subset selection problem, in: Randomization and Approximation Techniques in Computer Science, Springer, 1998, pp. 248–259.
[12] B.V. Dasarathy, Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design, IEEE Trans. Syst. Man Cybern. 24 (1994) 511–517.
[13] P.A. Devijver, J. Kittler, On the edited nearest neighbor rule, in: Proc. 5th Int. Conf. on Pattern Recognition, 1980, pp. 72–80.
[14] L.I. Kuncheva, Fitness functions in editing k-NN reference set by genetic algorithms, Pattern Recognit. 30 (1997) 1041–1049.
[15] C. Swonger, Sample set condensation for a condensed nearest neighbor decision rule for pattern recognition, in: Frontiers of Pattern Recognition, 1972, pp. 511–519.
[16] H. Zhang, G. Sun, Optimal reference subset selection for nearest neighbor classification by tabu search, Pattern Recognit. 35 (2002) 1481–1490.
[17] X. Xu, J. Jäger, H.-P. Kriegel, A fast parallel clustering algorithm for large spatial databases, in: High Performance Data Mining, Springer, 2002, pp. 263–290.