

Fuzzy Sets and Systems

    www.elsevier.com/locate/fss

    Parallel sampling from big data with uncertainty distribution

Qing He a, Haocheng Wang a,b,*, Fuzhen Zhuang a, Tianfeng Shang a,b, Zhongzhi Shi a

a Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China

b University of Chinese Academy of Sciences, Beijing 100049, China

    Abstract

Data are inherently uncertain in most applications. Uncertainty is encountered whenever an experiment such as sampling is about to proceed: its result is not known to us in advance and may lead to a variety of potential outcomes. With the rapid development of data collection and distributed storage technologies, big data have become a bigger-than-ever problem, and dealing with big data with uncertainty distribution is one of the most important issues of big data research. In this paper, we propose a Parallel Sampling method based on Hyper Surface for big data with uncertainty distribution, namely PSHS, which adopts the universal concept of the Minimal Consistent Subset (MCS) of Hyper Surface Classification (HSC). Our treatment of uncertainty in sampling from big data rests on three observations: (1) the inherent structure of the original sample set is uncertain to us, (2) the boundary set formed by all possible separating hyper surfaces is a fuzzy set, and (3) the elements of the MCS are themselves uncertain. PSHS is implemented on the MapReduce framework, a current and powerful parallel programming technique used in many fields. Experiments have been carried out on several data sets, including real-world data from the UCI repository and synthetic data. The results show that our algorithm shrinks data sets while maintaining an identical distribution, which is useful for obtaining the inherent structure of the data sets. Furthermore, the evaluation criteria of speedup, scaleup and sizeup validate its efficiency.

© 2014 Elsevier B.V. All rights reserved.

    Keywords: Fuzzy boundary set; Uncertainty; Minimal consistent subset; Sampling; MapReduce

    1. Introduction

In many applications, data contain inherent uncertainty. The uncertainty phenomenon emerges owing to a lack of knowledge about the occurrence of some event. It is encountered when an experiment (sampling, classification, etc.) is to proceed and its result is not known to us; it may also refer to a variety of potential outcomes, ways of solution, etc. [1]. Uncertainty can also arise in categorical data; for example, the inherent structure of a given sample set is uncertain to us. Moreover, the role of each sample in the inherent structure of the sample set is uncertain.

* Corresponding author at: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China.

E-mail addresses: [email protected] (Q. He), [email protected] (H. Wang), [email protected] (F. Zhuang), [email protected] (T. Shang), [email protected] (Z. Shi).

http://dx.doi.org/10.1016/j.fss.2014.01.016

0165-0114/© 2014 Elsevier B.V. All rights reserved.


Fuzzy set theory, developed by Zadeh [2], is a suitable theory that has proved its ability to work in many real applications. It is worth noticing that fuzzy sets are a reasonable mathematical tool for handling uncertainty in data [3].

With the rapid development of data collection and distributed storage technologies, big data have become a bigger-than-ever problem nowadays. Furthermore, there is rapid growth in hybrid studies that connect uncertainty and big data. Dealing with big data with uncertainty distribution is one of the most important issues of big data research. Uncertainty in big data brings an interesting challenge as well as an opportunity. Many state-of-the-art methods can only handle small-scale data sets; therefore, processing big data with uncertainty distribution in parallel is very important.

Sampling techniques, which play a very important role in all classification methods, have attracted a large amount of research in the areas of machine learning and data mining. Furthermore, parallel sampling from big data with uncertainty distribution has become one of the most important tasks in the presence of the enormous amount of uncertain data produced these days.

Hyper Surface Classification (HSC), which is a general classification method based on the Jordan Curve Theorem, was put forward by He et al. [4]. In this method, a model of the hyper surface is obtained by adaptively dividing the sample space in the training process, and the separating hyper surface is then directly used to classify large databases. The data are classified according to whether the number of intersections with a radial is odd or even. It is a novel approach that needs neither a mapping from lower-dimensional space to higher-dimensional space nor a kernel function. HSC can efficiently and accurately classify two- and three-dimensional data. Furthermore, it can be extended to deal with high dimensional data through dimension reduction [5] or ensemble techniques [6].

In order to enhance HSC performance and analyze its generalization ability, the notion of Minimal Consistent Subset (MCS) was applied to the HSC method [7]. The MCS is defined as a consistent subset with a minimum number of elements. For the HSC method, the samples with the same category that fall into the same unit, which covers at most samples from the same category, form an equivalent class. The MCS of HSC is a sample subset obtained by selecting one and only one representative sample from each unit included in the hyper surface. As a result, some samples in the MCS are replaceable while others are not, leading to the uncertainty of elements in the MCS. Different MCSs of the same sample set contain the same number of elements, but the elements may be different samples. One of the most important features of the MCS is that it yields the same classification model as the entire sample set and can almost reflect its classification ability. For a given data set, this feature is useful for obtaining the inherent structure, which is uncertain to us. The MCS corresponds to many real-world situations, such as classroom teaching: the teacher explains at length some examples that form the Minimal Consistent Subset of various types of exercises, and the students, having been inspired, are then able to solve the related exercises. However, the existing serial algorithm can only be performed on a single computer, and it is difficult for this algorithm to handle big data with uncertainty distribution. In this paper, we propose a Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution to get the MCS of the original sample set whose inherent structure is uncertain to us. Experimental results in Section 4 show that PSHS can deal with large scale data sets effectively and efficiently.

Traditional sampling methods on huge amounts of data consume too much time or even cannot be applied to big data due to memory limitations. MapReduce was developed by Google as a software framework for parallel computing in a distributed environment [8,9]. It is used to process large amounts of raw data, such as documents crawled from the web, in parallel. In recent years, many classical data preprocessing, classification and clustering algorithms have been developed on the MapReduce framework. The MapReduce framework is provided with dynamic flexibility support and fault tolerance by Google and Hadoop. In addition, Hadoop can be easily deployed on commodity hardware.

The remainder of the paper is organized as follows. In Section 2, preliminary knowledge is described, including the HSC method, the MCS and MapReduce. Section 3 implements the PSHS algorithm on the MapReduce framework. In Section 4, we show our experimental results and evaluate our parallel algorithm in terms of effectiveness and efficiency. Finally, our conclusions are stated in Section 5.

    2. Preliminaries

    In this section we describe the preliminary knowledge, on which PSHS is based.


    2.1. Hyper surface classification

Hyper Surface Classification (HSC) is a general classification method based on the Jordan Curve Theorem in topology.

Theorem 1 (Jordan Curve Theorem). Let $X$ be a closed set in the $n$-dimensional space $R^n$. If $X$ is homeomorphic to the sphere $S^{n-1}$, then its complement $R^n \setminus X$ has two connected components, one called inside, the other called outside.

According to the Jordan Curve Theorem, a surface can be formed in an n-dimensional space and used as the separating hyper surface. For any given point, the following classification theorem can be used to determine whether the point is inside or outside the separating hyper surface.

Theorem 2 (Classification Theorem). For any given point $x \in R^n \setminus X$, $x$ is inside $X$ $\Leftrightarrow$ the winding number, i.e. the intersecting number between any radial from $x$ and $X$, is odd; and $x$ is outside $X$ $\Leftrightarrow$ the intersecting number between any radial from $x$ and $X$ is even.

The separating hyper surface is directly used to classify the data according to whether the number of intersections with the radial is odd or even [4]. This classification method is direct and convenient. From the two theorems above, $X$ is regarded as the classifier, which divides the space into two parts, and the classification process amounts simply to counting the intersecting number between a radial from the sample point and the classifier $X$. It is a novel approach that has no need of mapping from lower-dimensional space to higher-dimensional space, and HSC has no need of a kernel function. Furthermore, it can directly solve non-linear classification problems via the hyper surface.
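For illustration only, the following short Python sketch (ours, not part of the paper) makes Theorem 2 concrete in two dimensions: a closed polygon plays the role of the separating hyper surface X, and a query point is classified by counting how many times a rightward ray from the point crosses the polygon's edges; an odd count means inside, an even count means outside.

# Illustrative sketch of Theorem 2 in 2D: count ray/boundary crossings.
def crossings(point, polygon):
    """Count intersections between a rightward ray from `point` and the polygon."""
    x, y = point
    count = 0
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:          # crossing lies on the rightward ray
                count += 1
    return count

def classify(point, polygon):
    return "inside" if crossings(point, polygon) % 2 == 1 else "outside"

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(classify((2, 2), square))   # inside  (1 crossing)
print(classify((5, 2), square))   # outside (0 crossings)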

    2.2. Minimal consistent subset

To handle the high computational demands of the nearest neighbor (NN) rule, many efforts have been made to select a representative subset of the original training data, such as the condensed nearest neighbor rule (CNN) presented by Hart [10]. For a sample set, a consistent subset is a subset which, when used as a stored reference set for the NN rule, correctly classifies all of the remaining points in the sample set. The Minimal Consistent Subset (MCS) is defined as a consistent subset with a minimum number of elements. Hart's method indeed ensures consistency, but the condensed subset is not minimal, and it is sensitive to the randomly picked initial selection and to the order of consideration of the input samples. Since then, a lot of work has been done to reduce the size of the condensed subset [11–16]. The MCS of HSC is defined as follows.

For a finite sample set $S$, suppose $\tilde{C}$ is the collection of all subsets of $S$, and $C \subseteq \tilde{C}$ is a disjoint cover set for $S$, such that each element in $S$ belongs to one and only one member of $C$. The MCS is a sample subset formed by choosing one sample and only one sample from each member of the disjoint cover set $C$. For HSC, we call samples $a$ and $b$ equivalent if they belong to the same category and fall into the same unit which covers at most samples from the same category. The points falling into the same unit form an equivalent class. The cover set $C$ is the union of all equivalent classes in the hyper surface $H$. More specifically, let $\mathring{H}$ be the interior of $H$ and let $u$ be a unit in $\mathring{H}$. The MCS of HSC, denoted by $S_{\min}|_H$, is a sample subset formed by selecting one and only one representative sample from each unit included in the hyper surface, i.e.

$$S_{\min}|_H = \bigcup_{u \in \mathring{H}} \{\text{choose one and only one } s \in u\} \qquad (1)$$

The computation method for the MCS of a given sample set is described as follows:

1) Input the samples, containing $k$ categories and $d$ dimensions. Let the samples be distributed within a rectangular region.

2) Divide the rectangular region into $\underbrace{10 \times 10 \times \cdots \times 10}_{d}$ small regions called units.

3) If there are some units containing samples from two or more different categories, then divide them into smaller units repeatedly until each unit covers at most samples from the same category.


    Fig. 1. Fuzzy boundary set.

4) Label each unit with 1, 2, ..., k, according to the category of the samples inside, and unite adjacent units with the same label into a bigger unit.

5) For each sample in the set, locate its position in the model, i.e. figure out which unit it is located in.

6) Combine the samples located in the same unit into one equivalent class; a number of equivalent classes in different layers are then obtained.

7) Pick one sample and only one sample from each equivalent class to form the MCS of HSC.

The algorithm above is not sensitive to the randomly picked initial selection or to the order of consideration of the input samples. Some samples in the MCS are replaceable, while others are not. Some close samples within the same category that fall into the same unit are equivalent to each other in the building of the classifier, and each of them can be picked randomly for the MCS. On the contrary, sometimes there may be only one sample in a unit, and this sample plays a unique role in forming the hyper surface. Hence the outcome of the MCS is uncertain to us.

Note that different division granularities lead to different separating hyper surfaces and inherent structures. As seen in Fig. 1, each boundary denoted by a dotted line (l1, l2, l3, etc.) may be used in the division process, and all the possible separating hyper surfaces form a fuzzy boundary set. The samples in the fuzzy boundary set have different memberships for the separating hyper surface used in the division process. Specifically, the samples lying on dotted line l2 have the maximum membership, i.e. 1, for the separating hyper surface, while the samples lying on dotted lines l1 and l3 have uncertain memberships larger than 0.

For a specific sample set, the MCS almost reflects its classification ability. Any addition to the MCS will not improve the classification ability, while every single deletion from the MCS will lead to a loss in testing accuracy. This feature is useful for obtaining the inherent structure, which is uncertain to us. However, all of these operations must be executed in memory; when dealing with large scale data sets, the existing serial algorithm will encounter the problem of insufficient memory.

    2.3. MapReduce framework

MapReduce, as the framework shown in Fig. 2, is a simplified programming model and computation platform for processing distributed large scale data sets. It specifies the computation in terms of a map and a reduce function. The underlying runtime system automatically parallelizes the computation across large scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

    As its name shows, map and reduce are two basic operations in the model. Users specify a map function that

    processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all

    intermediate values associated with the same intermediate key.

All data processed by MapReduce are in the form of key-value pairs. The execution happens in two phases. In the first phase, the map function is called once for each input record. For each call, it may produce any number of intermediate key-value pairs. A map function takes a single key-value pair and outputs a list of new key-value pairs. The types of the output key and value can be different from those of the input key and value. It can be formalized as:


    Fig. 2. Illustration of the MapReduce framework: the map is applied to all input records, which generates intermediate results that are aggregated

    by the reduce.

map :: (key1, value1) → list(key2, value2)    (2)

In the second phase, these intermediate pairs are sorted and grouped by key2, and the reduce function is called once for each key. Finally, the reduce function is given all associated values for the key and outputs a new list of values. Mathematically, this can be represented as:

reduce :: (key2, list(value2)) → (key3, value3)    (3)

The MapReduce model provides sufficient high-level parallelization. Since the map function only takes a single record, all map operations are independent of each other and fully parallelizable. The reduce function can be executed in parallel on each set of intermediate pairs with the same key.
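As a minimal illustration of this execution model (our sketch, not the paper's implementation, and deliberately library-free rather than using Hadoop), the Python snippet below runs a map phase, groups the intermediate pairs by key, and applies a reduce phase; word count stands in for a real job.

from collections import defaultdict

# Toy single-process imitation of the two-phase MapReduce execution model.
def run_mapreduce(records, map_fn, reduce_fn):
    # Phase 1: map each input record to a list of (key2, value2) pairs.
    grouped = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in map_fn(key1, value1):
            grouped[key2].append(value2)          # shuffle: group by key2
    # Phase 2: reduce each (key2, list(value2)) to a (key3, value3) pair.
    return [reduce_fn(key2, values) for key2, values in grouped.items()]

# Word count as a stand-in job.
def wc_map(offset, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return (word, sum(counts))

records = [(0, "big data with uncertainty"), (25, "big data sampling")]
print(run_mapreduce(records, wc_map, wc_reduce))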

    3. Parallel sampling method based on hyper surface

In this section, the Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution is summarized. First, we give the representation of the hyper surface, inspired by decision trees. Second, we analyze the conversion from the serial parts to the parallel parts of the algorithm. Then we explain in detail how the necessary computations can be formalized as map and reduce operations under the MapReduce framework.

    3.1. Hyper surface representation

In fact, it is difficult to represent a hyper surface of $R^n$ space exactly in the computer. Inspired by decision trees, we can use labeled regions to approximate a hyper surface. All N input features except the class attribute can be considered to be real numbers in the range [0, 1). There is no loss of generality in this step: all physical quantities must have some upper and lower bounds on their range, so suitable linear or non-linear transformations to the interval [0, 1) can always be found. The inputs being in the range [0, 1) means that these real numbers can be expressed as decimal fractions. This is convenient because each successive digit position corresponds to a successive part of the feature space.
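One simple linear transformation of the kind meant here is min-max scaling; the small sketch below (ours, with illustrative bounds of our own choosing) maps a feature column into [0, 1) so that every value can be read as a decimal fraction.

# Min-max scaling into [0, 1); the bounds are illustrative, not from the paper.
def normalize(column, lo, hi, eps=1e-9):
    span = (hi - lo) * (1 + eps)      # keep results strictly below 1.0
    return [(v - lo) / span for v in column]

print(normalize([12.0, 30.0, 47.9], lo=0.0, hi=48.0))  # ~[0.25, 0.625, 0.998]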

Sampling is performed by simultaneously examining the most significant digit (MSD) of each of the N inputs. This either yields the equivalent class directly (a leaf of the tree), or indicates that we must examine the next most significant digit (descend down a branch of the tree) to determine the equivalent class. The next decimal digit then either yields the equivalent class or tells us to examine the following digit, and so on. Thus sampling is equivalent to finding the region (a leaf node of the tree) representing an equivalent class, and picking one sample and only one sample from each such region to form the MCS of HSC. As sampling processes one decimal digit at a time, even numbers with very long expansions, such as 0.873562, are handled with ease, because the minimal number of digits required for successful sampling is usually very small. Before data can be sampled, the decision tree must be constructed as follows:


Table 1
9 samples of 4-dimensional data.

Attribute 1   Attribute 2   Attribute 3   Attribute 4   Category
0.431         0.725         0.614         0.592         1
0.492         0.726         0.653         0.527         2
0.457         0.781         0.644         0.568         1
0.625         0.243         0.672         0.817         2
0.641         0.272         0.635         0.843         2
0.672         0.251         0.623         0.836         2
0.847         0.534         0.278         0.452         1
0.873         0.528         0.294         0.439         2
0.875         0.523         0.295         0.435         2

Table 2
The most significant digits of the 9 samples.

Sample   MSD    Category
s1       4765   1
s2       4765   2
s3       4765   1
s4       6268   2
s5       6268   2
s6       6268   2
s7       8524   1
s8       8524   2
s9       8524   2

1) Input all sample data, and normalize each dimension into [0, 1). The entire feature space is mapped to the inside of a unit hyper-cube, referred to as the root region.

2) Divide the region into sub regions by taking the most significant digit of each of the N inputs. Each arrangement of N decimal digits can be viewed as a sub region.

3) For each sub region, if the samples in it belong to the same class, label it with that class and attach a flag P, which means this region is pure, and construct a leaf node. Otherwise turn to step 4).

4) Label this region with the majority class and attach a flag N, denoting impurity. Then go to step 2) to get the next most significant digits of the input features, until all sub regions become pure.

From the above steps, we get a decision tree that describes the inherent structure of the data set. Every node of this decision tree can be regarded as a rule for classifying unseen data. For example, consider the 4-dimensional sample set shown in Table 1.

As all the samples are already normalized in [0, 1), we can skip the first step. Then, we get the most significant digits of every sample, as shown in Table 2.

The samples falling into region (6268) all belong to category 2, which means region (6268) is pure. So we label (6268) with category 2 and attach a flag P, and a rule (6268,2:P) is generated. Region (4765) has 2 samples of category 1 and 1 sample of category 2. So we label it with category 1 and attach a flag N, leading to a new rule (4765,1:N), and we must further divide it into sub regions. Similarly, for region (8524) we get a rule (8524,2:N) and should likewise divide it in the next iteration.

Table 3 shows the result of getting the next most significant digits of the samples falling in the impure regions. All the sub regions of regions (4765) and (8524) become pure, so we have rules (4375,1:P) and (7293,2:P) for the parent region (8524), and rules (3219,1:P), (9252,2:P) and (5846,1:P) for the parent region (4765). The decision tree can be constructed iteratively in this way. The decision tree that is functionally equivalent to the generated rules is shown in Fig. 3. Note that there is no need to construct the decision tree in memory; the rules can be generated straightforwardly (see the sketch following Fig. 3), which is what we exploit to design the Parallel Sampling method based on Hyper Surface.


Table 3
The next most significant digits.

Sample   MSD    Category
s1       3219   1
s2       9252   2
s3       5846   1
s7       4375   1
s8       7293   2
s9       7293   2

    Fig. 3. An equivalent decision tree of the generated rules.
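To make the construction concrete, the following self-contained Python sketch (ours, not the authors' code) replays the worked example of Tables 1–3: it forms region keys from the i-th decimal digits of the attributes, checks purity, and emits (region, label:flag) rules, recursing on impure regions.

from collections import defaultdict

# Samples from Table 1: four attributes in [0, 1) plus a class label.
samples = [
    ((0.431, 0.725, 0.614, 0.592), 1), ((0.492, 0.726, 0.653, 0.527), 2),
    ((0.457, 0.781, 0.644, 0.568), 1), ((0.625, 0.243, 0.672, 0.817), 2),
    ((0.641, 0.272, 0.635, 0.843), 2), ((0.672, 0.251, 0.623, 0.836), 2),
    ((0.847, 0.534, 0.278, 0.452), 1), ((0.873, 0.528, 0.294, 0.439), 2),
    ((0.875, 0.523, 0.295, 0.435), 2),
]

def digit(value, i):
    """i-th decimal digit of a number in [0, 1) (floating-point edge cases ignored)."""
    return int(value * 10 ** i) % 10

def build_rules(samples, layer=1):
    """Group samples by the digits of the current layer and emit (region, rule) pairs."""
    regions = defaultdict(list)
    for attrs, label in samples:
        key = "".join(str(digit(a, layer)) for a in attrs)
        regions[key].append((attrs, label))
    rules = []
    for key, members in regions.items():
        labels = [label for _, label in members]
        majority = max(set(labels), key=labels.count)
        if len(set(labels)) == 1:                       # pure region -> leaf
            rules.append((key, f"{majority}:P"))
        else:                                           # impure -> divide further
            rules.append((key, f"{majority}:N"))
            rules.extend(build_rules(members, layer + 1))
    return rules

print(build_rules(samples))
# includes ('6268', '2:P'), ('4765', '1:N'), ('3219', '1:P'), ('8524', '2:N'),
# ('7293', '2:P'), ... matching the worked example above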

    3.2. The analysis of MCS from serial to parallel

In the existing serial algorithm, the most common operation is to divide a region containing more than one class into smaller regions and then determine whether each sub region is pure or not. If a sub region is pure, the samples that fall into it will not provide any useful information for constructing other sub regions, so they can be removed from the samples. Determining whether the sub regions sharing the same parent region are pure or not can therefore be executed in parallel. From Section 2 we know that the process of computing the MCS is to construct a multi-branched tree whose function is similar to a decision tree. Therefore, we can construct the tree one layer at a time, from top to bottom, until each leaf node, which represents a region, is pure.

    3.3. The sampling process of PSHS

Following the analysis above, the PSHS algorithm needs three kinds of MapReduce jobs, run in iteration. In the first job, according to the value of each dimension, the map function assigns each sample to the region it belongs to, while the reduce function determines whether a region is pure or not and outputs a string representing the region and its purity attribute. After this job, a layer of the decision tree has been constructed, and we must remove the unnecessary samples that are not useful for constructing the next layer of the decision tree, which is the task of the second job. In the third job, i.e. the sampling job, the task of the map function is to assign each sample to the pure region it belongs to, according to the rules representing pure regions. Since samples in the same pure region are equivalent to each other in the building of the classifier, the reduce function can randomly pick one of them for the MCS. First, we present the details of the first job.

Map Step: The input data set is stored on HDFS (the Hadoop distributed file system) as a sequence file of <key, value> pairs, each of which represents a record in the data set. The key is the byte offset of the record from the start of the data file, and the value is a string containing the content of a sample and its class. The data set is split and globally broadcast to all mappers. The pseudo code of the map function is shown in Algorithm 1. We can pass some parameters to the job before the map function is invoked. For simplicity, we use dim to represent the dimension of the input features except the class attribute, and layer to represent the level of the tree to be constructed.

In Algorithm 1, the main goal is to get the corresponding region a sample belongs to, which is accomplished in steps 3 to 9. A character ':' is appended after getting the digits of each dimension to indicate that a layer is finished.


Algorithm 1 TreeMapper(key, value)
Input: (key: offset in bytes; value: text of a record)
Output: (key: a string representing a region; value: the class label of the input sample)
1. Parse the string value into an array data of size dim and its class label category;
2. Set string outkey as a null string;
3. for i = 1 to layer do
4.   for j = 0 to dim − 1 do
5.     append outkey with getNum(data[j], i)
6.   end for
7.   if i < layer then append outkey with ':'
8.   end if
9. end for
10. output(outkey, category)

Algorithm 2 getNum(num, n)
Input: (num: a value in [0, 1); n: the digit position to extract)
Output: the n-th decimal digit of num, as a character
1. while n > 0 do
2.   num ← num × 10
3.   n ← n − 1
4. end while
5. get the integer part of num and assign it to a variable ret
6. ret ← ret % 10
7. return the corresponding character of ret

We invoke a procedure getNum(num, n) in this process; its function is to get the n-th decimal digit of num, as described in Algorithm 2.
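For reference, here is a direct Python transcription (our sketch, not the authors' code) of getNum together with the way the mapper assembles the layered region key, with ':' separating layers; dim is implicit in the length of the data array and layer mirrors the job parameter described above.

# Sketch of getNum (Algorithm 2) and the region-key construction of Algorithm 1.
def get_num(num, n):
    """Return the n-th decimal digit of num (a value in [0, 1)) as a character."""
    while n > 0:
        num = num * 10
        n -= 1
    return str(int(num) % 10)

def tree_map_key(data, layer):
    """Build the region string for one sample: one digit group per layer, ':'-separated."""
    parts = []
    for i in range(1, layer + 1):
        parts.append("".join(get_num(x, i) for x in data))
    return ":".join(parts)

print(tree_map_key([0.431, 0.725, 0.614, 0.592], layer=2))  # "4765:3219"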

Reduce Step: The input of the reduce function is the data obtained from the map functions of all hosts. In the reduce function, we count the number of samples of each class. If the class labels of all samples in a region are identical, the region is pure; if a region is impure, we label it with the majority category. We first pass all the class labels, named categories, to the job as parameters, to be used in the reduce function. The pseudo code for the reduce function is shown in Algorithm 3. Fig. 4 shows the complete job procedure.
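The purity decision of this reduce step can be sketched in a few lines of Python (our illustration of the logic that Algorithm 3 spells out): count the class labels received for one region key and emit the majority class together with a P/N flag.

from collections import Counter

# Sketch of the purity check: one region key, the list of class labels that fell into it.
def tree_reduce(region_key, class_labels):
    counts = Counter(class_labels)
    majority, max_count = counts.most_common(1)[0]
    purity = "P" if max_count == len(class_labels) else "N"
    return region_key, f"{majority}:{purity}"

print(tree_reduce("4765", [1, 2, 1]))   # ('4765', '1:N')
print(tree_reduce("6268", [2, 2, 2]))   # ('6268', '2:P')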

When the first job has finished, we have a set of regions that cover all the samples. If a region is impure, we must divide it into sub regions until the sub regions are all pure. Hence, if a region is pure, the samples that fall in it are no longer needed and can be removed. The second job can therefore be viewed as a filter whose function is to remove the unnecessary samples that are not useful for constructing the next layer of the decision tree. We read the impure regions into memory before deciding whether a sample should be removed or not, using a variable set to store the impure regions. The second job's mapper is described in Algorithm 4. Hadoop provides a default reduce implementation which simply outputs the result of the mapper, and this is what we adopt in the second job. The complete job procedure can be seen in Fig. 5.

The first and second jobs run iteratively until all the samples have been removed, in other words until all the rules have been generated. We obtain several rule sets, each of which represents a layer of the decision tree. In the sampling job, according to the rules representing pure regions, the map function assigns each sample to the pure region it belongs to. The rules representing pure regions are read into memory before sampling, and a list variable rules is used to store all of them. The pseudo code for the map function of the sampling job is shown in Algorithm 5.

In the reduce function of the sampling job, we randomly pick one sample from each pure region for the MCS. The pseudo code of the reduce function is described in Algorithm 6. Fig. 6 shows the complete job procedure.


Algorithm 3 TreeReducer
Input: (key: a string representing a region; values: the list of class labels of all samples falling in this region)
Output: (key: identical to key; value: the class label of this region plus its purity attribute)
1. Initialize an array count to 0, with size equal to the number of class labels;
2. Initialize a counter totalnum to 0 to record the number of samples in this region;
3. while values.hasNext() do
4.   get a class label c from values.next()
5.   count[c] ← count[c] + 1
6.   totalnum ← totalnum + 1
7. end while
8. find the majority class max from count and its corresponding index i
9. if all samples belong to max, i.e. totalnum = count[i], then
10.   purity ← P
11. else
12.   purity ← N
13. end if
14. construct value as the combination of max and purity
15. output(key, value)

    Fig. 4. Generating a layer of the decision tree.

Algorithm 4 FilterMapper
Input: (key: offset in bytes; value: text of a record)
Output: (key: identical to value if this sample falls in an impure region; value: a null string)
1. if this sample matches a rule in set then
2.   output(value, "")
3. end if

    4. Experiments

In this section, we demonstrate the performance of our proposed algorithm with respect to effectiveness and efficiency by dealing with big data with uncertainty distribution, including real world data from the UCI machine learning repository and synthetic data. Performance experiments were run on a cluster of ten computers: six of them each have four 2.8 GHz cores and 4 GB memory, and the remaining four each have two 2.8 GHz cores and 4 GB memory. Hadoop version 0.20.0 and Java 1.6.0_22 were used as the MapReduce system for all experiments.


    Fig. 5. Filter.

Algorithm 5 SamplingMapper
Input: (key: offset in bytes; value: text of a record)
Output: (key: a string representing a pure region; value: identical to value)
1. Set string pureRegion as a null string;
2. for i = 0 to (rules.length − 1) do
3.   if this sample matches rules[i] then
4.     pureRegion ← the string representing the region of rules[i]
5.     output(pureRegion, value)
6.   end if
7. end for

Algorithm 6 SamplingReducer
Input: (key: a string representing a pure region; values: the list of all samples falling in this region)
Output: (key: one random sample of each pure region; value: a null string)
1. Set string samp as a null string;
2. if values.hasNext() then
3.   samp ← values.next()
4.   output(samp, "")
5. end if

    Fig. 6. Sampling.
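Putting the sampling job into code form, the sketch below (ours; region_of and the pure-region set are illustrative stand-ins for the rule matching of Algorithms 5 and 6) assigns each sample to the pure region it matches and keeps one randomly chosen representative per region, which together form the PMCS.

import random
from collections import defaultdict

# Sketch of the sampling job: map samples to their pure regions, then keep one per region.
def sample_mcs(samples, region_of, pure_regions):
    buckets = defaultdict(list)
    for s in samples:
        region = region_of(s)                 # map side: assign sample to a region
        if region in pure_regions:
            buckets[region].append(s)
    # reduce side: one (randomly chosen) representative per pure region
    return [random.choice(members) for members in buckets.values()]

pure = {"6268"}  # pure-region rule strings (illustrative)
data = [(0.625, 0.243, 0.672, 0.817), (0.641, 0.272, 0.635, 0.843)]
region_of = lambda s: "".join(str(int(v * 10) % 10) for v in s)
print(sample_mcs(data, region_of, pure))   # one representative of region 6268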


    4.1. Effectiveness

First of all, to illustrate the effectiveness of PSHS more vividly and clearly, the following figures are presented. We use two data sets from the UCI repository. The Waveform data set has 21 attributes, 3 categories and 5000 samples. The Poker Hand data set contains 25,010 samples from 10 categories in a ten-dimensional space. Both data sets are transformed into three dimensions by using the method in [5].

The serial MCS computation method mentioned in [7] is executed to obtain the MCS of the Poker Hand data set, which is then trained by HSC. The trained model of the hyper surface is shown in Fig. 7. Furthermore, we adopt the PSHS algorithm to obtain the MCS of this data set. For comparison, the MCS of a given sample set obtained by PSHS is denoted by PMCS, while the MCS obtained by the serial MCS computation method is denoted by MCS (the same below). The PMCS is also used for training, and its hyper surface structure is shown in Fig. 8.

From the two figures above, we can see that the hyper surface structures obtained from the MCS and the PMCS are exactly the same. Both have only one sample in each unit. No matter which we choose for training, MCS or PMCS, we get the same hyper surface maintaining an identical distribution. The same holds for the Waveform data set: the identical hyper surface structures obtained from its MCS and PMCS are shown in Fig. 9.

For a specific sample set, the Minimal Consistent Subset almost reflects its classification ability. Table 4 shows the classification ability of MCS and PMCS. All the data sets used here come from the UCI repository. From this table, we can see that the testing accuracy obtained from the PMCS is the same as that obtained from the MCS, which means that the PSHS algorithm is fully consistent with the serial MCS computation method.

One notable feature of PSHS, the ability to deal with big data with uncertainty distribution, is shown in Table 5. We obtain synthetic three-dimensional data by following the approach used in [4], and carry out the actual numerical sampling and classification. The sampling time of PSHS is much better than that of the serial MCS computation method, while achieving the same testing accuracy.

    4.2. Efficiency

We evaluate the efficiency of our proposed algorithm in terms of speedup, scaleup and sizeup [17] when dealing with big data with uncertainty distribution. We use the Breast Cancer Wisconsin data set from the UCI repository, which contains 699 samples from two different categories. The data set is first transformed into three dimensions by using the method in Ref. [5], and then replicated to obtain 3 million, 6 million, 12 million, and 24 million samples respectively.

Speedup: In order to measure the speedup, we keep the data set constant and increase the number of cores in the system. More specifically, we first apply the PSHS algorithm on a system consisting of 4 cores, and then gradually increase the number of cores. The number of cores varies from 4 to 32 and the size of the data set increases from 3 million to 24 million. The speedup given by a larger system with m cores is measured as:

Speedup(m) = (run-time on 1 core) / (run-time on m cores)    (4)

The perfect parallel algorithm demonstrates linear speedup: a system with m times the number of cores yields a speedup of m. In practice, linear speedup is difficult to achieve because of the communication cost and the skew of the slaves: the slowest slave determines the total time needed, so if not every slave needs the same time we have a skew problem.

We have performed the speedup evaluation on data sets of different sizes. Fig. 10 demonstrates the results. As the size of the data set increases, the speedup of PSHS becomes approximately linear, especially when the data set is large, such as 12 million or 24 million samples. We also notice that when the data set is small, such as 3 million samples, the performance of the 32-core system is not significantly better than that of the 16-core system, which does not accord with our intuition. The reason is that the time for processing the 3-million-sample data set is not much larger than the communication time among the nodes plus the time spent on fault tolerance. However, as the data set grows, the processing time dominates, leading to a good speedup performance.

Scaleup: Scaleup measures the ability to grow both the system and the data set size. It is defined as the ability of an m-times larger system to perform an m-times larger job in the same run-time as the original system. The scaleup metric is:

Scaleup(data, m) = (run-time for processing data on 1 core) / (run-time for processing m·data on m cores)    (5)


    Fig. 7. Poker Hand data set and hyper surface structure obtained by its MCS.


    Fig. 8. PMCS and hyper surface structure obtained by PMCS of Poker Hand data set.


    Fig. 9. The hyper surface structures obtained by MCS and PMCS of Waveform data set.

Table 4
Comparison of classification ability.

Data set                      Sample No.   MCS sample No.   PMCS sample No.   MCS accuracy   PMCS accuracy   Sampling ratio
Iris                          150          80               80                100%           100%            53.33%
Wine                          178          129              129               100%           100%            72.47%
Sonar                         208          186              186               100%           100%            89.42%
Wdbc                          569          268              268               100%           100%            47.10%
Pima                          768          506              506               99.21%         99.21%          65.89%
Contraceptive Method Choice   1473         1219             1219              100%           100%            82.76%
Waveform                      5000         4525             4525              99.84%         99.84%          90.50%
Breast Cancer Wisconsin       9002         1243             1243              99.85%         99.85%          13.81%
Poker Hand                    25,010       22,904           22,904            98.29%         98.29%          91.58%
Letter Recognition            20,000       13,668           13,668            90.47%         90.47%          68.34%
Ten Spiral                    33,750       7285             7285              100%           100%            21.59%

Table 5
Performance comparison on synthetic data.

Sample No.   Testing sample No.   MCS sample No.   PMCS sample No.   MCS sampling time   PMCS sampling time   MCS testing accuracy   PMCS testing accuracy
1,250,000    5,400,002            875,924          875,924           14 m 21 s           1 m 49 s             100%                   100%
5,400,002    10,500,000           1,412,358        1,412,358         58 m 47 s           6 m 52 s             100%                   100%
10,500,000   22,800,002           6,582,439        6,582,439         1 h 30 m 51 s       12 m 8 s             100%                   100%
22,800,002   54,000,000           12,359,545       12,359,545        3 h 15 m 37 s       25 m 16 s            100%                   100%
54,000,000   67,500,000           36,582,427       36,582,427        7 h 41 m 35 s       48 m 27 s            100%                   100%

To demonstrate how well PSHS deals with big data with uncertainty distribution when more cores are available, we have performed scalability experiments where we increase the size of the data set in proportion to the number of cores. The data sets of 3 million, 6 million, 12 million and 24 million samples are processed on 4, 8, 16 and 32 cores respectively. Fig. 11 shows the performance results on these data sets.


    Fig. 10. Speedup performance.

    Fig. 11. Scaleup performance.

As the data set becomes larger, the scalability of PSHS drops slowly. It always maintains a scaleup value higher than 84%. Obviously, the PSHS algorithm scales very well.

Sizeup: Sizeup analysis holds the number of cores in the system constant and grows the size of the data set. Sizeup measures how much longer it takes on a given system when the data set is m times larger than the original data set. The sizeup metric is defined as follows:

Sizeup(data, m) = (run-time for processing m·data) / (run-time for processing data)    (6)
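For concreteness, the three metrics of Eqs. (4)–(6) can be computed from measured run-times as below; the timings in the example are placeholders of our own, not measurements from the paper.

# Speedup, scaleup and sizeup as ratios of run-times; inputs are illustrative.
def speedup(t_1core, t_mcores):
    return t_1core / t_mcores

def scaleup(t_data_1core, t_mdata_mcores):
    return t_data_1core / t_mdata_mcores

def sizeup(t_data, t_mdata):
    return t_mdata / t_data

print(speedup(t_1core=3600, t_mcores=480))               # 7.5 on m = 8 cores
print(scaleup(t_data_1core=3600, t_mdata_mcores=4000))   # 0.9
print(sizeup(t_data=480, t_mdata=3200))                  # ~6.7 for m-times data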

To measure the sizeup performance, we fix the number of cores to 4, 8, 16 and 32 respectively. Fig. 12 shows the sizeup results on different numbers of cores. When the number of cores is small, such as 4 or 8, the sizeup performances differ little. However, as more cores become available, the value of sizeup on 16 or 32 cores decreases significantly compared to that of 4 or 8 cores on the same data sets. The graph demonstrates that PSHS has a very good sizeup performance.


    Fig. 12. Sizeup performance.

    5. Conclusion

With the advent of the big data era, the demand for processing big data with uncertainty distribution is increasing. In this paper, we present a Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution to obtain the Minimal Consistent Subset (MCS) of the original sample set, whose inherent structure is uncertain. Our experimental evaluation on both real and synthetic data sets showed that our approach not only obtains hyper surface structures and testing accuracy consistent with the serial algorithm, but also performs efficiently in terms of speedup, scaleup and sizeup. Besides, our algorithm can process big data with uncertainty distribution on commodity hardware efficiently. It should be noted that PSHS is a universal algorithm, but its features may be very different with different classification methods. We will conduct further experiments and refine the parallel algorithm to improve the usage efficiency of computing resources in the future.

    Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 61035003, 61175052, 61203297) and the National High-tech R&D Program of China (863 Program) (Nos. 2012AA011003, 2013AA01A606, 2014AA012205).

    References

[1] V. Novák, Are fuzzy sets a reasonable tool for modeling vague phenomena?, Fuzzy Sets Syst. 156 (2005) 341–348.
[2] L.A. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[3] D. Dubois, H. Prade, Gradualness, uncertainty and bipolarity: Making sense of fuzzy sets, Fuzzy Sets Syst. 192 (2012) 3–24.
[4] Q. He, Z. Shi, L. Ren, E. Lee, A novel classification method based on hypersurface, Math. Comput. Model. 38 (2003) 395–407.
[5] Q. He, X. Zhao, Z. Shi, Classification based on dimension transposition for high dimension data, Soft Comput. 11 (2007) 329–334.
[6] X. Zhao, Q. He, Z. Shi, Hypersurface classifiers ensemble for high dimensional data sets, in: Advances in Neural Networks – ISNN 2006, Springer, 2006, pp. 1299–1304.
[7] Q. He, X. Zhao, Z. Shi, Minimal consistent subset for hyper surface classification method, Int. J. Pattern Recognit. Artif. Intell. 22 (2008) 95–108.
[8] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (2008) 107–113.
[9] R. Lämmel, Google's MapReduce programming model – revisited, Sci. Comput. Program. 70 (2008) 1–30.
[10] P. Hart, The condensed nearest neighbor rule, IEEE Trans. Inf. Theory 14 (1968) 515–516.
[11] V. Cerverón, A. Fuertes, Parallel random search and tabu search for the minimal consistent subset selection problem, in: Randomization and Approximation Techniques in Computer Science, Springer, 1998, pp. 248–259.


[12] B.V. Dasarathy, Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design, IEEE Trans. Syst. Man Cybern. 24 (1994) 511–517.
[13] P.A. Devijver, J. Kittler, On the edited nearest neighbor rule, in: Proc. 5th Int. Conf. on Pattern Recognition, 1980, pp. 72–80.
[14] L.I. Kuncheva, Fitness functions in editing k-NN reference set by genetic algorithms, Pattern Recognit. 30 (1997) 1041–1049.
[15] C. Swonger, Sample set condensation for a condensed nearest neighbor decision rule for pattern recognition, in: Frontiers of Pattern Recognition, 1972, pp. 511–519.
[16] H. Zhang, G. Sun, Optimal reference subset selection for nearest neighbor classification by tabu search, Pattern Recognit. 35 (2002) 1481–1490.
[17] X. Xu, J. Jäger, H.-P. Kriegel, A fast parallel clustering algorithm for large spatial databases, in: High Performance Data Mining, Springer, 2002, pp. 263–290.
