
Available online at www.sciencedirect.com

ScienceDirect

Fuzzy Sets and Systems 258 (2015) 117–133

www.elsevier.com/locate/fss

Parallel sampling from big data with uncertainty distribution

Qing He a, Haocheng Wang a,b,∗, Fuzhen Zhuang a, Tianfeng Shang a,b, Zhongzhi Shi a

a Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China

b University of Chinese Academy of Sciences, Beijing 100049, China

Available online 24 February 2014

Abstract

Data are inherently uncertain in most applications. Uncertainty is encountered when an experiment such as sampling is to proceed, the result of which is not known to us and may lead to a variety of potential outcomes. With the rapid development of data collection and distributed storage technologies, big data have become a bigger-than-ever problem, and dealing with big data with uncertainty distribution is one of the most important issues of big data research. In this paper, we propose a Parallel Sampling method based on Hyper Surface for big data with uncertainty distribution, namely PSHS, which adopts the universal concept of the Minimal Consistent Subset (MCS) of Hyper Surface Classification (HSC). Our inspiration for handling uncertainties in sampling from big data rests on three observations: (1) the inherent structure of the original sample set is uncertain to us, (2) the boundary set formed by all the possible separating hyper surfaces is a fuzzy set, and (3) the elements of the MCS are themselves uncertain. PSHS is implemented on the MapReduce framework, a current and powerful parallel programming technique used in many fields. Experiments have been carried out on several data sets, including real world data from the UCI repository and synthetic data. The results show that our algorithm shrinks data sets while maintaining an identical distribution, which is useful for obtaining the inherent structure of the data sets. Furthermore, the evaluation criteria of speedup, scaleup and sizeup validate its efficiency.
© 2014 Elsevier B.V. All rights reserved.

Keywords: Fuzzy boundary set; Uncertainty; Minimal consistent subset; Sampling; MapReduce

1. Introduction

In many applications, data contain inherent uncertainty. The uncertainty phenomenon emerges owing to the lack of knowledge about the occurrence of some event. It is encountered when an experiment (sampling, classification, etc.) is to proceed, the result of which is not known to us; it may also refer to a variety of potential outcomes, ways of solution, etc. [1]. Uncertainty can also arise in categorical data; for example, the inherent structure of a given sample set is uncertain to us. Moreover, the role of each sample in the inherent structure of the sample set is uncertain.

* Corresponding author at: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China.

E-mail addresses: [email protected] (Q. He), [email protected] (H. Wang), [email protected] (F. Zhuang), [email protected] (T. Shang), [email protected] (Z. Shi).

http://dx.doi.org/10.1016/j.fss.2014.01.016
0165-0114/© 2014 Elsevier B.V. All rights reserved.

Fuzzy set theory, developed by Zadeh [2], is a suitable theory that has proved its ability to work in many real applications. It is worth noticing that fuzzy sets are a reasonable mathematical tool for handling the uncertainty in data [3].

With the rapid development of data collection and distributed storage technologies, big data have become a bigger-than-ever problem nowadays. Furthermore, there is a rapid growth in hybrid studies that connect uncertainty and big data together, and dealing with big data with uncertainty distribution is one of the most important issues of big data research. Uncertainty in big data brings an interesting challenge as well as an opportunity. Many state-of-the-art methods can only handle small data sets; therefore, parallel processing of big data with uncertainty distribution is very important.

Sampling techniques, which play a very important role in all classification methods, have attracted a great deal of research in the areas of machine learning and data mining. Furthermore, parallel sampling from big data with uncertainty distribution has become one of the most important tasks in the presence of the enormous amount of uncertain data produced these days.

Hyper Surface Classification (HSC), a general classification method based on the Jordan Curve Theorem, was put forward by He et al. [4]. In this method, a model of the hyper surface is obtained by adaptively dividing the sample space in the training process, and the separating hyper surface is then directly used to classify large databases. The data are classified according to whether the number of intersections with a radial is odd or even. It is a novel approach which needs neither a mapping from lower-dimensional space to higher-dimensional space nor a kernel function. HSC can efficiently and accurately classify two- and three-dimensional data. Furthermore, it can be extended to deal with high dimensional data with dimension reduction [5] or ensemble techniques [6].

In order to enhance HSC performance and analyze its generalization ability, the notion of the Minimal Consistent Subset (MCS) has been applied to the HSC method [7]. The MCS is defined as a consistent subset with a minimum number of elements. For the HSC method, the samples with the same category that fall into the same unit, where each unit covers at most samples from one category, make up an equivalence class. The MCS of HSC is a sample subset combined by selecting one and only one representative sample from each unit included in the hyper surface. As a result, some samples in the MCS are replaceable, while others are not, leading to the uncertainty of the elements in the MCS. Different MCSs include the same number of elements, but the elements may be different samples. One of the most important features of the MCS is that it has the same classification model as the entire sample set and can almost reflect its classification ability. For a given data set, this feature is useful for obtaining the inherent structure, which is uncertain to us. The MCS corresponds to many real world situations, such as classroom teaching: the teacher explains at length some examples that form a minimal consistent subset of various types of exercises, and the students, having been inspired, will be able to solve the related exercises. However, the existing serial algorithm can only be performed on a single computer, and it is difficult for this algorithm to handle big data with uncertainty distribution. In this paper, we propose a Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution to get the MCS of the original sample set, whose inherent structure is uncertain to us. Experimental results in Section 4 show that PSHS can deal with large scale data sets effectively and efficiently.

Traditional sampling methods on huge amounts of data consume too much time or even cannot be applied to big data due to memory limitations. MapReduce was developed by Google as a software framework for parallel computing in a distributed environment [8,9]. It is used to process large amounts of raw data, such as documents crawled from the web, in parallel. In recent years, many classical data preprocessing, classification and clustering algorithms have been implemented on the MapReduce framework. The MapReduce framework is provided with dynamic flexibility support and fault tolerance by Google and Hadoop. In addition, Hadoop can be easily deployed on commodity hardware.

The remainder of the paper is organized as follows. In Section 2, preliminary knowledge is described, including the HSC method, the MCS and MapReduce. Section 3 implements the PSHS algorithm based on the MapReduce framework. In Section 4, we show our experimental results and evaluate our parallel algorithm in terms of effectiveness and efficiency. Finally, our conclusions are stated in Section 5.

2. Preliminaries

In this section we describe the preliminary knowledge, on which PSHS is based.

2.1. Hyper surface classification

Hyper Surface Classification (HSC) is a general classification method based on the Jordan Curve Theorem in topology.

Theorem 1 (Jordan Curve Theorem). Let X be a closed set in the n-dimensional space R^n. If X is homeomorphic to the sphere S^(n−1), then its complement R^n \ X has two connected components, one called the inside, the other called the outside.

According to the Jordan Curve Theorem, a surface can be formed in an n-dimensional space and used as the separating hyper surface. For any given point, the following classification theorem can be used to determine whether the point is inside or outside the separating hyper surface.

Theorem 2 (Classification Theorem). For any given point x ∈ R^n \ X, x is inside X ⇔ the winding number, i.e. the number of intersections between any radial from x and X, is odd; and x is outside X ⇔ the number of intersections between any radial from x and X is even.

The separating hyper surface is directly used to classify the data according to whether the number of intersections with the radial is odd or even [4]. This is a direct and convenient classification method. From the two theorems above, X is regarded as the classifier, which divides the space into two parts, and the classification process is very easy: just count the number of intersections between a radial from the sample point and the classifier X. It is a novel approach that needs neither a mapping from lower-dimensional space to higher-dimensional space nor a kernel function. Furthermore, it can directly solve non-linear classification problems via the hyper surface.
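To make Theorem 2 concrete, the following minimal two-dimensional sketch applies the even-odd rule, assuming the separating surface is given as a closed polygon; the function and variable names are illustrative and not taken from the paper.

```python
def inside_by_parity(point, closed_curve):
    """Even-odd rule of Theorem 2: cast a horizontal ray to the right of
    `point` and count how many polygon edges it crosses. An odd count
    means the point is inside, an even count means outside."""
    x, y = point
    crossings = 0
    n = len(closed_curve)
    for i in range(n):
        (x1, y1), (x2, y2) = closed_curve[i], closed_curve[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge straddles the ray's line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:                            # crossing lies on the rightward ray
                crossings += 1
    return crossings % 2 == 1                          # odd => inside, even => outside

# Unit square as the "separating hyper surface" in the plane
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(inside_by_parity((0.5, 0.5), square))   # True  (odd number of crossings)
print(inside_by_parity((1.5, 0.5), square))   # False (even number of crossings)
```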

2.2. Minimal consistent subset

To handle the high computational demands of the nearest neighbor (NN) rule, many efforts have been made to select a representative subset of the original training data, such as the "condensed nearest neighbor rule" (CNN) presented by Hart [10]. For a sample set, a consistent subset is a subset which, when used as a stored reference set for the NN rule, correctly classifies all of the remaining points in the sample set. The Minimal Consistent Subset (MCS) is defined as a consistent subset with a minimum number of elements. Hart's method indeed ensures consistency, but the condensed subset is not minimal, and it is sensitive to the randomly picked initial selection and to the order of consideration of the input samples. Since then, a lot of work has been done to reduce the size of the condensed subset [11–16]. The MCS of HSC is defined as follows.

For a finite sample set S, suppose C is the collection of all its subsets, and C′ ⊆ C is a disjoint cover set for S, such that each element in S belongs to one and only one member of C′. The MCS is a sample subset combined by choosing one and only one sample from each member of the disjoint cover set C′. For HSC, we call samples a and b equivalent if they belong to the same category and fall into the same unit, where each unit covers at most samples from the same category. The points falling into the same unit form an equivalence class. The cover set C′ is the union of all equivalence classes in the hyper surface H. More specifically, let H̄ be the interior of H and let u be a unit in H̄. The MCS of HSC, denoted by S_min|H, is a sample subset combined by selecting one and only one representative sample from each unit included in the hyper surface, i.e.

S_min|H = ⋃_{u ⊆ H̄} {choosing one and only one s ∈ u}    (1)

The computation method for the MCS of a given sample set is described as follows:

1) Input the samples, containing k categories and d dimensions. Let the samples be distributed within a rectangular region.

2) Divide the rectangular region into 10 × 10 × ··· × 10 (d times) small regions called units.

3) If there are some units containing samples from two or more different categories, then divide them into smaller units repeatedly until each unit covers at most samples from the same category.

Fig. 1. Fuzzy boundary set.

4) Label each unit with 1, 2, ..., k, according to the category of the samples inside, and unite adjacent units with the same label into a bigger unit.

5) For each sample in the set, locate its position in the model, i.e. figure out which unit it is located in.

6) Combine the samples that are located in the same unit into one equivalence class; a number of equivalence classes in different layers are thus obtained.

7) Pick up one and only one sample from each equivalence class to form the MCS of HSC.

The algorithm above is not sensitive to the randomly picked initial selection or to the order of consideration of the input samples. Some samples in the MCS are replaceable, while others are not. Some close samples within the same category that fall into the same unit are equivalent to each other in the building of the classifier, and each of them can be picked randomly for the MCS. On the contrary, sometimes there is only one sample in a unit, and this sample plays a unique role in forming the hyper surface. Hence the outcome of the MCS is uncertain to us.
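As an illustration of steps 1)–7), the following in-memory Python sketch computes an MCS by recursive decimal-grid division; it assumes the samples are already scaled to [0,1), uses two-dimensional toy data, and all function names are ours rather than the authors'.

```python
import random
from collections import defaultdict

def cell_key(sample, level):
    """Decimal-digit cell index of `sample` at the given division level.
    Level 1 uses the first digit after the decimal point of every feature
    (a 10 x 10 x ... x 10 grid), level 2 the first two digits, and so on."""
    return tuple(int(v * 10 ** level) for v in sample)

def minimal_consistent_subset(samples, labels, level=1):
    """Pick one representative per pure unit; subdivide impure units."""
    cells = defaultdict(list)
    for s, y in zip(samples, labels):
        cells[cell_key(s, level)].append((s, y))
    mcs = []
    for members in cells.values():
        classes = {y for _, y in members}
        if len(classes) == 1:                      # pure unit: keep one sample
            mcs.append(random.choice(members))     # any member is equivalent
        else:                                      # impure: divide further
            subs, subl = zip(*members)
            mcs.extend(minimal_consistent_subset(list(subs), list(subl), level + 1))
    return mcs

samples = [(0.43, 0.72), (0.49, 0.72), (0.45, 0.78), (0.62, 0.24)]
labels  = [1, 1, 1, 2]
print(minimal_consistent_subset(samples, labels))  # one representative per pure unit
```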

Note that different division granularities lead to different separating hyper surfaces and inherent structures. As seen in Fig. 1, each boundary denoted by a dotted line (l1, l2, l3, etc.) may be used in the division process, and all the possible separating hyper surfaces form a fuzzy boundary set. The samples in the fuzzy boundary set have different memberships for the separating hyper surface used in the division process. Specifically, the samples lying on dotted line l2 have the maximum membership, i.e. 1, for the separating hyper surface, while the samples lying on dotted lines l1 and l3 have uncertain memberships larger than 0.

For a specific sample set, the MCS almost reflects its classification ability. Any addition to the MCS will not improve the classification ability, while any single deletion from the MCS will lead to a loss in testing accuracy. This feature is useful for obtaining the inherent structure, which is uncertain to us. However, all of these operations have to be executed in memory, so when dealing with large scale data sets the existing serial algorithm will run into insufficient memory.

2.3. MapReduce framework

MapReduce, whose framework is shown in Fig. 2, is a simplified programming model and computation platform for processing distributed large scale data sets. It specifies the computation in terms of a map and a reduce function. The underlying runtime system automatically parallelizes the computation across large scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

As its name shows, map and reduce are the two basic operations in the model. Users specify a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

All data processed by MapReduce are in the form of key-value pairs. The execution happens in two phases. In the first phase, the map function is called once for each input record. For each call, it may produce any number of intermediate key-value pairs. A map function is used to take a single key-value pair and output a list of new key-value pairs. The types of the output key and value can be different from those of the input key and value. This can be formalized as:

Fig. 2. Illustration of the MapReduce framework: the "map" is applied to all input records, which generates intermediate results that are aggregated by the "reduce".

map :: (key1, value1) ⇒ list(key2, value2) (2)

In the second phase, these intermediate pairs are sorted and grouped by key2, and the reduce function is called once for each key. Finally, the reduce function is given all associated values for the key and outputs a new list of values. Mathematically, this can be represented as:

reduce :: (key2, list(value2)) ⇒ (key3, value3) (3)

The MapReduce model provides sufficient high-level parallelization. Since the map function only takes a single record, all map operations are independent of each other and fully parallelizable. The reduce function can be executed in parallel on each set of intermediate pairs with the same key.
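The dataflow of formulas (2) and (3) can be mimicked on a single machine with a few lines of Python; this toy simulation (word count as the classic example) only illustrates the map/shuffle/reduce phases and none of Hadoop's distribution or fault tolerance.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process illustration of the two MapReduce phases."""
    # Phase 1: map every input record to a list of (key2, value2) pairs
    intermediate = [pair for rec in records for pair in map_fn(*rec)]
    # Shuffle: sort and group the intermediate pairs by key2
    intermediate.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in g])
               for k, g in groupby(intermediate, key=itemgetter(0)))
    # Phase 2: reduce every (key2, list(value2)) group to (key3, value3)
    return [reduce_fn(k, vs) for k, vs in grouped]

# Word count as the classic example
docs = [(1, "big data big"), (2, "uncertain data")]
mapper = lambda doc_id, text: [(w, 1) for w in text.split()]
reducer = lambda word, counts: (word, sum(counts))
print(run_mapreduce(docs, mapper, reducer))
# [('big', 2), ('data', 2), ('uncertain', 1)]
```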

3. Parallel sampling method based on hyper surface

In this section, the Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution is summarized. Firstly, we give a representation of the hyper surface inspired by decision trees. Secondly, we analyze the conversion of the algorithm from serial to parallel. Then we explain in detail how the necessary computations can be formalized as map and reduce operations under the MapReduce framework.

3.1. Hyper surface representation

In fact, it is difficult to exactly represent a hyper surface of R^n space in the computer. Inspired by decision trees, we can use a set of labeled regions to approximate a hyper surface. All N input features except the class attribute can be considered to be real numbers in the range [0,1). There is no loss of generality in this step: all physical quantities have some upper and lower bounds on their range, so a suitable linear or non-linear transformation to the interval [0,1) can always be found. The inputs being in the range [0,1) means that these real numbers can be expressed as decimal fractions. This is convenient because each successive digit position corresponds to a successive part of the feature space.
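For instance, one common way to realize such a linear transformation is min–max scaling; the small epsilon that keeps the maximum strictly below 1 is our implementation detail, not something prescribed by the paper.

```python
def scale_to_unit_interval(column, eps=1e-9):
    """Linearly map a list of feature values into [0, 1)."""
    lo, hi = min(column), max(column)
    span = (hi - lo) + eps          # eps keeps the maximum strictly below 1
    return [(v - lo) / span for v in column]

print(scale_to_unit_interval([12.0, 30.0, 21.0]))  # roughly [0.0, 0.999..., 0.5]
```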

Sampling is performed by simultaneously examining the most significant digit (MSD) of each of the N inputs. This either yields the equivalence class directly (a leaf of the tree), or indicates that we must examine the next most significant digit (descend down a branch of the tree) to determine the equivalence class. The next decimal digit then either yields the equivalence class, or tells us to examine the following digit, and so on. Thus sampling is equivalent to finding the region (a leaf node of the tree) representing an equivalence class, and picking up one and only one sample from each region to form the MCS of HSC. As sampling occurs one decimal digit at a time, even numbers with very long digit expansions, such as 0.873562, are handled with ease, because the number of digits required for successful sampling is usually very small. Before data can be sampled, the decision tree must be constructed as follows:

Table 1
9 samples of 4-dimensional data.

Attribute 1  Attribute 2  Attribute 3  Attribute 4  Category
0.431        0.725        0.614        0.592        1
0.492        0.726        0.653        0.527        2
0.457        0.781        0.644        0.568        1
0.625        0.243        0.672        0.817        2
0.641        0.272        0.635        0.843        2
0.672        0.251        0.623        0.836        2
0.847        0.534        0.278        0.452        1
0.873        0.528        0.294        0.439        2
0.875        0.523        0.295        0.435        2

Table 2
The most significant digits of the 9 samples.

Sample  MSD   Category
s1      4765  1
s2      4765  2
s3      4765  1
s4      6268  2
s5      6268  2
s6      6268  2
s7      8524  1
s8      8524  2
s9      8524  2

1) Input all sample data, and normalize each dimension to [0,1). The entire feature space is mapped to the inside of a unit hyper-cube, referred to as the root region.

2) Divide the region into sub regions by taking the most significant digit of each of the N inputs. Each arrangement of the N decimal digits can be viewed as a sub region.

3) For each sub region, if the samples in it belong to the same class, label it with that class and attach a flag 'P', which means the region is pure and a leaf node can be constructed. Otherwise go to step 4).

4) Label the region with the majority class and attach a flag 'N', denoting impurity. Then go to step 2) to take the next most significant digits of the input features, until all sub regions become pure.

From the above steps, we obtain a decision tree that describes the inherent structure of the data set. Every node of this decision tree can be regarded as a rule for classifying unseen data. As an example, consider the 4-dimensional sample set shown in Table 1.

As all the samples have already been normalized to [0,1), we can skip the first step. Then we take the most significant digits of every sample, as shown in Table 2.

The samples falling into region (6268) all belong to category 2, which means region (6268) is pure. So we label (6268) with category 2 and attach a flag 'P', and the rule (6268,2:P) is generated. Region (4765) has 2 samples of category 1 and 1 sample of category 2, so we label it with category 1 and attach a flag 'N', leading to a new rule (4765,1:N), and we must further divide it into sub regions. Similarly, for region (8524) we get the rule (8524,2:N) and must also divide it in the next round.

Table 3 shows the result of taking the next most significant digits of the samples falling in the impure regions. All the sub regions of regions (4765) and (8524) become pure, so we have rules (4375,1:P) and (7293,2:P) for the parent region (8524), and rules (3219,1:P), (9252,2:P) and (5846,1:P) for the parent region (4765). In this way the decision tree can be constructed iteratively. The decision tree equivalent to the generated rules is shown in Fig. 3. Notice that there is no need to construct the decision tree in memory; the rules can be generated straightforwardly, which is the basis for designing the Parallel Sampling method based on Hyper Surface.
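The layer-by-layer rule generation of this example can be reproduced with the short Python sketch below; region keys carry the parent prefix joined by ':', as in Algorithm 1 of Section 3.3, rather than the shorthand used in the text, and all helper names are ours.

```python
from collections import Counter, defaultdict

samples = [  # Table 1: four attributes plus a class label
    ((0.431, 0.725, 0.614, 0.592), 1), ((0.492, 0.726, 0.653, 0.527), 2),
    ((0.457, 0.781, 0.644, 0.568), 1), ((0.625, 0.243, 0.672, 0.817), 2),
    ((0.641, 0.272, 0.635, 0.843), 2), ((0.672, 0.251, 0.623, 0.836), 2),
    ((0.847, 0.534, 0.278, 0.452), 1), ((0.873, 0.528, 0.294, 0.439), 2),
    ((0.875, 0.523, 0.295, 0.435), 2),
]

def digit(value, n):
    """n-th decimal digit of a value in [0, 1), read from the decimal string
    to avoid floating-point rounding, e.g. digit(0.431, 1) == 4."""
    return int(f"{value:.10f}".split('.')[1][n - 1])

def region_key(features, layer):
    """Digit string of a sample down to `layer`, layers separated by ':'."""
    return ':'.join(''.join(str(digit(v, i)) for v in features)
                    for i in range(1, layer + 1))

rules, layer, remaining = [], 1, samples
while remaining:
    regions = defaultdict(list)
    for feats, label in remaining:
        regions[region_key(feats, layer)].append((feats, label))
    still_impure = []
    for key, members in regions.items():
        counts = Counter(label for _, label in members)
        majority, _ = counts.most_common(1)[0]
        if len(counts) == 1:                 # pure region -> leaf rule 'P'
            rules.append((key, majority, 'P'))
        else:                                # impure -> rule 'N', keep its samples
            rules.append((key, majority, 'N'))
            still_impure.extend(members)
    remaining, layer = still_impure, layer + 1

for r in rules:
    print(r)   # e.g. ('4765', 1, 'N'), ('6268', 2, 'P'), ('4765:3219', 1, 'P'), ...
```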

Table 3
The next most significant digits.

Sample  MSD   Category
s1      3219  1
s2      9252  2
s3      5846  1
s7      4375  1
s8      7293  2
s9      7293  2

Fig. 3. An equivalent decision tree of the generated rules.

3.2. The analysis of MCS from serial to parallel

In the existing serial algorithm, the most common operation is to divide a region containing more than one class into smaller regions and then determine whether each sub region is pure or not. If a sub region is pure, the samples that fall into it provide no useful information for constructing other sub regions, so they can be removed from the sample set. Hence, determining whether the sub regions sharing the same parent region are pure or not can be executed in parallel. From Section 2 we know that computing the MCS amounts to constructing a multi-branched tree whose function is similar to that of a decision tree. Therefore, we can construct one layer of the tree at a time, from top to bottom, until each leaf node, which represents a region, is pure.

3.3. The sampling process of PSHS

Following the analysis above, the PSHS algorithm needs three kinds of MapReduce job, run in iteration. In the first job, according to the value of each dimension, the map function assigns each sample to the region it belongs to, while the reduce function determines whether a region is pure or not and outputs a string representing the region together with its purity attribute. After this job, one layer of the decision tree has been constructed, and we must remove the samples that are not useful for constructing the next layer of the decision tree, which is the task of the second job. In the third job, i.e. the sampling job, the task of the map function is to assign each sample to the pure region it belongs to, according to the rules representing pure regions. Since samples in the same pure region are equivalent to each other in the building of the classifier, the reduce function can randomly pick one of them for the MCS. We first present the details of the first job.

Map Step: The input data set is stored on HDFS, the file system of Hadoop, as a sequence file of 〈key, value〉 pairs, each of which represents a record in the data set. The key is the offset in bytes of the record from the start of the data file, and the value is a string containing the content of a sample and its class. The data set is split and globally broadcast to all mappers. The pseudo code of the map function is shown in Algorithm 1. Some parameters can be passed to the job before the map function is invoked. For simplicity, we use dim to denote the dimension of the input features excluding the class attribute, and layer to denote the level of the tree to be constructed.

In Algorithm 1, the main goal is to get the region a sample belongs to, which is accomplished from step 3 to step 9. A character ':' is appended after getting the digits of each dimension to indicate that a layer is finished.

Algorithm 1 TreeMapper(key, value)
Input: (key: offset in bytes; value: text of a record)
Output: (key': a string representing a region; value': the class label of the input sample)
1. Parse the string value into an array data of size dim and its class label category;
2. Set string outkey as a null string;
3. for i = 1 to layer do
4.   for j = 0 to dim − 1 do
5.     append outkey with getNum(data[j], i)
6.   end for
7.   if i < layer then
8.     append outkey with ':'
9.   end if
10. end for
11. output(outkey, category)

Algorithm 2 getNum(num, n)
Input: (num: a double variable in [0,1); n: an integer)
Output: a character representing the n-th digit after the decimal point.
1. i ← n
2. while i > 0 do
3.   num ← num × 10
4.   i ← i − 1
5. end while
6. get the integer part of num and assign it to a variable ret
7. ret ← ret % 10
8. return the corresponding character of ret

We invoke the procedure getNum(num, n) in this process; its function is to get the n-th digit of num, as described in Algorithm 2.
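For readers who prefer executable code, the following plain-Python rendering of Algorithms 1 and 2 shows how the region key is built outside Hadoop; dim and layer are passed as ordinary arguments instead of job parameters, and this is only an illustrative sketch, not the authors' Hadoop code.

```python
def get_num(num, n):
    """Algorithm 2: the n-th digit of `num` in [0, 1) after the decimal point."""
    for _ in range(n):
        num *= 10
    return str(int(num) % 10)

def tree_mapper(value, dim, layer):
    """Algorithm 1: emit (region key, class label) for one input record.

    `value` is the text of a record: dim feature values followed by the label.
    The key concatenates the digits of every dimension, one group per layer,
    with ':' separating the layers."""
    fields = value.split()
    data, category = [float(x) for x in fields[:dim]], fields[dim]
    outkey = ''
    for i in range(1, layer + 1):
        outkey += ''.join(get_num(data[j], i) for j in range(dim))
        if i < layer:
            outkey += ':'
    return outkey, category

print(tree_mapper("0.431 0.725 0.614 0.592 1", dim=4, layer=2))
# ('4765:3219', '1')
```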

Reduce Step: The input of the reduce function is the data obtained from the map function of each host. In the reduce function, we count the number of samples of each class. If the class labels of all samples in a region are identical, the region is pure; if a region is impure, we label it with the majority category. First we pass all the class labels, named categories, to the job as parameters, which will be used in the reduce function. The pseudo code of the reduce function is shown in Algorithm 3. Fig. 4 shows the complete job procedure.

When the first job has finished, we get a set of regions that cover all the samples. If a region is impure, we must divide it into sub regions until the sub regions are all pure. Hence, if a region is pure, the samples that fall in it are no longer needed and can be removed. Therefore, the second job can be regarded as a filter whose function is to remove the samples that are not useful for constructing the next layer of the decision tree. The impure regions must be read into memory before we can decide whether a sample should be removed or not; we use a variable set to store them. The second job's mapper is described in Algorithm 4. Hadoop provides a default reduce implementation which simply outputs the result of the mapper, and this is what we adopt in the second job. The complete job procedure is shown in Fig. 5.

The first and second jobs run iteratively until all the samples have been removed, in other words until all the rules have been generated. We then have several rule sets, each of which represents a layer of the decision tree. In the sampling job, according to the rules representing pure regions, the map function assigns each sample to the pure region it belongs to. The rules representing pure regions are read into memory before sampling; a list variable rules is used to store them. The pseudo code of the map function of the sampling job is shown in Algorithm 5.

In the reduce function of the sampling job, we randomly pick one sample from each pure region for the MCS. The pseudo code of the reduce function is described in Algorithm 6. Fig. 6 shows the complete job procedure.
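The essence of the sampling job can likewise be sketched outside Hadoop as two small functions; the pure-region rules below are those of the Section 3.1 example written in the layered key format, and the matching logic is our reading of "this sample matches rules[i]" in Algorithm 5.

```python
import random

def region_key(features, layer):
    """Digit-string key of a sample down to `layer` (cf. Algorithm 1)."""
    digits = lambda v, n: f"{v:.10f}".split('.')[1][n - 1]
    return ':'.join(''.join(digits(v, i) for v in features)
                    for i in range(1, layer + 1))

def sampling_mapper(record, pure_rules):
    """Algorithm 5: emit (pure region, record) for the region the record falls in."""
    fields = record.split()
    features = [float(v) for v in fields[:-1]]       # last field is the class label
    for rule in pure_rules:                          # e.g. '4765:3219'
        layer = rule.count(':') + 1
        if region_key(features, layer) == rule:
            return rule, record
    return None                                      # no match once all rules cover the data

def sampling_reducer(region, records):
    """Algorithm 6: keep one (arbitrary) sample per pure region for the MCS."""
    return random.choice(records), ""

pure_rules = ['6268', '4765:3219', '4765:9252', '4765:5846', '8524:4375', '8524:7293']
print(sampling_mapper("0.625 0.243 0.672 0.817 2", pure_rules))
# ('6268', '0.625 0.243 0.672 0.817 2')
```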

Algorithm 3 TreeReducer
Input: (key: a string representing a region; values: the list of class labels of all samples falling in this region)
Output: (key': identical to key; value': the class label of this region plus its purity attribute)
1. Initialize an array count, of size equal to the number of class labels, to 0;
2. Initialize a counter totalnum to 0 to record the number of samples in this region;
3. while values.hasNext() do
4.   get a class label c from values.next()
5.   count[c]++
6.   totalnum++
7. end while
8. find the majority class max in count and its corresponding index i
9. if all samples belong to max, i.e. totalnum = count[i], then
10.   purity ← 'P'
11. else
12.   purity ← 'N'
13. end if
14. construct value' by combining max and purity
15. output(key, value')

Fig. 4. Generating a layer of the decision tree.

Algorithm 4 FilterMapper
Input: (key: offset in bytes; value: text of a record)
Output: (key': identical to value if this sample falls in an impure region; value': a null string)
1. if this sample matches a rule in set then
2.   output(value, "")
3. end if

4. Experiments

In this section, we demonstrate the performance of the proposed algorithm with respect to effectiveness and efficiency by dealing with uncertainty distribution big data, including real world data from the UCI machine learning repository and synthetic data. Performance experiments were run on a cluster of ten computers: six of them each have four 2.8 GHz cores and 4 GB memory, and the remaining four each have two 2.8 GHz cores and 4 GB memory. Hadoop version 0.20.0 and Java 1.6.0_22 are used as the MapReduce system for all experiments.

Fig. 5. Filter.

Algorithm 5 SamplingMapper
Input: (key: offset in bytes; value: text of a record)
Output: (key': a string representing a pure region; value': identical to value)
1. Set string pureRegion as a null string;
2. for i = 0 to (rules.length − 1) do
3.   if this sample matches rules[i] then
4.     pureRegion ← the string representing the region of rules[i]
5.     output(pureRegion, value)
6.   end if
7. end for

Algorithm 6 SamplingReducer
Input: (key: a string representing a pure region; values: the list of all samples falling in this region)
Output: (key': one random sample of each pure region; value': a null string)
1. Set string samp as a null string;
2. if values.hasNext() then
3.   samp ← values.next()
4.   output(samp, "")
5. end if

Fig. 6. Sampling.

4.1. Effectiveness

First of all, to illustrate the effectiveness of PSHS more vividly and clearly, the following figures are presented. We use two data sets from the UCI repository. The Waveform data set has 21 attributes, 3 categories and 5000 samples. The Poker Hand data set contains 25,010 samples from 10 categories in a ten-dimensional space. Both data sets are transformed into three dimensions by using the method in [5].

The serial MCS computation method mentioned in [7] is executed to obtain the MCS of the Poker Hand data set, which is then trained by HSC. The trained hyper surface model is shown in Fig. 7. Furthermore, we adopt the PSHS algorithm to obtain the MCS of this data set. For comparison, the MCS of a given sample set obtained by PSHS is denoted by PMCS, while the MCS obtained by the serial MCS computation method is denoted by MCS (the same below). The PMCS is also used for training, and the resulting hyper surface structure is shown in Fig. 8.

From the two figures above, we can see that the hyper surface structures obtained from MCS and PMCS are exactly the same. They both have only one sample in each unit. No matter which we choose for training, MCS or PMCS, we get the same hyper surface maintaining an identical distribution. The same holds for the Waveform data set, whose identical hyper surface structures obtained from its MCS and PMCS are shown in Fig. 9.

For a specific sample set, the Minimal Consistent Subset almost reflects its classification ability. Table 4 shows the classification ability of MCS and PMCS. All the data sets used here are taken from the UCI repository. From this table, we can see that the testing accuracy obtained from PMCS is the same as that obtained from MCS, which means that the PSHS algorithm is fully consistent with the serial MCS computation method.

One notable feature of PSHS, the ability to deal with uncertainty distribution big data, is shown in Table 5. We obtained the synthetic three-dimensional data by following the approach used in [4], and carried out the actual numerical sampling and classification. The sampling time of PSHS is much better than that of the serial MCS computation method, yet it achieves the same testing accuracy.

4.2. Efficiency

We evaluate the efficiency of our proposed algorithm in terms of speedup, scaleup and sizeup [17] when dealing with uncertainty distribution big data. We use the Breast Cancer Wisconsin data set from the UCI repository, which contains 699 samples from two different categories. The data set is first transformed into three dimensions by using the method in Ref. [5], and then replicated to get 3 million, 6 million, 12 million, and 24 million samples respectively.

Speedup: In order to measure the speedup, we keep the data set constant and increase the number of cores in the system. More specifically, we first apply the PSHS algorithm on a system consisting of 4 cores, and then gradually increase the number of cores. The number of cores varies from 4 to 32 and the size of the data set increases from 3 million to 24 million. The speedup given by the larger system with m cores is measured as:

Speedup(m) = (run-time on 1 core) / (run-time on m cores)    (4)

A perfect parallel algorithm demonstrates linear speedup: a system with m times the number of cores yields a speedup of m. In practice, linear speedup is very difficult to achieve because of the communication cost and the skew of the slaves: the slowest slave determines the total time needed, and if the slaves do not all need the same time we have this skew problem.
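As a quick numerical illustration of formula (4), with made-up run times rather than the measured values reported below:

```python
def speedup(runtime_1_core, runtime_m_cores):
    """Formula (4): run-time on 1 core divided by run-time on m cores."""
    return runtime_1_core / runtime_m_cores

# Hypothetical run times in seconds, not the paper's measurements
print(speedup(4000.0, 560.0))   # ~7.1 on an 8-core system, below the ideal 8
```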

We have performed the speedup evaluation on data sets with different sizes, and Fig. 10 shows the results. As the size of the data set increases, the speedup of PSHS becomes approximately linear, especially when the data set is large, such as 12 million or 24 million samples. We also notice that when the data set is small, such as 3 million samples, the performance of the 32-core system is not significantly better than that of the 16-core system, which does not accord with our intuition. The reason is that the time for processing the 3 million sample data set is not much larger than the communication time among the nodes and the time occupied by fault tolerance. However, as the data set grows, the processing time occupies the main part, leading to a good speedup performance.

Scaleup: Scaleup measures the ability to grow both the system and the data set size. It is defined as the ability of an m-times larger system to perform an m-times larger job in the same run-time as the original system. The scaleup metric is:

Scaleup(data, m) = (run-time for processing data on 1 core) / (run-time for processing m × data on m cores)    (5)

Fig. 7. Poker Hand data set and hyper surface structure obtained by its MCS.

Fig. 8. PMCS and hyper surface structure obtained by PMCS of Poker Hand data set.

Fig. 9. The hyper surface structures obtained by MCS and PMCS of Waveform data set.

Table 4
Comparison of classification ability.

Data set                     Sample No.  MCS sample No.  PMCS sample No.  MCS accuracy  PMCS accuracy  Sampling ratio
Iris                         150         80              80               100%          100%           53.33%
Wine                         178         129             129              100%          100%           72.47%
Sonar                        208         186             186              100%          100%           89.42%
Wdbc                         569         268             268              100%          100%           47.10%
Pima                         768         506             506              99.21%        99.21%         65.89%
Contraceptive Method Choice  1473        1219            1219             100%          100%           82.76%
Waveform                     5000        4525            4525             99.84%        99.84%         90.50%
Breast Cancer Wisconsin      9002        1243            1243             99.85%        99.85%         13.81%
Poker Hand                   25,010      22,904          22,904           98.29%        98.29%         91.58%
Letter Recognition           20,000      13,668          13,668           90.47%        90.47%         68.34%
Ten Spiral                   33,750      7285            7285             100%          100%           21.59%

Table 5
Performance comparison on synthetic data.

Sample No.   Testing sample No.  MCS sample No.  PMCS sample No.  MCS sampling time  PMCS sampling time  MCS testing accuracy  PMCS testing accuracy
1,250,000    5,400,002           875,924         875,924          14 m 21 s          1 m 49 s            100%                  100%
5,400,002    10,500,000          1,412,358       1,412,358        58 m 47 s          6 m 52 s            100%                  100%
10,500,000   22,800,002          6,582,439       6,582,439        1 h 30 m 51 s      12 m 8 s            100%                  100%
22,800,002   54,000,000          12,359,545      12,359,545       3 h 15 m 37 s      25 m 16 s           100%                  100%
54,000,000   67,500,000          36,582,427      36,582,427       7 h 41 m 35 s      48 m 27 s           100%                  100%

To demonstrate how well PSHS deals with uncertainty distribution big data when more cores are available, we have performed scalability experiments in which we increase the size of the data set in proportion to the number of cores. The data sets of 3 million, 6 million, 12 million and 24 million samples are processed on 4, 8, 16 and 32 cores respectively. Fig. 11 shows the performance results on these data sets.

Fig. 10. Speedup performance.

Fig. 11. Scaleup performance.

As the data set becomes larger, the scalability of PSHS drops slowly. It always maintains a scaleup value higher than 84%. Obviously, the PSHS algorithm scales very well.

Sizeup: Sizeup analysis holds the number of cores in the system constant and grows the size of the data set. Sizeup measures how much longer a given system takes when the data set is m times larger than the original data set. The sizeup metric is defined as follows:

Sizeup(data, m) = (run-time for processing m × data) / (run-time for processing data)    (6)

To measure the sizeup performance, we fix the number of cores to 4, 8, 16 and 32 respectively. Fig. 12 shows the sizeup results on different numbers of cores. When the number of cores is small, such as 4 or 8, the sizeup performances differ little. However, as more cores become available, the sizeup values on 16 or 32 cores decrease significantly compared to those on 4 or 8 cores for the same data sets. The graph demonstrates that PSHS has a very good sizeup performance.

Fig. 12. Sizeup performance.

5. Conclusion

With the advent of the big data era, the demand for processing big data with uncertainty distribution is increasing. In this paper, we present a Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution to get the Minimal Consistent Subset (MCS) of the original sample set, whose inherent structure is uncertain. Our experimental evaluation on both real and synthetic data sets shows that our approach not only obtains hyper surface structures and testing accuracies consistent with those of the serial algorithm, but also performs efficiently in terms of speedup, scaleup and sizeup. Besides, our algorithm can process big data with uncertainty distribution on commodity hardware efficiently. It should be noted that PSHS is a universal algorithm, but its features may differ considerably with different classification methods. In the future, we will conduct further experiments and refine the parallel algorithm to improve the usage efficiency of computing resources.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 61035003, 61175052, 61203297) and the National High-tech R&D Program of China (863 Program) (Nos. 2012AA011003, 2013AA01A606, 2014AA012205).

References

[1] V. Novák, Are fuzzy sets a reasonable tool for modeling vague phenomena?, Fuzzy Sets Syst. 156 (2005) 341–348.
[2] L.A. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[3] D. Dubois, H. Prade, Gradualness, uncertainty and bipolarity: Making sense of fuzzy sets, Fuzzy Sets Syst. 192 (2012) 3–24.
[4] Q. He, Z. Shi, L. Ren, E. Lee, A novel classification method based on hypersurface, Math. Comput. Model. 38 (2003) 395–407.
[5] Q. He, X. Zhao, Z. Shi, Classification based on dimension transposition for high dimension data, Soft Comput. 11 (2007) 329–334.
[6] X. Zhao, Q. He, Z. Shi, Hypersurface classifiers ensemble for high dimensional data sets, in: Advances in Neural Networks – ISNN 2006, Springer, 2006, pp. 1299–1304.
[7] Q. He, X. Zhao, Z. Shi, Minimal consistent subset for hyper surface classification method, Int. J. Pattern Recognit. Artif. Intell. 22 (2008) 95–108.
[8] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (2008) 107–113.
[9] R. Lämmel, Google's MapReduce programming model – revisited, Sci. Comput. Program. 70 (2008) 1–30.
[10] P. Hart, The condensed nearest neighbor rule, IEEE Trans. Inf. Theory 14 (1968) 515–516.
[11] V. Cerverón, A. Fuertes, Parallel random search and tabu search for the minimal consistent subset selection problem, in: Randomization and Approximation Techniques in Computer Science, Springer, 1998, pp. 248–259.

[12] B.V. Dasarathy, Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design, IEEE Trans. Syst. Man Cybern. 24 (1994) 511–517.
[13] P.A. Devijver, J. Kittler, On the edited nearest neighbor rule, in: Proc. 5th Int. Conf. on Pattern Recognition, 1980, pp. 72–80.
[14] L.I. Kuncheva, Fitness functions in editing k-NN reference set by genetic algorithms, Pattern Recognit. 30 (1997) 1041–1049.
[15] C. Swonger, Sample set condensation for a condensed nearest neighbor decision rule for pattern recognition, in: Frontiers of Pattern Recognition, 1972, pp. 511–519.
[16] H. Zhang, G. Sun, Optimal reference subset selection for nearest neighbor classification by tabu search, Pattern Recognit. 35 (2002) 1481–1490.
[17] X. Xu, J. Jäger, H.-P. Kriegel, A fast parallel clustering algorithm for large spatial databases, in: High Performance Data Mining, Springer, 2002, pp. 263–290.