

A Theory of Evidence-based method for assessing frequent patterns

Francisco Guil a, Roque Marín b

a Dept. of Languages and Computer Science, High School of Engineering, University of Almeria, Spain
b Dept. of Information and Communication Engineering, Faculty of Computer Science, University of Murcia, Spain

Keywords: Frequent itemset mining; Theory of Evidence; Information measures; Uncertainty management


Abstract

Frequent itemset (or frequent pattern) mining is a very important issue within the data mining field. Both syntactic simplicity and descriptive potential are key features of the itemset-based pattern, and they have led to its widespread use in a growing number of real-life domains. Some of the most representative algorithms for mining this kind of pattern are Apriori-like algorithms and, therefore, the number of patterns obtained under normal conditions is very large, making the process of evaluation and interpretation quite difficult. This problem is compounded if we consider that knowledge discovery is an iterative process, and a change in the parameters of the preprocessing techniques or the mining algorithm can lead to significant changes in the result. In this paper, we propose a method based on Shafer's Theory of Evidence which uses two information measures for the quality evaluation of the set of frequent patterns. From a practical point of view, the main goal is to select, for a given database, the preprocessing technique that best leads to the discovery of useful knowledge. Nevertheless, the underlying idea is to propose a formal method to assess, objectively, sets of frequent patterns, seen as belief structures, in terms of the certainty of the information they represent.


1. Introduction

Data mining is a key step in the process known as Knowledge Discovery in Databases (KDD), and consists of applying data analysis and discovery algorithms which produce a particular enumeration of structures (models or patterns) over the data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). In the case of pattern mining, the simplest technique aims to find association rules, whose complexity lies in the extraction of frequent itemsets. From the set of frequent itemsets, the generation of association rules is straightforward. Since the problem of frequent itemset mining was introduced by Agrawal, Imielinski, and Swami (1993), many research works have followed different directions, including improvements of the Apriori algorithm, mining generalized, multi-level, or quantitative association rules, mining weighted or fuzzy association rules, mining constraint-based rules, efficiently mining long patterns or event-based patterns, maintaining the discovered association rules, etc.

A common feature of Apriori-like algorithms is that they cannot handle databases with continuous attributes. This limitation does not concern only this sort of algorithm since, in general, it is shared by many machine learning algorithms (Elomaa & Rousu, 2004). Numerical attribute handling is critical in inductive learning: depending on the learning method, the numerical domain is discretized in a data preprocessing step, or else the discretization is embedded within the learner, as in, for example, a decision tree learning algorithm. Basically, discretization techniques consist of the partition of the numerical domain into a set of intervals, considering each interval as a category. There are automatic discretization techniques which find the partitions by trying to optimize a given evaluation function, and techniques that use expert knowledge to divide the domain into several intervals. In general, there are a large number of discretization techniques, and their efficiency is often a bottleneck in the knowledge discovery process.

Thus, in order to extract useful knowledge from such datasets (very common in real-world domains), a preprocessing phase (discretization) is needed. The problem here is to select a good discretization method that, in an efficient way, produces a discretized dataset from which useful knowledge can be obtained. If we were dealing with global data mining techniques, the accuracy of the model would be a good way to select the best discretization technique. But if we are dealing with local methods (pattern mining), it is rather difficult to assess the quality of the discovered knowledge. The typically huge number of mined patterns, a number which depends both on the user-defined parameters of the algorithm and on the discretization method used, makes it virtually impossible to evaluate the outcome objectively. Therefore, a formal method becomes necessary for assessing mined sets of patterns, one which results in an objective value indicating the quality degree of the knowledge they represent.


In this paper, we propose a method to assess the quality of mined frequent itemset-based patterns by the combined use of two uncertainty measures, entropy and specificity, both proposed by Yager (2008) as indicators of the quality of the evidence associated with a belief structure in the setting of the Theory of Evidence. The key aspect of the method consists of considering sets of frequent patterns as bodies of evidence. From the Theory of Evidence point of view, a data mining algorithm can be seen as an evidence-based agent. Conversely, from data mining's point of view, the treatment of sets of frequent patterns as bodies of evidence allows the use of information measures to quantify the quality of the discovered knowledge. Specifically, we propose a quality index based on the distance of the pair formed by the entropy and non-specificity measures with respect to the situation of total certainty.

In order to demonstrate the usefulness of the method, we have carried out an empirical evaluation using four real datasets taken from the UCI Machine Learning Database Repository. With the aim of obtaining sets of frequent patterns which show the regularities present in the databases, we have discretized them using three different discretization methods, and we have carried out two data mining processes for each discretized dataset, varying the value of the minimum support parameter. The quality index has been used to assess the outcome of each data mining process, allowing the selection of the set of frequent patterns with the highest quality value in terms of certainty.

The rest of this paper is organized as follows: Section 2 introduces the notation and basic definitions necessary to define the problem; Section 3 introduces the theoretical foundations of the proposed method, designed for evaluating sets of frequent patterns; Section 4 describes an empirical evaluation with real databases; finally, conclusions and future work are drawn in Section 5.

2. Problem definition

In this section we introduce the notation and basic definitions needed to specify the proposal in detail.

Let I = {e1, ..., ed} be a set of items. A subset of I, denoted as a, is called an itemset (or pattern). The size of an itemset is determined by the number of items it contains. A transactional dataset D is a collection of transactions, D = {t1, ..., tn}, where ti ⊆ I. For any pattern a, we denote the set of transactions that contain a as Da = {ti | a ⊆ ti, ti ∈ D}. For a transactional dataset D, a pattern a is frequent iff |Da| ≥ σ, where |Da| is the pattern frequency (the number of its occurrences in the dataset), denoted as fr(a), and σ is a user-defined parameter called minimum support (minsup).

Given a dataset D and a value for the user-defined parameter σ (minsup), the aim of frequent pattern mining is to determine all the frequent patterns (itemsets) in the dataset whose frequency is greater than or equal to σ. In order to achieve this goal, the algorithm follows a level-wise pattern generation satisfying the well-known Apriori property (downward closure property), which states that any subset of a frequent itemset is also frequent. Furthermore, although this property is used to prune the search space, it still leads to an explosive number of frequent patterns; for example, a frequent itemset with n items may generate 2^n sub-patterns, all of which are frequent too.

Thus, the result of the mining process consists of a set of frequent patterns denoted as B(D)_σ = {a1, ..., am}, where fr(ai) ≥ σ. The set B(D)_σ is characterized by a frequency distribution f = {f1, ..., fm} such that fi = fr(ai). From this function f, we obtain the relative frequency distribution f̄ = {f̄1, ..., f̄m}, such that f̄i = fi / Σ_j fj.
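To make these definitions concrete, the following minimal Python sketch (our own illustration, not any algorithm from the paper; each transaction is assumed to be a Python set of items) enumerates the frequent patterns of a transactional dataset by brute force and derives the relative frequency distribution f̄. Real Apriori-like miners generate candidates level-wise instead of testing all combinations.

```python
from itertools import combinations

def frequent_patterns(transactions, minsup):
    """Brute-force enumeration of frequent itemsets (illustrative only)."""
    items = sorted(set().union(*transactions))
    freq = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for cand in combinations(items, size):
            count = sum(1 for t in transactions if set(cand) <= t)
            if count >= minsup:
                freq[frozenset(cand)] = count
                found_any = True
        if not found_any:
            break  # Apriori property: no superset of an infrequent set is frequent
    total = sum(freq.values())
    rel = {a: c / total for a, c in freq.items()}  # relative frequencies f-bar
    return freq, rel
```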

In the context of knowledge discovery, once the set of frequent patterns is extracted, the next step is its evaluation and interpretation. These tasks are intended to discover useful knowledge, and they are usually performed by human experts. However, this is a very difficult process and can rarely be done with the original set because of the large number of frequent patterns that are often discovered. The problem is compounded if, instead of evaluating one set of frequent patterns, the expert has to evaluate an enumeration of sets obtained by varying the parameters of the algorithms designed to preprocess the databases (data cleaning or discretization, among others), or the parameters of the mining algorithm. In such a case, the evaluation process must be carried out in parallel for each of the mined sets of patterns, resulting in a much more complicated problem. If the expert has to compare two (or more) sets of frequent patterns in order to determine which one is the best for obtaining useful knowledge, what method should he or she follow? Would the set with more patterns be better, or the one with the longest patterns? Do the structure and frequency of the patterns not affect the quality of the information they represent?

On an experimental basis, we could determine the best set by obtaining, for each one, a classification model and studying its accuracy; however, several problems arise from this approach. On the one hand, there is no single method to obtain a classification model from a set of frequent patterns (Thabtah, 2007). What is more, the generation of a classification model from frequent patterns is still an open (and very interesting) problem. On the other hand, assuming that it were possible to obtain a suitable model from each set, this evaluation method would be time-consuming, and it must be done using a trial-and-error method of problem solving.

So, it seems quite interesting to study a formal method of assessment that indicates the degree of quality of a set of frequent patterns without having to generate any later model, i.e., a formal method that determines the degree of certainty of a set of patterns from the structure and frequency of each pattern that constitutes it. This is, precisely, the main goal of the next section.

3. Assessing frequent patterns

Starting with a set of frequent patterns characterized by a frequency distribution, our goal here is to introduce the information measures that will enable us to assess its quality in terms of certainty. In Yager (2008), the author relates the concepts of entropy (from the probabilistic framework) and specificity (from the possibilistic framework) within Shafer's Theory framework. Both information measures of uncertainty provide complementary quality measures of a body of evidence. With this proposal, Yager extended the Theory of Evidence of Shafer (also known as Dempster–Shafer Theory), developed for modeling complex systems, with a wide range of applications in real domains, such as those presented in Beynon, Cosker, and Marshall (2001) and Luo, Yang, Hu, and Hu (2012).

In the following subsection, a brief summary of Shafer's Theory of Evidence is presented; the section then concludes with the proposed method, defining the quality measures and setting their goals.

3.1. Shafer’s Theory of Evidence

Shafer's Theory of Evidence is based on a special fuzzy measure called a belief measure. Beliefs can be assigned to propositions to express the uncertainty associated with their being discerned. Given a finite universal set U (the frame of discernment), beliefs are usually computed on the basis of a density function m: 2^U → [0, 1], called a basic probability assignment (bpa):

m(∅) = 0  and  Σ_{A ⊆ U} m(A) = 1    (1)

The function m(A) represents the belief exactly committed to the set A. If m(A) > 0, then A is called a focal element. The set of focal elements constitutes the core:

F = {A ⊆ U : m(A) > 0}    (2)

The core and its associated bpa define a body of evidence, from which a belief function Bel: 2^U → [0, 1] is defined:

Bel(A) = Σ_{B ⊆ A} m(B)    (3)

From any given measure Bel, a dual measure Pl: 2^U → [0, 1] can be defined:

Pl(A) = Σ_{B ∩ A ≠ ∅} m(B)    (4)

It can be verified (Shafer, 1976) that the functions Bel and Pl are a necessity and a possibility measure, respectively, if and only if the focal elements form a nested or consonant set, that is, if they can be ordered in such a way that each set is contained within the next. In that case, the associated belief and plausibility measures have the following properties. For all A, B ∈ 2^U,

Bel(A ∩ B) = N(A ∩ B) = min[Bel(A), Bel(B)]    (5)

and

Pl(A ∪ B) = Π(A ∪ B) = max[Pl(A), Pl(B)],    (6)

where N and Π are the necessity and possibility measures, respectively. If all the focal elements are singletons, that is, A = {x}, where x ∈ U, then:

Bel(A) = Pr(A) = Pl(A)    (7)

where Pr(A) is the probability of A. In general terms,

Bel(A) ≤ Pr(A) ≤ Pl(A)    (8)

A significant aspect of Shafer's structure is the ability to represent, in this common framework, different types of uncertainty, that is, probabilistic and possibilistic uncertainty.
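As an illustration of Eqs. (3) and (4), the following Python sketch (the dictionary-based bpa representation and the function names are our own) computes Bel and Pl for a given subset of the frame of discernment:

```python
def belief(m, A):
    """Bel(A): total mass of focal elements fully contained in A (Eq. (3))."""
    return sum(mass for B, mass in m.items() if B <= A)

def plausibility(m, A):
    """Pl(A): total mass of focal elements intersecting A (Eq. (4))."""
    return sum(mass for B, mass in m.items() if B & A)

# A bpa as a dict from frozenset focal elements to masses summing to 1.
m = {frozenset({'a'}): 0.5, frozenset({'a', 'c'}): 0.3, frozenset({'c'}): 0.2}
A = frozenset({'a'})
print(belief(m, A), plausibility(m, A))  # 0.5 0.8, illustrating Bel <= Pl (Eq. (8))
```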

3.2. Specificity and entropy

The concept of information measures has been entirely renewed by research in the setting of modern theories of uncertainty, like Fuzzy Sets (Zadeh, 1965), Possibility Theory (Zadeh, 1978), and Shafer's Theory of Evidence (Shafer, 1976). An important aspect is that all of these theories are quite related, although they take a different approach when compared to Probability Theory. The new types of generalized information measures enable several facets of uncertainty to be discriminated, modeled, and measured. Let B(U) be the set of normal bodies of evidence on U (those with m(∅) = 0). An information measure is any mapping f: B(U) → [0, +∞) (the non-negative real line) (Dubois, Prade, & Yager, 1999). The function f depends on both the core (F) of the body of evidence to which it applies and its associated bpa (m). The function f pertains to some property of bodies of evidence and assesses the extent to which it is satisfied. In Dubois et al. (1999), the authors study three different measures in the general framework of evidence theory, corresponding to different properties of bodies of evidence: imprecision, dissonance, and confusion measures. However, in this paper, we only take into account two particular measures of imprecision and dissonance, namely specificity and entropy, because, as Yager pointed out, their combination is a good approach to measuring the quality of a particular body of evidence.

Initially, Yager (1981) expounded the concept of specificity as an amount that estimates the precision of a fuzzy set. Later, Yager (1983) and Dubois and Prade (1985) extended the concept to deal with bodies of evidence, defining the measure of specificity (Sp) of a body of evidence as:

Sp(m) = Σ_{A ⊆ U} m(A)/|A|    (9)

where |A| denotes the number of elements of the set A, that is, the cardinality of A. It is easy to see that 1 − Sp(m) is a measure of imprecision, related to the concept of non-specificity. We will denote this imprecision value as Jm, that is, Jm = 1 − Sp(m).

Also in Yager (1983), the author extended the Shannon entropy to bodies of evidence, considering the following expression, where ln is the natural (Napierian) logarithm:

Em(m) = − Σ_{A ⊆ U} m(A) ln(Pl(A))    (10)

Pl(A) is the plausibility of A, and ln(Pl(A)) can be interpreted in terms of Shafer's weight of conflict. Like the Shannon entropy, Em(m) is a measure of discordance associated with the body of evidence.

In the ideal situation, where no uncertainty is present in the body of evidence, Em(m) = Jm(m) = 0. This is exactly the key point of the proposed method for measuring the certainty associated with a belief structure. Em provides a dissonance measure of the evidence, whereas Jm provides a dispersion measure of the belief. So, the lower Em, the more consistent the evidence; and the lower Jm (the higher Sp), the less diverse. For certainty, we want both Em and Jm to be low. So, by using a combination of both measures, we can obtain a good indication of the quality of the evidence. For measuring this indicator (Q(m), denoting quality), we suggest a heuristic method based on the inverse of the Euclidean distance between the pair (Em, Jm) and the pair (0, 0), that is:

Q(m) = 1 / sqrt((Em)² + (Jm)²)    (11)

3.3. Evaluation of a frequent pattern set

Let B(D)_σ be a set of frequent patterns, i.e., B(D)_σ = {a1, ..., am}, mined from the dataset D, and let f̄ be the relative frequency distribution associated with B(D)_σ. Taking into account that a frequent pattern is a set of items, let I be the set of all the frequent items present in B(D)_σ. Making a syntactic correspondence with the basic elements of the Theory of Evidence, we obtain that the pair (B(D)_σ, f̄) is a body of evidence, where B(D)_σ is the core (formed by the set of focal elements ai ⊆ I) and f̄ is the basic probability assignment. In this case, I is the frame of discernment.

Once this correspondence is established, the evaluation of the set of frequent patterns is carried out by the method described above, i.e., using the function Q(m) defined in Eq. (11). In general terms, this measure gives the expert additional and valuable information about the quality of the set. In this paper, in particular, the Q(m) function (specifically, Q(f̄)) will be used to compare objectively sets of frequent patterns mined from real databases which have been discretized using different discretization methods. Let Dm_σ be an Apriori-like data mining algorithm characterized by the user-defined parameter σ (or minsup), and let Db_i be a database. A special feature of Dm_σ is that it cannot handle numerical attributes. Since Db_i may contain continuous attributes, very common in real datasets, a discretization method is required to obtain a dataset with only nominal attributes. Let d_j be a discretization method that generates a dataset from a certain database Db_i, i.e., D^j_i = d_j(Db_i). Executing the algorithm on each of the datasets (Dm_σ(D^j_i)) results in a different set of frequent patterns, denoted as B(D^j_i)_σ, each one characterized by a frequency distribution f̄. In order to compare the different discretization methods and determine which one provides patterns containing information with less uncertainty, we suggest the use of the Q(f̄) function, such that the best method is the one that generates the set with the highest value of Q. For a more complete assessment, we have also used different values of the minimum support parameter of the Dm_σ algorithm in the empirical evaluation. Fig. 1 shows, schematically, the mining process followed.


Table 1. Transactional dataset.

Date   Item list
0      d, i, l, w
1      g, f, k, l, m
2      d, g, f, v
3      e, l
4      d, p
5      g, s, p
6      d, i, v
7      d, e, f
8      g, e, k
9      h, i, j, k

Table 2. Frequent patterns (minsup = 2).

i    ai       fi   f̄i
1    {d}      5    0.13158
2    {e}      3    0.07895
3    {f}      3    0.07895
4    {g}      4    0.10526
5    {i}      3    0.07895
6    {k}      3    0.07895
7    {l}      3    0.07895
8    {p}      2    0.05263
9    {v}      2    0.05263
10   {d, f}   2    0.05263
11   {d, i}   2    0.05263
12   {d, v}   2    0.05263
13   {f, g}   2    0.05263
14   {g, k}   2    0.05263

Table 3. Plausibility values of frequent patterns (minsup = 2).

i    Pl(ai)     i    Pl(ai)
1    0.28947    8    0.05263
2    0.07895    9    0.10526
3    0.18421    10   0.42105
4    0.21053    11   0.36842
5    0.13158    12   0.34211
6    0.13158    13   0.34211
7    0.07895    14   0.28947



Example 1 (Case study with a toy dataset). As an example to illustrate the proposed method, let D be the (toy) transactional dataset shown in Table 1, obtained from a supermarket database. Setting, for example, minsup = 20%, which corresponds to a minimum frequency of 2 transactions, implies that an itemset-based pattern is considered frequent if its frequency is greater than or equal to 2.

Table 2 shows the set of frequent patterns, B(D)_2, formed by a total of 14 itemset-based patterns. The column i denotes the number of the pattern (assuming a lexicographic order), ai the pattern, fi its frequency value, i.e., the number of its occurrences in the dataset, and f̄i its relative frequency, computed as f̄i = fi / Σ_j fj.

As we established above, the pair (B(D)_2, f̄) forms a body of evidence, from which the information measures are computed following Eqs. (9) and (10). While Sp depends only on f̄, Em depends on both f̄ and the plausibility (Pl) value of each pattern. Therefore, we first need to compute the plausibility values following Eq. (4) (see Table 3).

Once the intermediate values are computed, we can conclude that Sp(f̄) = 0.86842 and, therefore, Jm(f̄) = 1 − Sp(f̄) = 0.13158. In addition, Em(f̄) = 1.73151. Finally, we compute the quality index, obtaining Q = 0.57587.
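These figures can be checked mechanically; the following self-contained Python script (an illustration of the method, not the authors' implementation) reproduces the plausibilities of Table 3 and the reported values of Sp, Jm, Em, and Q from the frequencies of Table 2:

```python
import math

# Frequent patterns and absolute frequencies from Table 2 (minsup = 2).
patterns = {frozenset(p): f for p, f in [
    ('d', 5), ('e', 3), ('f', 3), ('g', 4), ('i', 3), ('k', 3), ('l', 3),
    ('p', 2), ('v', 2), ('df', 2), ('di', 2), ('dv', 2), ('fg', 2), ('gk', 2)]}

total = sum(patterns.values())                      # 38
m = {a: f / total for a, f in patterns.items()}     # bpa = relative frequencies

pl = {a: sum(mb for b, mb in m.items() if b & a) for a in m}  # Eq. (4), Table 3
sp = sum(mass / len(a) for a, mass in m.items())              # Eq. (9)
jm = 1 - sp
em = -sum(mass * math.log(pl[a]) for a, mass in m.items())    # Eq. (10)
q = 1 / math.hypot(em, jm)                                    # Eq. (11)
print(f"Sp={sp:.5f} Jm={jm:.5f} Em={em:.5f} Q={q:.5f}")
# Sp=0.86842 Jm=0.13158 Em=1.73151 Q=0.57587
```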

To the best of our knowledge, the treatment of sets of frequent patterns (characterized by their structure and frequency distribution) as bodies of evidence is a novel approach that enables us to assess, in a formal way, the patterns extracted by data mining algorithms, considered as evidence-based agents. A preliminary version of this proposal was presented in Guil, Palacios, Campos, and Marín (2010), although in that case the solution was focused on the evaluation of temporal patterns extracted from a medical database.

Fig. 1. The mining process. [Schematic figure: each database is processed by several discretization techniques; the mining algorithm, parameterized by minsup, is run on each discretized dataset, producing one set of frequent patterns per combination.]

4. Empirical evaluation

From a practical point of view, and with the aim of presenting different case studies, we have carried out a series of experiments taking into consideration four real datasets from the UCI Machine Learning Database Repository (Frank & Asuncion, 2010): (1) Pima Indians Diabetes Database (diabetes.data), (2) Sonar: Mines vs. Rocks Database (sonar.data), (3) Vehicle Silhouettes (vehicles.data), and (4) Ionosphere Database (ionosphere.data). A brief description of each database, as presented in the repository, follows:



Table 4. Main characteristics of the preprocessed datasets.

Database          Dataset   #Att0   #AttR
diabetes.data     D^1_1     8       5
                  D^2_1     8       3
                  D^3_1     8       3
sonar.data        D^1_2     60      2
                  D^2_2     60      7
                  D^3_2     60      1
vehicles.data     D^1_3     18      10
                  D^2_3     18      12
                  D^3_3     18      7
ionosphere.data   D^1_4     34      4
                  D^2_4     34      5
                  D^3_4     34      3

(D^j_i denotes the dataset obtained by applying discretization technique d_j to database Db_i.)



(1) diabetes.data: This database collects data from 768 patients. The data belong to a diagnostic problem in which a binary variable is investigated to determine whether the patient shows signs of diabetes according to World Health Organization criteria. A total of 8 attributes are considered in this problem.

(2) sonar.data: This database contains 208 instances, of which 111 correspond to mines and the rest to rocks. Each instance has been obtained by bouncing sonar signals off a metal cylinder (or a rock) at various angles and under various conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. Each instance is formed by 60 numbers (attributes) in the range 0.0 to 1.0, where each number represents the energy within a particular frequency band, integrated over a certain period of time.

(3) vehicles.data: This database collects data from 846 vehicles with the purpose of classifying a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. In total, the database contains 18 features per instance.

(4) ionosphere.data: This database contains data collected by a system consisting of a phased array of 16 high-frequency antennas. The targets were free electrons in the ionosphere. There are two class values: 'good' radar returns, which show evidence of some type of structure in the ionosphere, and 'bad' returns, whose signals pass through the ionosphere. It contains a total of 351 instances, where each instance is formed by 34 attributes.

The ultimate goal is to obtain useful knowledge from each database. For such a task we first extract (or mine) the set of frequent patterns that will serve as a basis for obtaining useful knowledge through its interpretation and evaluation by domain experts. In our particular case, we have used an efficient algorithm named TSET-Miner (Guil & Marín, 2012); although this algorithm allows the extraction of large sets of frequent patterns in a competitive time, it cannot handle numeric attributes (or items). This is a common characteristic of all Apriori-based algorithms designed to obtain frequent patterns. This peculiarity makes it necessary to add a new preprocessing step to the global process: discretization. An important fact is that, depending on the discretization method used, as with the values of the parameters of the mining algorithm, the number and type of patterns change dramatically. Therefore, it is crucial that the expert determines, with as little effort as possible, which discretization method best leads to the obtaining of useful knowledge. Currently, one possibility is to assess, individually, the patterns obtained when trying to determine the best set. In this sense, we can try to reduce the complexity of the problem by obtaining compact subsets such as maximal or closed patterns (Bayardo, 1998; Pasquier, Bastide, Taouil, & Lakhal, 1999), as sketched below. Another possibility is to determine the best set through the derived association rules with greater confidence (or other measures, like those presented in Berzal, Blanco, Sanchez, & Vila (2002)). But, after all, the evaluation and interpretation processes are still exhausting and must be carried out through trial-and-error techniques. Thus, it is of great interest to determine a measure to assess and compare, objectively, different sets of patterns.
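As a side note on the compaction alternative mentioned above, once the frequent set is available, keeping only the maximal patterns is straightforward; a minimal sketch of our own, assuming patterns are given as frozensets:

```python
def maximal_patterns(patterns):
    """Keep only maximal frequent itemsets: those with no frequent proper
    superset. Closed-pattern mining is a related, frequency-preserving variant."""
    return [a for a in patterns if not any(a < b for b in patterns)]
```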

Then, from each database Db_i, we have obtained three discretized datasets using three different discretization techniques (d_j) implemented in the WEKA workbench (Hall et al., 2009). The first discretization technique (d1) uses the default values of the parameters: by simple binning, it converts a range of numeric attributes into nominal attributes, and each continuous domain is divided into 10 bins (or intervals). The second technique (d2) determines the number of intervals using cross-validated entropy (the findNumBins parameter is set to true), and the d3 technique uses equal-frequency binning (setting the useEqualFrequency parameter to true). So, after preprocessing the databases, we have obtained 12 datasets, D^j_i, where j varies from 1 to 3 and i from 1 to 4.

Once the datasets have been obtained, and with the aim of minimizing the complexity of the problem, before starting with the mining process we have also performed an attribute selection preprocessing step, obtaining the results shown in Table 4. The attribute selection used is the WrapperSubsetEval technique implemented in WEKA, which evaluates attribute sets by using a learning scheme, in our case the J48 classifier (a clone of the well-known decision tree C4.5). The underlying idea is to mine patterns from datasets where only relevant attributes are taken into account. The #Att0 column denotes the number of attributes of the original database, whereas #AttR denotes the number of relevant attributes after the attribute selection phase.
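To make the binning variants concrete, here is a sketch of equal-width (d1-style) and equal-frequency (d3-style) discretization using NumPy. It only illustrates the idea; it is not WEKA's Discretize filter, and the cross-validated entropy choice of the number of bins used by d2 is omitted:

```python
import numpy as np

def equal_width_bins(values, n_bins=10):
    """d1-style simple binning: split the attribute's range into n_bins
    equal-width intervals and map each value to its interval index."""
    edges = np.linspace(np.min(values), np.max(values), n_bins + 1)
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

def equal_frequency_bins(values, n_bins=10):
    """d3-style binning: cut points at quantiles, so each interval
    receives roughly the same number of values."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
```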

For each dataset D^j_i, we have obtained a set of frequent patterns with two different values for the minimum support (σ = 3 and σ = 1), i.e., the minimum frequency (fr) a pattern needs in order to be considered frequent. Table 5 shows the number of frequent patterns obtained under each experimental condition.

On the one hand, we want to highlight the impact caused by both the preprocessing technique and the mining parameters on the number of frequent patterns obtained. For example, in the case of the dataset vehicles.data, the d3 discretization technique, associated with seven relevant attributes, results in a total of 55,029 patterns if we set the minimum frequency to 1. However, using the same value for the parameters of the mining algorithm, the d2 technique, with 12 relevant attributes, gives rise to 2,131,376 patterns (an increase by a factor of approximately 38.73). On the other hand, it is remarkable that the great number of patterns that a mining algorithm usually extracts must be evaluated and interpreted by human experts in order to obtain useful knowledge. Although there are post-data-mining techniques designed to help the expert in this task, most of them consist of eliminating subsets of patterns following a given criterion. In our opinion, each and every one of the frequent patterns is equally important and must be taken into account throughout the global knowledge discovery process. So, it seems quite important to design a formal method that assigns to each set of frequent patterns a value indicating its information quality degree.


Table 5. Number of patterns vs. minimum frequency.

Database          Dataset   #P(fr=3)   #P(fr=1)
diabetes.data     D^1_1     1439       4104
                  D^2_1     153        210
                  D^3_1     381        792
sonar.data        D^1_2     64         34
                  D^2_2     1914       13,986
                  D^3_2     10         10
vehicles.data     D^1_3     63,861     478,153
                  D^2_3     226,131    2,131,376
                  D^3_3     8070       55,029
ionosphere.data   D^1_4     411        1,196
                  D^2_4     630        1921
                  D^3_4     194        420

Table 7. Quality values associated with sonar.data.

Dataset   Measure   fr=3       fr=1
D^1_2     Em        1.801029   1.865499
          Jm        0.153912   0.166667
          Q         0.553222   0.533923
D^2_2     Em        1.319725   1.089111
          Jm        0.556736   0.658062
          Q         0.698153   0.785866
D^3_2     Em        2.302398   2.302398
          Jm        0          0
          Q         0.434330   0.434330

Table 8. Quality values associated with vehicles.data.

Dataset   Measure   fr=3       fr=1
D^1_3     Em        1.264328   1.020741
          Jm        0.713081   0.770891
          Q         0.688917   0.781779
D^2_3     Em        0.996083   0.747035
          Jm        0.763895   0.813988
          Q         0.796638   0.905121
D^3_3     Em        2.300707   1.810233
          Jm        0.552951   0.658062
          Q         0.422615   0.519175

Table 9. Quality values associated with ionosphere.data.

Dataset   Measure   fr=3       fr=1
D^1_4     Em        1.847881   1.775818
          Jm        0.378013   0.427778
          Q         0.530181   0.547461
D^2_4     Em        1.021551   1.070882
          Jm        0.495897   0.523118
          Q         0.880628   0.839051
D^3_4     Em        2.468671   2.348022
          Jm        0.270314   0.309524
          Q         0.402669   0.422237


Thus, the expert can determine, a priori, which is the best set of patterns without analyzing them independently. Through the quality Q-index, the expert can compare, objectively, sets of patterns, discarding those with less quality in terms of certainty.

Tables 6–9 show the results of the proposed evaluation method for two different minimum frequency values. For each dataset D^j_i, obtained from the database Db_i using one of the three discretization techniques d_j, we compute the Q-index for each set of frequent patterns from the Em and Jm information measures, considering the sets as bodies of evidence or belief structures.

In the case of the diabetes.data database, Table 6 shows the results obtained for each body of evidence. As can be seen, the best preprocessing technique is the discretization method d1, as this is the one that gives rise to the set of frequent patterns with the highest quality index. In the case of the sonar.data database, as can be seen in Table 7, the best discretization technique is d2. And finally, as shown in Table 8, vehicles.data coincides with the above database, since d2 is also the technique that results in the most informative set of patterns.

In all cases, the Q-index gives us an objective value for excluding bodies of evidence, allowing the expert to focus the evaluation and interpretation phase on a single dataset, the most informative one.

Table 6. Quality values associated with diabetes.data.

Dataset   Measure   fr=3       fr=1
D^1_1     Em        1.336641   1.298514
          Jm        0.493119   0.523118
          Q         0.701901   0.714324
D^2_1     Em        1.431991   1.432484
          Jm        0.305026   0.309524
          Q         0.683005   0.682341
D^3_1     Em        2.484528   2.353875
          Jm        0.271432   0.309524
          Q         0.400110   0.421205

A noteworthy case is D^3_2, i.e., the dataset obtained after applying the d3 discretization technique to the sonar.data database: the number of relevant attributes is equal to 1, and this leads to obtaining, regardless of the minsup value, 10 frequent patterns (see Tables 4 and 5), all of length 1 (1-patterns). This implies that all the frequent patterns are singletons, which corresponds to a Bayesian structure in the terms described in Yager (2008). In such a case,

Sp(m) = Σ_{x∈I} m({x})/|{x}| = Σ_{x∈I} m({x})/1 = 1

So, Jm(m) = 1 − Sp(m) = 0 (equal to the value obtained empirically). Furthermore, when we deal with this kind of structure, Em reduces to the Shannon entropy, that is:

Em(m) = − Σ_{x∈I} m({x}) ln(m({x})).

The case of maximum entropy is obtained when we deal with k disjoint elements and m({x}) = 1/k, where k is the number of frequent 1-patterns. In this case,

Em(m) = − Σ_{i=1}^{k} (1/k) ln(1/k) = ln(k)

As k = 10 (the number of frequent patterns), Em(m) = ln(10) = 2.302585 ≈ 2.302398 (close to the value obtained empirically). In practice, this is an extreme case that matches the expected properties of the information measures.
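This limiting case is easy to verify numerically; a small sketch under the stated assumptions (k = 10 equiprobable singleton focal elements, with names of our own choosing):

```python
import math

# Bayesian structure: 10 equiprobable singleton focal elements.
m = {frozenset({f'x{i}'}): 1 / 10 for i in range(10)}

sp = sum(mass / len(A) for A, mass in m.items())         # Eq. (9): 1.0
jm = 1 - sp                                              # non-specificity: 0.0
# Disjoint singletons imply Pl({x}) = m({x}), so Em is the Shannon entropy.
em = -sum(mass * math.log(mass) for mass in m.values())  # ln(10) = 2.302585...
print(sp, jm, em)
```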

As a final consideration, one issue of interest, which we raise as future work, is the study of the relationship between quality and accuracy in the framework of the Theory of Evidence. Although classification and frequent pattern mining are two very different tasks, the relation between them has been studied in the literature. In this sense, we can highlight the review of associative classification presented in Thabtah (2007).


Table 10. Accuracy vs. quality index.

Database          Dataset   Acc1      Acc2      Q
diabetes.data     D^1_1     74.8698   96.1350   0.714324
                  D^2_1     68.8802   69.9169   0.682341
                  D^3_1     74.7396   88.5755   0.421205
sonar.data        D^1_2     75.9615   62.4354   0.533923
                  D^2_2     78.3654   97.9886   0.785866
                  D^3_2     74.5192   51.7009   0.434330
vehicles.data     D^1_3     70.5674   95.6041   0.781779
                  D^2_3     70.5674   98.0816   0.905121
                  D^3_3     69.6217   97.8779   0.519175
ionosphere.data   D^1_4     90.5983   95.9417   0.547461
                  D^2_4     91.4530   97.0030   0.839051
                  D^3_4     90.0285   92.6218   0.422237

(The Q values correspond to the fr = 1 experiments of Tables 6–9.)


With the aim of empirically testing the usefulness of the quality measure proposed in this paper, Table 10 summarizes the Q value of each discretized dataset versus the accuracy value obtained with two different kinds of classifiers implemented in WEKA. The Acc1 column shows the classification accuracy of a decision tree classifier (J48) with a 10-fold cross-validation test. Acc2 denotes the predictive accuracy of a set formed by class association rules. The associative classifier is the Predictive Apriori algorithm (Scheffer, 2001), designed to obtain the n most predictive rules in a Bayesian framework. In our experiments, we have obtained the 100 most accurate classification rules, assigning to Acc2 the average accuracy value. Finally, Q denotes the quality index proposed in Eq. (11).
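For readers who want to replicate an Acc1-style check outside WEKA, the following scikit-learn sketch computes the 10-fold cross-validated accuracy of a CART decision tree, analogous in spirit to (but not identical with) the J48 runs behind Table 10; the iris dataset is a stand-in for the UCI files used in the paper:

```python
from sklearn.datasets import load_iris  # stand-in dataset, for illustration only
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Mean 10-fold cross-validated accuracy of a CART decision tree.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())
```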

As can be seen, the Q-index correlates with accuracy in every reported case, showing its maximum value in those datasets which result in a more accurate classifier, whether J48 or the associative one. These results strengthen, empirically, the usefulness of the method, which can be proposed as a new evidence-based tool for assessing frequent itemset-based patterns.

5. Conclusions and future work

In this paper we have presented a formal method to assess, objectively, a set of frequent patterns which shows the regularities present in an input dataset. The method is based on the use of two information measures proposed by Yager in the context of Shafer's Theory of Evidence. The proposal involves the combined use of both measures, an entropy-like measure (Em) and a (non-)specificity-like measure (Jm), in order to quantify objectively the quality level (certainty) of bodies of evidence. Our approach is based on considering the whole set of mined patterns as a body of evidence. Thus, it is possible to use information measures to provide objective values which assist experts in the evaluation and interpretation of frequent patterns and, therefore, in the discovery of useful information from data. As future work, we would like to extend the study taking into account different information measures, like the total uncertainty measure proposed by Pal, Bezdek, and Hemasinha (1992, 1993).

Acknowledgements

This work was partly funded by the Spanish Ministry of Science and Innovation project TIN2009-14372-C03-01 and the Junta de Andalucía (Andalusian Regional Government) Excellence Project P07-SEJ-03214.

References

Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD international conference on management of data. ACM Press.

Bayardo, R. J. (1998). Efficiently mining long patterns from databases. In Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD'98). ACM Press.

Berzal, F., Blanco, I., Sanchez, D., & Vila, M. (2002). Measuring the accuracy and interest of association rules: A new framework. Intelligent Data Analysis, 6(3), 221–235.

Beynon, M., Cosker, D., & Marshall, D. (2001). An expert system for multi-criteria decision making using Dempster–Shafer theory. Expert Systems with Applications, 20(4), 357–367.

Dubois, D., & Prade, H. (1985). A note on measures of specificity for fuzzy sets. International Journal of General Systems, 10, 279–283.

Dubois, D., Prade, H., & Yager, R. (1999). Merging fuzzy information. In Fuzzy sets in approximate reasoning and information systems. Kluwer Academic Publishers.

Elomaa, T., & Rousu, J. (2004). Efficient multisplitting revisited: Optima-preserving elimination of partition candidates. Data Mining and Knowledge Discovery, 8(2), 97–126.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37–54.

Frank, A., & Asuncion, A. (2010). UCI machine learning repository. <http://archive.ics.uci.edu/ml>.

Guil, F., Palacios, F., Campos, M., & Marín, R. (2010). On the evaluation of mined frequent sequences. An evidence theory-based method. In Proceedings of the 3rd international conference on health informatics (HEALTHINF'10).

Guil, F., & Marín, R. (2012). A tree structure for event-based sequence mining. Knowledge-Based Systems, 35, 186–200.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18.

Luo, H., Yang, S. L., Hu, X. J., & Hu, X. X. (2012). Agent oriented intelligent fault diagnosis system using evidence theory. Expert Systems with Applications, 39(3), 2524–2531.

Pal, N., Bezdek, J., & Hemasinha, R. (1992). Uncertainty measures for evidential reasoning I: A review. International Journal of Approximate Reasoning, 7, 165–183.

Pal, N., Bezdek, J., & Hemasinha, R. (1993). Uncertainty measures for evidential reasoning II: A new measure of total uncertainty. International Journal of Approximate Reasoning, 8, 1–16.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999). Discovering frequent closed itemsets for association rules. In Proceedings of the 7th international conference on database theory (ICDT'99). Lecture Notes in Computer Science (Vol. 1540). Springer.

Scheffer, T. (2001). Finding association rules that trade support optimally against confidence. In Proceedings of the 5th European conference on principles of data mining and knowledge discovery (pp. 424–435).

Shafer, G. (1976). A mathematical theory of evidence. Princeton, NJ: Princeton University Press.

Thabtah, F. (2007). A review of associative classification mining. The Knowledge Engineering Review, 22(1), 37–65.

Yager, R. (1981). Measurement of properties of fuzzy sets and possibility distributions. In Proceedings of the third international seminar on fuzzy sets.

Yager, R. (1983). Entropy and specificity in a mathematical theory of evidence. International Journal of General Systems, 9, 249–260.

Yager, R. R. (2008). Entropy and specificity in a mathematical theory of evidence. In Classic works of the Dempster–Shafer theory of belief functions (pp. 291–310). Springer.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338–353.

Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1), 3–28.