
THE INTEGRATION OF FUZZY LOGIC AND GRAPH SEARCH

FOR SEQUENTIAL PATTERN MINING

SUKANYA YUENYONG

A THESIS SUBMITTED IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR

THE DEGREE OF MASTER OF SCIENCE

(TECHNOLOGY OF INFORMATION SYSTEM MANAGEMENT)

FACULTY OF GRADUATE STUDIES

MAHIDOL UNIVERSITY

2008

COPYRIGHT OF MAHIDOL UNIVERSITY

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my major adviser, Assist. Prof. Dr. Pisit Phokharatkul, for his kindness, valuable advice and numerous suggestions.

I would like to thank my co-advisers, Dr. Rangsipan Marukatat and Dr. Noppadol Wanichworanant, for their valuable recommendations.

I would like to thank Dr. Bunlur Emaruchi, the committee chair, for his kindness and support.

I would like to thank Assoc. Prof. Dr. Chom Kimpan, the external examiner of the thesis defense, for his suggestions and comments.

I would like to thank Mr. Noppadol Teeranachaideekul for his suggestions, and my best friends for their encouragement.

Finally, I am thankful to my mother and my uncle for their support, encouragement and love.

Sukanya Yuenyong

Fac. of Grad. Studies, Mahidol Univ. Thesis /

iv

THE INTEGRATION OF FUZZY LOGIC AND GRAPH SEARCH FOR

SEQUENTIAL PATTERN MINING

SUKANYA YUENYONG 4837225 EGTI / M M.Sc. (TECHNOLOGY OF INFORMATION SYSTEM MANAGEMENT) THESIS ADVISORS: PISIT PHOKHARATKUL, D.Eng., NOPPADOL

WANICHWORANANT, Ph.D., AND RUNGSIPAN MARUKATAT, Ph.D.

ABSTRACT

Sequential pattern discovery is an important problem in data mining. In recent years, many researchers have sought new techniques for extracting sequential patterns from large databases. This research proposes an effective way of integrating fuzzy logic and graph search methods into the fuzzy logic and graph search (FGS) algorithm for sequential pattern mining. The execution times of two graph search techniques were compared, and depth-first search (DFS) was found to take less execution time than breadth-first search (BFS). The FGS algorithm also takes less execution time than the GST algorithm when mining k-sequences with k ≥ 2. The outcomes of the FGS algorithm are more valuable than those of the GST algorithm because the quantitative values of each transaction are considered. Finally, the FGS algorithm was found to produce substantially fewer patterns than the GST algorithm. This reduction is sometimes an advantage, but not in all cases.

KEY WORDS: DATA MINING / SEQUENTIAL PATTERN / FUZZY LOGIC /

GRAPH SEARCH

46 pp.


SEQUENTIAL PATTERN MINING BY INTEGRATING FUZZY LOGIC AND GRAPH SEARCH (THE INTEGRATION OF FUZZY LOGIC AND GRAPH SEARCH FOR SEQUENTIAL PATTERN MINING)

SUKANYA YUENYONG 4837225 EGTI / M

M.Sc. (TECHNOLOGY OF INFORMATION SYSTEM MANAGEMENT) THESIS ADVISORS: PISIT PHOKHARATKUL, D.Eng., NOPPADOL WANICHWORANANT, Ph.D., RUNGSIPAN MARUKATAT, Ph.D.

ABSTRACT (TRANSLATED FROM THE THAI)

Sequential pattern discovery is one of the important problems in data mining. Over the years, many researchers have kept trying to find new techniques for extracting sequential patterns from large databases. This research presents the effectiveness of integrating fuzzy logic and graph search for sequential pattern mining and compares the execution time of two graph search methods. Depth-first search was found to take less execution time than breadth-first search. In addition, the integration of fuzzy logic and graph search takes less execution time than the graph search technique alone and gives more valuable results, because the proposed method also considers the quantity of each item in the database, which in turn reduces the number of sequential patterns. Reducing the number of patterns is sometimes an advantage, but it may be a disadvantage in some cases.

46 pp.


CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

CHAPTER

I INTRODUCTION
    1.1 General Introduction
    1.2 Statement of Problem
    1.3 Objective of work
    1.4 Scope of work
    1.5 Expected Result

II LITERATURE REVIEW
    2.1 Related Researches
    2.2 Data Mining
    2.3 Sequential Pattern
    2.4 Fuzzy Set
    2.5 Graph Theory

III MATERIALS AND METHODS
    3.1 Research Methodology and Procedure
    3.2 Materials

IV RESULTS
    4.1 Simulation Model
    4.2 Experimental Results

V DISCUSSION
    5.1 Comparison between algorithms

VI CONCLUSION AND RECOMMENDATIONS
    6.1 Conclusion of this work
    6.2 Recommendations

REFERENCES
BIOGRAPHY

LIST OF TABLES

Table 2.1 Sorted Transaction Data
Table 2.2 Large Item sets Minimum Support = 40%
Table 2.3 Transformed Database
Table 2.4 Example values
Table 3.1 The data table of 1-sequence
Table 3.2 The data table after transform the quantities values
Table 3.3 The data table after transform the quantities values of customer id ‘1000’
Table 3.4 All 2-sequence of customer id ‘1000’
Table 3.5 The data table of 2-sequence
Table 3.6 The relational graph table structure
Table 3.7 The relational graph table
Table 3.8 The sequential patterns table
Table 3.9 The sequential patterns table with confidence values
Table 4.1 The outcomes of the FGS algorithm total 6 Patterns

LIST OF FIGURES

Figure 2.1 Knowledge Discovery in Database processes
Figure 2.2 Bivalent Sets to Characterize the Temp. of a room
Figure 2.3 Fuzzy sets to characterize the Temp. of a room
Figure 2.4 A membership function based on the person's age
Figure 2.5 Fuzzy set Union
Figure 2.6 Fuzzy set Intersection
Figure 2.7 Fuzzy set Complement
Figure 2.8 Gives a pictorial view of this graph. The edge (x,x) is called a self-loop
Figure 2.9 Example of a directed graph
Figure 2.10 Example of an undirected graph
Figure 2.11 The Edge List Graph Representation
Figure 2.12 Breadth-first search spreading through a graph
Figure 2.13 Depth-first search on an undirected graph
Figure 3.1 Membership functions
Figure 3.2 The relational graph
Figure 4.1 The environment C250.I10.D5000
Figure 4.2 The environment C500.I20.D5000
Figure 4.3 The environment classify by gender
Figure 4.4 The environment classify by age

Fac. Of Grad. Studies, Mahidol Univ. M.Sc.(Tech. of Info. Sys.Management) / 1

CHAPTER I

INTRODUCTION

1.1 General Introduction

Nowadays, information is abundant in daily life and many data collections exist. Computers are a very popular tool, and we use them to collect data in various formats such as text files, databases and XML. The benefits of collecting data go beyond search and review: we can also extract desirable knowledge from the existing data, which is called data mining. Data mining is the process of extracting interesting information or patterns from large information repositories. Data mining has recently been recognized as a new area and has been growing at a rapid pace. Because of the rapid growth of data, new data mining techniques are urgently needed.

There are many types of data mining. Sequential pattern discovery is an important problem in data mining. In recent years, many researchers have tried, and continue to try, to find new techniques for extracting sequential patterns from large databases. They have used different algorithms, such as AprioriAll, Generalized Sequential Pattern (GSP), PrefixSpan, SPADE, Graph Search Techniques, MEMory Indexing for Sequential Pattern mining (MEMISP), Sequential Pattern mIning with Regular expressIon consTraints (SPIRIT), fuzzy algorithms, Multi-Dimensional Sequential Pattern Mining, Incremental Mining of Sequential Patterns and Periodic Pattern Analysis. Each algorithm has both advantages and disadvantages. Researchers try to remedy the disadvantages and build on the advantages.

1.2 Statement of Problem

The problem of finding sequential patterns is to determine what a customer will buy after having already bought some item or item set. The problem is therefore concerned with inter-transaction patterns.

Two things are important in mining sequential patterns. The first is performance in terms of execution time. The second is the result, that is, how to extract the most useful sequential patterns. The problem can be framed in many ways depending on the interest of the individual; in inventory management, for example, sequential pattern mining is used to predict consumer purchasing behavior. The outcome of sequential pattern mining can predict which product or group of products will be purchased next, given the product or group of products already purchased.

Although existing algorithms do extract sequential patterns, users always desire better patterns. New algorithms must therefore continue to address the two important concerns: reducing execution time and giving better results.

1.3 Objective of work

The objectives of this work are:

1.3.1 To develop a way of employing fuzzy logic and graph search algorithms in sequential pattern mining.

1.3.2 To benchmark the performance of sequential pattern mining using the integration of fuzzy logic and graph search algorithms.

1.4 Scope of work

The scope of this work is:

1.4.1 The execution time of sequential pattern mining is compared between depth-first search and breadth-first search.

1.4.2 The sample data in this work are synthetic data generated in the transaction database.

1.4.3 This research does not involve data cleaning or data preprocessing.


1.5 Expected Result

The outcome of this work will be a new algorithm for mining sequential patterns that integrates fuzzy logic and graph search. The advantages and disadvantages of this algorithm will be analyzed.


CHAPTER II

LITERATURE REVIEW

2.1 Related Researches

Sequential pattern mining was first introduced by Agrawal in “Mining sequential patterns” [4]. In recent years, researchers have tried to find new techniques for extracting sequential patterns from large databases. They have used different algorithms, such as AprioriAll and AprioriSome, the DSG algorithm, fuzzy algorithms and algorithms for mining path traversal patterns, which adopt an Apriori-like method to find sequential patterns. However, it is very costly to generate candidate sets and to scan the database repeatedly.

Graph Search Techniques [6] have an advantage over Apriori-like algorithms. This algorithm can generate large sequences without constructing candidate sequences, and can generate large k-sequences without deriving them step by step from large (k-1)-sequences. In addition, the algorithm takes time constraints into account, which makes the sequential patterns found more useful.

However, this algorithm does not consider the quantities in each transaction. Patterns that also convey quantities would be more valuable than those that do not. Such patterns can be applied to many tasks, such as catalog design, store layout, customer segmentation, cross-marketing strategies, cross-inventory strategies and assessing the effectiveness of promotional campaigns.

For the reasons above, this work proposes a new algorithm for sequential pattern mining that considers both quantity and time constraints.

2.2 Data Mining

Data mining [1] is the process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from large information repositories such as relational databases, data warehouses and XML repositories. Data mining is also known as one of the core processes of Knowledge Discovery in Database (KDD). The KDD processes are shown in Figure 2.1.

[Figure: diagram with the components Graphical User Interface, Pattern Evaluation, Data Mining Tools, Data Repositories (Database, Data Warehouse, Other Repositories), Data Cleaning & Integration, and Knowledge base]

Figure 2.1 Knowledge Discovery in Database processes.

The Knowledge Discovery in Database process begins with cleaning and integrating the databases. Because the data may come from different databases, which may contain inconsistencies and duplications, we must clean the data source by removing such noise or by making some compromises. Suppose two databases use different words in their schemas to refer to the same thing. When we integrate the two sources we can keep only one of the terms, provided we know that they denote the same thing. Real-world data also tend to be incomplete and noisy because of manual input mistakes. The integrated data can be stored in a database, a data warehouse or another repository.

Because not all the data in the database are related to the mining task, the second step is to select the task-related data from the integrated resources and transform them into a format that is ready to be mined. Suppose we want to find which items are often purchased together in a supermarket. The database that records the purchase history may contain the customer ID, items bought, transaction time, prices, quantity of each item and so on, but for this specific task we need only the items bought. After the relevant data are selected, the database to which we apply our data mining techniques is much smaller, so the whole process is more efficient.

Various data mining techniques are then applied to the data source, and different kinds of knowledge come out as the mining result. That knowledge is evaluated against certain rules, such as domain knowledge or concepts. As shown in Figure 2.1, if the result does not satisfy the requirements or contradicts the domain knowledge, some steps must be repeated until satisfactory results are obtained. Depending on the evaluation result, we may have to redo the mining, or the user may modify the requirements. Once the knowledge is obtained, the final step is to visualize the results, for example as data cubes or 3D graphics, so that the data mining results are easier to use and understand.

2.3 Sequential Pattern

A sequential pattern [1] is a sequence of item sets that occur frequently in a specific order. All items in the same item set are assumed to have the same transaction time or to fall within a given time gap. Usually all the transactions of a customer are viewed together as a sequence, called a customer sequence, in which each transaction is represented as an item set and the transactions are listed in order of transaction time.


2.3.1 Support

A customer supports a sequence s if s is contained in that customer's sequence. The support of sequence s is defined as the fraction of customers who support it:

$$
\text{Support}(s) = \frac{\text{Number of supporting customers}}{\text{Total number of customers}}
$$

Example: There are 10 customers. If the minimum support is set at 20%, every item or sequence counted as frequent must have been purchased by at least

$$
\text{Number of supporting customers} = 20\% \times 10 = 0.2 \times 10 = 2 \text{ customers.}
$$
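The support definition can be illustrated with a minimal Python sketch. The function names `contains` and `support` and the five toy customer sequences are assumptions made for illustration only; they are not taken from the thesis implementation.

```python
def contains(customer_sequence, pattern):
    """Return True if `pattern` occurs in `customer_sequence` in order.

    Both arguments are lists of item sets. Each item set of the pattern must
    be a subset of a distinct transaction, and the matched transactions must
    appear in the same order as the pattern.
    """
    position = 0
    for itemset in pattern:
        # Skip transactions that do not contain the current pattern item set.
        while position < len(customer_sequence) and not itemset <= customer_sequence[position]:
            position += 1
        if position == len(customer_sequence):
            return False
        position += 1  # the next pattern item set must match a later transaction
    return True


def support(customer_sequences, pattern):
    """Fraction of customers whose sequence contains `pattern`."""
    supporters = sum(contains(seq, pattern) for seq in customer_sequences)
    return supporters / len(customer_sequences)


# Hypothetical toy data: one list of transactions (item sets) per customer.
customers = [
    [{"A"}, {"B"}],
    [{"A", "C"}, {"D"}, {"B"}],
    [{"B"}, {"A"}],
    [{"C"}],
    [{"A"}, {"C"}],
]

print(support(customers, [{"A"}, {"B"}]))  # 2 of the 5 customers support <(A)(B)>
```

Only the first two customers buy A and later B, so the pattern passes a 40% minimum support threshold but would fail any higher one.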

2.3.2 Sequential support-confidence (S-confidence) [11]

The support confidence (s-confidence) of a sequential pattern S = <s_1, s_2, …, s_m>, where each s_j = (x_{j1} x_{j2} … x_{jk}) is an item set and x_{jt} is an item in s_j, is a measure that reflects the overall support affinity among the items within the sequence. It is the ratio of the minimum support of the items within the pattern to the maximum support of the items within the pattern. That is, the measure is defined as

$$
\text{S-conf}(S) =
\frac{\min_{1 \le m' \le m,\; 1 \le k' \le \text{length}(s_{m'})} \{\text{support}(\{x_{m'k'} \subseteq s_{m'}\})\}}
     {\max_{1 \le m'' \le m,\; 1 \le k'' \le \text{length}(s_{m''})} \{\text{support}(\{x_{m''k''} \subseteq s_{m''}\})\}}
$$

2.3.3 Sequential support affinity pattern [11]

A sequential pattern is a sequential support affinity pattern if its s-confidence is no less than a minimum s-confidence (min_sconf). In other words, a sequential pattern S is a sequential support affinity pattern if and only if |S| > 0 and s-confidence(S) ≥ min_sconf.

Example: consider the patterns S = {(AB)(AC)(ABC)(AE)} and S` = {(BC)(BD)(BCD)(BF)}. Assume that min_sconf is 0.5, support({A}) = 2, support({B}) = 5, support({C}) = 8, support({D}) = 4, support({E}) = 5, and support({F}) = 6, where support(X) is the support value of a sequential pattern X. Then s-confidence(S) is 0.25 (2/8) and s-confidence(S`) is 0.5 (4/8). Therefore, S is not a sequential support affinity pattern but S` is a sequential support affinity pattern.
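The worked example above can be checked with a short Python sketch. The function name `s_confidence` and the tuple-based encoding of the patterns are illustrative assumptions; the support values and the threshold are those stated in the example.

```python
def s_confidence(pattern, item_support):
    """Ratio of the smallest to the largest item support within a pattern."""
    items = {item for itemset in pattern for item in itemset}
    supports = [item_support[item] for item in items]
    return min(supports) / max(supports)


# Item supports and threshold as given in the example.
item_support = {"A": 2, "B": 5, "C": 8, "D": 4, "E": 5, "F": 6}
MIN_SCONF = 0.5

S = [("A", "B"), ("A", "C"), ("A", "B", "C"), ("A", "E")]
S_prime = [("B", "C"), ("B", "D"), ("B", "C", "D"), ("B", "F")]

for name, pattern in (("S", S), ("S`", S_prime)):
    conf = s_confidence(pattern, item_support)
    print(name, conf, "affinity pattern" if conf >= MIN_SCONF else "rejected")
```

S is rejected because item A is much rarer than item C, whereas the item supports in S` stay within a factor of two of each other.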

2.3.4 Sequential Pattern Mining

Sequential pattern mining is the process of extracting the sequential patterns whose support exceeds a predefined minimum support threshold. Because the number of sequences can be very large, and users have different interests and requirements, a minimum support is usually defined in advance by the user in order to obtain the most interesting sequential patterns. The minimum support allows sequential patterns of no interest to be pruned, which makes the mining process more efficient. A higher support is generally desired, since it indicates more useful and interesting sequential patterns.

Table 2.1 Sorted Transaction Data

Customer-id   Transaction-time   Purchased-items
1             Oct 23 '02         30
1             Oct 28 '02         90
2             Oct 18 '02         10, 20
2             Oct 21 '02         30
2             Oct 27 '02         40, 60, 70
3             Oct 15 '02         30, 50, 70
4             Oct 08 '02         30
4             Oct 16 '02         40, 70
4             Oct 25 '02         90
5             Oct 20 '02         90

Table 2.2 Large Item sets (Minimum Support = 40%)

Large Item sets   Mapped To
(30)              1
(40)              2
(70)              3
(40,70)           4
(90)              5

Table 2.3 Transformed Database

Customer-id   Customer Sequence          Transformed DB                      After Mapping
1             <(30)(90)>                 <{(30)}{(90)}>                      <{1}{5}>
2             <(10,20)(30)(40,60,70)>    <{(30)}{(40)(70)(40,70)}>           <{1}{2,3,4}>
3             <(30,50,70)>               <{(30)(70)}>                        <{1,3}>
4             <(30)(40,70)(90)>          <{(30)}{(40)(70)(40,70)}{(90)}>     <{1}{2,3,4}{5}>
5             <(90)>                     <{(90)}>                            <{5}>
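Tables 2.1 to 2.3 can be reproduced with the following Python sketch. It is a brute-force illustration under stated assumptions, not the procedure used by any of the cited algorithms: the helper names `large_itemsets` and `transform` are hypothetical, and every subset of every transaction is enumerated, which is practical only for data as small as this example.

```python
from itertools import combinations

# Customer sequences from Table 2.1, one list of transactions per customer.
customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
MIN_SUPPORT = 0.4


def large_itemsets(sequences, min_support):
    """Item sets (as sorted tuples) supported by at least min_support of customers."""
    counts = {}
    for transactions in sequences.values():
        supported = set()  # a customer counts at most once per item set
        for transaction in transactions:
            for size in range(1, len(transaction) + 1):
                supported.update(combinations(sorted(transaction), size))
        for itemset in supported:
            counts[itemset] = counts.get(itemset, 0) + 1
    total = len(sequences)
    return {itemset for itemset, count in counts.items() if count / total >= min_support}


# Identifiers assigned as in Table 2.2.
mapping = {(30,): 1, (40,): 2, (70,): 3, (40, 70): 4, (90,): 5}


def transform(transactions, mapping):
    """Replace each transaction by the ids of the large item sets it contains."""
    result = []
    for transaction in transactions:
        ids = {i for itemset, i in mapping.items() if set(itemset) <= transaction}
        if ids:  # transactions containing no large item set are dropped
            result.append(ids)
    return result


large = large_itemsets(customer_sequences, MIN_SUPPORT)
transformed = {c: transform(t, mapping) for c, t in customer_sequences.items()}
print(sorted(large))
print(transformed)
```

The first transaction of customer 2, (10,20), contains no large item set and therefore disappears from the transformed sequence, as in Table 2.3.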

Sequential pattern mining is used in a great spectrum of areas. In computational

biology, sequential pattern mining is used to analyze the mutation patterns of different

amino acids. Business organizations use sequential pattern mining to study customer

behaviors. Sequential pattern mining is also used in system performance analysis and

telecommunication network analysis.


2.4 Fuzzy Set

Fuzzy set theory [7] was formalized by Professor Lotfi Zadeh at the University of California in 1965. What Zadeh proposed was very much a paradigm shift; it first gained acceptance in the Far East, and its successful application has ensured its adoption around the world.

A paradigm is a set of rules and regulations which defines boundaries and tells us

what to do to be successful in solving problems within these boundaries. For example

the use of transistors instead of vacuum tubes is a paradigm shift - likewise the

development of Fuzzy Set Theory from conventional bivalent set theory is a paradigm

shift.

Bivalent set theory can be somewhat limiting if we wish to describe a 'humanistic' problem mathematically. For example, Fig. 2.2 below illustrates bivalent sets to characterize the temperature of a room.

Figure 2.2 Bivalent Sets to Characterize the Temp. of a room.

The most obvious limiting feature of bivalent sets that can be seen clearly from

the diagram is that they are mutually exclusive - it is not possible to have membership

of more than one set (opinion would widely vary as to whether 50 degrees Fahrenheit

is 'cold' or 'cool' hence the expert knowledge we need to define our system is

mathematically at odds with the humanistic world). Clearly, it is not accurate to define


a transition from a quantity such as 'warm' to 'hot' by the application of one degree

Fahrenheit of heat. In the real world a smooth (unnoticeable) drift from warm to hot

would occur.

This natural phenomenon can be described more accurately by fuzzy set theory. Fig. 2.3 below shows how fuzzy sets quantifying the same information can describe this natural drift.

Figure 2.3 Fuzzy sets to characterize the Temp. of a room.

The whole concept can be illustrated with this example. Let's talk about people

and "youngness". In this case the set S (the universe of discourse) is the set of people.

A fuzzy subset YOUNG is also defined, which answers the question "to what degree is

person x young?" To each person in the universe of discourse, we have to assign a

degree of membership in the fuzzy subset YOUNG. The easiest way to do this is with

a membership function based on the person's age.

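$$
\text{young}(x) =
\begin{cases}
1, & \text{if } \text{age}(x) \le 20 \\
(30 - \text{age}(x))/10, & \text{if } 20 < \text{age}(x) \le 30 \\
0, & \text{if } \text{age}(x) > 30
\end{cases}
$$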


A graph of this looks like:

Figure 2.4 a membership function based on the person's age

Given this definition, here are some example values:

Table 2.4 Example values.

Person Age Degree of youth

Johan 10 1.00

Edwin 21 0.90

Parthiban 25 0.50

Arosha 26 0.40

Chin Wei 28 0.20

Rajkumar 83 0.00
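The membership function and the values in Table 2.4 can be reproduced with a few lines of Python. This is only a sketch: the function takes the age directly rather than a person x, and the dictionary `ages` simply restates the ages listed in the table.

```python
def young(age):
    """Degree of membership in the fuzzy subset YOUNG for a given age."""
    if age <= 20:
        return 1.0
    if age <= 30:
        return (30 - age) / 10
    return 0.0


# Ages as listed in Table 2.4.
ages = {"Johan": 10, "Edwin": 21, "Parthiban": 25,
        "Arosha": 26, "Chin Wei": 28, "Rajkumar": 83}

for person, age in ages.items():
    print(f"{person:10s} {age:3d} {young(age):.2f}")
```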

So given this definition, we'd say that the degree of truth of the statement

"Parthiban is YOUNG" is 0.50.

Note: Membership functions almost never have as simple a shape as age(x). They

will at least tend to be triangles pointing up, and they can be much more complex than

that. Furthermore, membership functions so far are discussed as if they always are

based on a single criterion, but this isn't always the case, although it is the most

common case. One could, for example, want to have the membership function for

YOUNG depend on both a person's age and their height (Arosha's short for his age).

This is perfectly legitimate, and occasionally used in practice. It's referred to as a two-


dimensional membership function. It's also possible to have even more criteria, or to

have the membership function depend on elements from two completely different

universes of discourse.

2.4.1 Implementations[10]

For triangular and trapezoidal fuzzy set membership functions, let

a ≤ b ≤ c ≤ d denote characteristic points :
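The formulas that followed this sentence are not recoverable from the text. As a hedged illustration, the Python sketch below uses the common textbook forms of the trapezoidal and triangular membership functions; it is an assumption that these match the definitions in [10], and the function names `trapezoid` and `triangle` are hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership with a <= b <= c <= d.

    Rises linearly on (a, b), equals 1 on [b, c], falls linearly on (c, d),
    and is 0 elsewhere.
    """
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


def triangle(x, a, b, c):
    """Triangular membership: a trapezoid whose plateau is the single point b."""
    return trapezoid(x, a, b, b, c)


print(triangle(2.5, 0, 5, 10))     # halfway up the rising edge
print(trapezoid(7, 0, 2, 6, 10))   # partway down the falling edge
```

Treating the triangle as a degenerate trapezoid keeps a single code path for both shapes; the plateau test comes first so that shoulder shapes with a = b or c = d never divide by zero.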

2.4.2 Fuzzy Set Operations.

2.4.2.1 Union

The membership function of the union of two fuzzy sets A and B with membership functions μ_A and μ_B respectively is defined as the maximum of the two individual membership functions, μ_{A∪B}(x) = max(μ_A(x), μ_B(x)). This is called the maximum criterion.


Figure 2.5 Fuzzy set Union.

The Union operation in Fuzzy set theory is the equivalent of the OR operation in

Boolean algebra.

2.4.2.2 Intersection

The membership function of the intersection of two fuzzy sets A and B with membership functions μ_A and μ_B respectively is defined as the minimum of the two individual membership functions, μ_{A∩B}(x) = min(μ_A(x), μ_B(x)). This is called the minimum criterion.


Figure 2.6 Fuzzy set Intersection

The Intersection operation in Fuzzy set theory is the equivalent of the

AND operation in Boolean algebra.

2.4.2.3 Complement

The membership function of the complement of a fuzzy set A with membership function μ_A is defined as the negation of the specified membership function, μ_{¬A}(x) = 1 − μ_A(x). This is called the negation criterion.


Figure 2.7 Fuzzy set Complement

The Complement operation in Fuzzy set theory is the equivalent of the

NOT operation in Boolean algebra.
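The three operations can be sketched in Python as functions that combine membership functions. The maximum and minimum criteria follow the definitions above; the complement is taken as 1 minus the membership degree, which is the standard form and is assumed here. The membership function `not_so_young` is a hypothetical second fuzzy set introduced only for the demonstration.

```python
def fuzzy_union(mu_a, mu_b):
    """Maximum criterion: membership in A OR B."""
    return lambda x: max(mu_a(x), mu_b(x))


def fuzzy_intersection(mu_a, mu_b):
    """Minimum criterion: membership in A AND B."""
    return lambda x: min(mu_a(x), mu_b(x))


def fuzzy_complement(mu_a):
    """Negation criterion (standard complement, assumed): membership in NOT A."""
    return lambda x: 1.0 - mu_a(x)


def young(age):
    """YOUNG as defined in Section 2.4."""
    if age <= 20:
        return 1.0
    if age <= 30:
        return (30 - age) / 10
    return 0.0


def not_so_young(age):
    """Hypothetical second fuzzy set that rises from age 20 to age 30."""
    if age <= 20:
        return 0.0
    if age <= 30:
        return (age - 20) / 10
    return 1.0


age = 25
print(fuzzy_union(young, not_so_young)(age))
print(fuzzy_intersection(young, not_so_young)(age))
print(fuzzy_complement(young)(age))
```

For crisp membership degrees of exactly 0 or 1 these functions reduce to the Boolean OR, AND and NOT, which is the equivalence noted in the text.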

2.5 Graph Theory [8]

2.5.1 The Graph Abstraction

A graph is a mathematical abstraction that is useful for solving many kinds of

problems. Fundamentally, a graph consists of a set of vertices, and a set of edges,

where an edge is something that connects two vertices in the graph. More precisely, a

graph is a pair (V,E), where V is a finite set and E is a binary relation on V. V is called

a vertex set whose elements are called vertices. E is a collection of edges, where an

edge is a pair (u,v) with u,v in V. In a directed graph, edges are ordered pairs,

connecting a source vertex to a target vertex. In an undirected graph edges are

unordered pairs and connect the two vertices in both directions, hence in an undirected

graph (u,v) and (v,u) are two ways of writing the same edge.

This definition of a graph is vague in certain respects; it does not say what a

vertex or edge represents. They could be cities with connecting roads, or web-pages


with hyperlinks. These details are left out of the definition of a graph for an important

reason; they are not a necessary part of the graph abstraction. By leaving out the

details we can construct a theory that is reusable that can help us solve lots of different

kinds of problems.

Back to the definition: a graph is a set of vertices and edges. For purposes of

demonstration, let us consider a graph where we have labeled the vertices with letters,

and we write an edge simply as a pair of letters. Now we can write down an example

of a directed graph as follows:

V = {v, b, x, z, a, y } E = { (b,y), (b,y), (y,v), (z,a), (x,x), (b,x), (x,v), (a,z) } G = (V, E)

Figure 2.8 gives a pictorial view of this graph. The edge (x,x) is called a self-loop.

Edges (b,y) and (b,y) are parallel edges, which are allowed in a multigraph

(but are normally not allowed in a directed or undirected graph).

Figure 2.9 Example of a directed graph.

Next we have a similar graph, though this time it is undirected. Fig. 2.10 gives the pictorial view. Self loops are not allowed in undirected graphs. This graph is the undirected version of the previous graph (minus the parallel edge (b,y)), meaning it has the same vertices and the same edges with their directions removed. Also the self edge has been removed, and edges such as (a,z) and (z,a) are collapsed into one edge. One can go the other way, and make a directed version of an undirected graph by replacing each edge by two edges, one pointing in each direction.

V = {v, b, x, z, a, y }

E = { (b,y), (y,v), (z,a), (b,x), (x,v) }

G = (V, E)

Figure 2.10 Example of an undirected graph.

Now for some more graph terminology. If some edge (u,v) is in graph G, then

vertex v is adjacent to vertex u. In a directed graph, edge (u,v) is an out-edge of vertex

u and an in-edge of vertex v. In an undirected graph edge (u,v) is incident on vertices u

and v.

In Fig. 2.9, vertex y is adjacent to vertex b (but b is not adjacent to y). The

edge (b,y) is an out-edge of b and an in-edge of y. In Fig. 2.10, y is adjacent to b and

vice-versa. The edge (y,b) is incident on vertices y and b.

In a directed graph, the number of out-edges of a vertex is its out-degree and

the number of in-edges is its in-degree. For an undirected graph, the number of edges

incident to a vertex is its degree. In Fig. 2.9, vertex b has an out-degree of 3 and an in-

degree of zero. In Fig. 2.10, vertex b simply has a degree of 2.
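The degree values quoted above can be verified directly from the two edge sets given earlier. The sketch below is illustrative only; the variable names are assumptions, and the edges are stored as plain Python tuples.

```python
from collections import Counter

# Directed example graph (edge set E of the directed graph, parallel edge included).
directed_edges = [("b", "y"), ("b", "y"), ("y", "v"), ("z", "a"),
                  ("x", "x"), ("b", "x"), ("x", "v"), ("a", "z")]

out_degree = Counter(u for u, _ in directed_edges)  # edges leaving each vertex
in_degree = Counter(v for _, v in directed_edges)   # edges entering each vertex

# Undirected example graph (edge set E of the undirected graph).
undirected_edges = [("b", "y"), ("y", "v"), ("z", "a"), ("b", "x"), ("x", "v")]

degree = Counter()
for u, v in undirected_edges:
    degree[u] += 1  # an undirected edge is incident on both endpoints
    degree[v] += 1

print(out_degree["b"], in_degree["b"], degree["b"])  # 3 0 2
```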

Now a path is a sequence of edges in a graph such that the target vertex of

each edge is the source vertex of the next edge in the sequence. If there is a path

starting at vertex u and ending at vertex v we say that v is reachable from u. A path is


simple if none of the vertices in the sequence are repeated. The path <(b,x), (x,v)> is

simple, while the path <(a,z), (z,a)> is not. Also, the path <(a,z), (z,a)> is called a

cycle because the first and last vertex in the path are the same. A graph with no cycles

is acyclic.

A planar graph is a graph that can be drawn on a plane without any of the

edges crossing over each other. Such a drawing is called a plane graph. A face of a

plane graph is a connected region of the plane surrounded by edges. An important

property of planar graphs is that the number of faces, edges, and vertices are related

through Euler's formula: |F| - |E| + |V| = 2. This means that a simple planar graph has

at most O (|V|) edges.

2.5.2 Edge List Representation

An edge-list representation of a graph is simply a sequence of edges, where each edge is represented as a pair of vertex IDs. The memory required is only O(|E|). Edge insertion is typically O(1), though accessing a particular edge is O(|E|) (not efficient). Fig. 2.11 shows an edge-list representation of the graph in Fig. 2.9. The edge_list adaptor class can be used to create implementations of the edge-list representation.

Figure 2.11 The Edge List Graph Representation.
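A minimal Python sketch of an edge-list graph makes the stated costs concrete. The class `EdgeListGraph` and its method names are hypothetical and are unrelated to the edge_list adaptor class mentioned above; the edges are those of the directed example graph.

```python
class EdgeListGraph:
    """Graph stored as a plain sequence of (source, target) pairs."""

    def __init__(self):
        self.edges = []

    def add_edge(self, u, v):
        # Appending to the end of the list is O(1) on average.
        self.edges.append((u, v))

    def has_edge(self, u, v):
        # Finding a particular edge requires scanning the list: O(|E|).
        return any(edge == (u, v) for edge in self.edges)

    def out_edges(self, u):
        # Listing the out-edges of one vertex also scans every edge.
        return [edge for edge in self.edges if edge[0] == u]


graph = EdgeListGraph()
for u, v in [("b", "y"), ("b", "y"), ("y", "v"), ("z", "a"),
             ("x", "x"), ("b", "x"), ("x", "v"), ("a", "z")]:
    graph.add_edge(u, v)

print(graph.has_edge("b", "x"), graph.has_edge("v", "x"))
print(graph.out_edges("b"))
```

The linear scans are the reason this representation suits algorithms that simply iterate over all edges better than algorithms that repeatedly look up the neighbors of a vertex.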

2.5.3 Graph Search Algorithms

Tree edges are edges in the search tree (or forest) constructed (implicitly or

explicitly) by running a graph search algorithm over a graph. An edge (u,v) is a tree

edge if v was first discovered while exploring (corresponding to the visitor explore()

method) edge (u,v). Back edges connect vertices to their ancestors in a search tree. So

for edge (u,v) the vertex v must be the ancestor of vertex u. Self loops are considered

to be back edges. Forward edges are non-tree edges (u,v) that connect a vertex u to a


descendant v in a search tree. Cross edges are edges that do not fall into the above

three categories.

2.5.3.1 Breadth-First Search

Breadth-first search is a traversal through a graph that touches all of the

vertices reachable from a particular source vertex. In addition, the order of the

traversal is such that the algorithm will explore all of the neighbors of a vertex before

proceeding on to the neighbors of its neighbors. One way to think of breadth-first

search is that it expands like a wave emanating from a stone dropped into a pool of

water. Vertices in the same ``wave'' are the same distance from the source vertex. A

vertex is discovered the first time it is encountered by the algorithm. A vertex is

finished after all of its neighbors are explored. Here's an example to help make this

clear. A graph is shown in Fig. 2.12 and the BFS discovery and finish order for the

vertices is shown below.

Figure 2.12 Breadth-first search spreading through a graph.

Order of discovery: s r w v t x u y

Order of finish: s r w v t x u y

We start at vertex s, and first visit r and w (the two neighbors of s). Once both neighbors of s are visited, we visit the neighbor of r (vertex v), then the neighbors of w (the discovery order between r and w does not matter), which are t and x. Finally we visit the neighbors of t and x, which are u and y.
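The walk-through above can be reproduced with a short breadth-first search in Python. The adjacency lists below are an assumption: they contain only the edges that can be inferred from the walk-through, and the graph in Fig. 2.12 may contain further edges. The function name `bfs` is likewise illustrative.

```python
from collections import deque

# Edges inferred from the walk-through (an assumption; the figure may show more).
graph = {
    "s": ["r", "w"],
    "r": ["s", "v"],
    "w": ["s", "t", "x"],
    "v": ["r"],
    "t": ["w", "u"],
    "x": ["w", "y"],
    "u": ["t"],
    "y": ["x"],
}


def bfs(graph, source):
    """Return the discovery order and the finish order of a breadth-first search."""
    discovered = [source]
    finished = []
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:  # v is discovered the first time it is encountered
                seen.add(v)
                discovered.append(v)
                queue.append(v)
        finished.append(u)  # u is finished once all its neighbors are explored
    return discovered, finished


discovered, finished = bfs(graph, "s")
print("Order of discovery:", " ".join(discovered))
print("Order of finish:   ", " ".join(finished))
```

Because vertices leave the first-in, first-out queue in the order in which they entered it, the finish order of a breadth-first search always equals its discovery order.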


2.5.3.2 Depth-First Search

A depth first search (DFS) visits all the vertices in a graph. When

choosing which edge to explore next, this algorithm always chooses to go ``deeper''

into the graph. That is, it will pick the next adjacent unvisited vertex until reaching a

vertex that has no unvisited adjacent vertices. The algorithm will then backtrack to the

previous vertex and continue along any as-yet unexplored edges from that vertex.

After DFS has visited all the reachable vertices from a particular source vertex, it

chooses one of the remaining undiscovered vertices and continues the search. This

process creates a set of depth-first trees which together form the depth-first forest. A

depth-first search categorizes the edges in the graph into three categories: tree-edges,

back-edges, and forward or cross-edges (it does not specify which). There are typically

many valid depth-first forests for a given graph, and therefore many different (and

equally valid) ways to categorize the edges.

One interesting property of depth-first search is that the discover and

finish times for each vertex form a parenthesis structure. If we use an open-parenthesis

when a vertex is discovered and a close-parenthesis when a vertex is finished, then the

result is a properly nested set of parentheses. Fig. 2.13 shows DFS applied to an

undirected graph, with the edges labeled in the order they were explored. Below we

list the vertices of the graph ordered by discover and finish time, as well as show the

parenthesis structure. DFS is used as the kernel for several other graph algorithms,

including topological sort and two of the connected component algorithms. It can also

be used to detect cycles.


Figure 2.13 Depth-first search on an undirected graph.

Order of discovery: a b e d c f g h i

Order of finish: d f c e b a i h g

Parenthesis: (a (b (e (d d) (c (f f) c) e) b) a) (g (h (i i) h) g)
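The parenthesis structure can be reproduced with a short recursive sketch. The adjacency lists below are an assumption: they contain only the tree edges implied by the parenthesis string above, because the edges of Fig. 2.13 are not listed in the text. The function name dfs_parenthesis is illustrative only.

```python
# Assumed graph: only the tree edges implied by the parenthesis structure.
GRAPH = {
    "a": ["b"],
    "b": ["a", "e"],
    "e": ["b", "d", "c"],
    "d": ["e"],
    "c": ["e", "f"],
    "f": ["c"],
    "g": ["h"],
    "h": ["g", "i"],
    "i": ["h"],
}


def dfs_parenthesis(graph):
    """Return discovery order, finish order and the parenthesis string."""
    discovered, finished, parts = [], [], []
    seen = set()

    def visit(v):
        seen.add(v)
        discovered.append(v)
        parts.append("(" + v)      # open parenthesis at discovery
        for w in graph[v]:
            if w not in seen:
                visit(w)
        finished.append(v)
        parts.append(v + ")")      # close parenthesis at finish

    # Restart from every still-undiscovered vertex (depth-first forest).
    for v in graph:
        if v not in seen:
            visit(v)
    return discovered, finished, " ".join(parts)


if __name__ == "__main__":
    d, f, p = dfs_parenthesis(GRAPH)
    print("Order of discovery:", " ".join(d))
    print("Order of finish:", " ".join(f))
    print("Parenthesis:", p)
```

Each vertex opens a parenthesis when it is discovered and closes it when it is finished, so the nesting follows the recursion depth directly.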


CHAPTER III

MATERIALS AND METHODS

3.1 Research Methodology and Procedure

In this work, we integrate two techniques, fuzzy logic and graph search, for sequential pattern mining. First, we assume fuzzy membership functions for transforming the quantitative values. Next, we mine the sequential patterns with two graph search algorithms: depth-first search (DFS) and breadth-first search (BFS).

The step-by-step method for mining sequential patterns is as follows:

1. Prepare sample data.

2. Assume the fuzzy membership functions.

3. Transform the quantitative values.

4. Create 2-sequences from 1-sequences table.

5. Design and construct the relational graph table.

6. Search the sequential patterns from the relational graph table.

7. Calculate the sequential patterns confidence values.

8. Evaluate the performance of algorithm.

3.1.1 Prepare sample data.

For the sample data, we generate synthetic data in the transaction database with the Advanced Data Generator program. We use the notation C for the number of customers, I for the number of kinds of items, and D for the number of transactions. For example, C500.I5.D5000 represents the simulation environment with 500 customers who purchased items, 5 kinds of items, and 5000 transactions. We use various experimental data sets to evaluate the output.
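As a small illustration of the notation, the following sketch splits an environment label into its three counts. The function name parse_label and the dictionary keys are hypothetical and are not part of the thesis software.

```python
import re


def parse_label(label):
    """Split a label such as 'C500.I5.D5000' into its three counts."""
    match = re.fullmatch(r"C(\d+)\.I(\d+)\.D(\d+)", label)
    if match is None:
        raise ValueError("unexpected environment label: " + label)
    customers, items, transactions = (int(g) for g in match.groups())
    return {
        "customers": customers,        # C: number of customers
        "items": items,                # I: number of kinds of items
        "transactions": transactions,  # D: number of transactions
    }


if __name__ == "__main__":
    print(parse_label("C500.I5.D5000"))
```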

Sukanya Yuenyong Materials and methods/24

The structure of the data table (1-sequences) is as follows:

Table 3.1 The data table of 1-sequences

Purchased Item   Customer Id   Transaction Time   Quantity
C                1000          20/3/2550          10
A                1000          14/4/2550          9
E                1000          30/3/2550          7
B                1000          24/5/2550          2
C                1001          15/12/2550         10
A                1001          16/12/2550         9
A                1001          22/10/2550         8
B                1002          16/8/2550          5
C                1003          4/3/2550           1
C                1003          31/10/2550         10
A                1003          16/11/2550         9
D                1004          1/1/2550           2
A                1004          2/4/2550           7
E                1004          27/6/2550          11
D                1004          31/8/2550          1

3.1.2 Assume the fuzzy membership function.

In this work, the purchased quantities are divided into three fuzzy regions: Low, Middle and High. The example membership functions are shown as follows:


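[Figure 3.1: plot of the membership functions low, mid and hig; x-axis: number of items, with breakpoints at 3, 6 and 12; y-axis: membership value, from 0 to 1]

Figure 3.1 Membership functions.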

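\[
\mu_{low}(x) = \max\left(\min\left(\frac{6 - x}{6 - 3},\ 1\right),\ 0\right)
\]

\[
\mu_{mid}(x) = \max\left(\min\left(\frac{x - 3}{6 - 3},\ \frac{12 - x}{12 - 6}\right),\ 0\right)
\]

\[
\mu_{hig}(x) = \max\left(\min\left(\frac{x - 6}{12 - 6},\ 1\right),\ 0\right)
\]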

The fuzzy membership functions must be defined carefully. The number of membership functions and the range of quantities covered by each one depend on the problem at hand.

3.1.3 Transform the quantitative values.

We transform the quantitative value of each transaction datum into fuzzy sets using the membership functions given above. The result of each membership function is called the ‘degree of truth’ [12], so the maximum value means the maximum truth.

For example, suppose the quantity of item C is 5:

\[
\mu_{low}(5) = \max\left(\min\left(\frac{6 - 5}{6 - 3},\ 1\right),\ 0\right) = \max(\min(0.33,\ 1),\ 0) = \max(0.33,\ 0) = 0.33
\]

\[
\mu_{mid}(5) = \max\left(\min\left(\frac{5 - 3}{6 - 3},\ \frac{12 - 5}{12 - 6}\right),\ 0\right) = \max(\min(0.67,\ 1.17),\ 0) = \max(0.67,\ 0) = 0.67
\]

\[
\mu_{hig}(5) = \max\left(\min\left(\frac{5 - 6}{12 - 6},\ 1\right),\ 0\right) = \max(\min(-0.17,\ 1),\ 0) = \max(-0.17,\ 0) = 0
\]

The maximum degree of truth is

\[
\max(\mu_{low}(5),\ \mu_{mid}(5),\ \mu_{hig}(5)) = \max(0.33,\ 0.67,\ 0) = 0.67
\]

So we transform the quantitative value of item C to ‘Mid’, because the middle function gives the maximum degree of truth.
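The transformation can be sketched in a few lines of Python using the three membership functions of Section 3.1.2. The function names are illustrative. The tie rule is an assumption: the text does not say which region wins when two degrees are equal, so this sketch prefers the higher region, which agrees with the sample data labeling a quantity of 9 as hig.

```python
def mu_low(x):
    return max(min((6 - x) / (6 - 3), 1), 0)


def mu_mid(x):
    return max(min((x - 3) / (6 - 3), (12 - x) / (12 - 6)), 0)


def mu_hig(x):
    return max(min((x - 6) / (12 - 6), 1), 0)


def fuzzify(x):
    """Return the region with the maximum degree of truth and all degrees."""
    # Dictionary order decides ties (assumption): hig before mid before low.
    degrees = {"hig": mu_hig(x), "mid": mu_mid(x), "low": mu_low(x)}
    label = max(degrees, key=degrees.get)
    return label, degrees


if __name__ == "__main__":
    label, degrees = fuzzify(5)
    print(label, {k: round(v, 2) for k, v in degrees.items()})
```

For a quantity of 5 the sketch gives the same degrees as the worked example (0.33, 0.67 and 0) and selects the mid region.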

In addition, the support value of each item in the table must be greater than or equal to the minimum support. The support value is defined as the fraction of customers who support the sequence:

\[
\text{Support}(C) = \frac{\text{Number of supporting customers}}{\text{Total number of customers}}
\]

For example, suppose there are 5 customers. If we define the minimum support as 20%, each kind of item must be purchased by at least the following number of customers:

\[
20\% = \frac{\text{Number of supporting customers}}{5}
\]

Number of supporting customers = 20% * 5

= 0.2 * 5

= 1 person

So we prune away any kind of item that is purchased by fewer than one distinct customer.


The structure of the data table is as follows:

Table 3.2 The data table after transforming the quantitative values

Purchased Item   Customer Id   Transaction Time   Quantity
C-hig            1000          20/3/2550          10
A-hig            1000          14/4/2550          9
E-mid            1000          30/3/2550          7
B-low            1000          24/5/2550          2
C-hig            1001          15/12/2550         10
A-hig            1001          16/12/2550         9
A-hig            1001          22/10/2550         8
B-mid            1002          16/8/2550          5
C-low            1003          4/3/2550           1
C-hig            1003          31/10/2550         10
A-hig            1003          16/11/2550         9
D-low            1004          1/1/2550           2
A-mid            1004          2/4/2550           7
E-hig            1004          27/6/2550          11
D-low            1004          31/8/2550          1
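The support computation and the pruning step can be sketched as follows, using the (item, customer) pairs of the transformed sample data. The names support and prune are illustrative; the sketch counts each customer once per item, following the definition of support as a fraction of customers.

```python
# (fuzzy item, customer id) pairs taken from the transformed sample data.
PAIRS = [
    ("C-hig", 1000), ("A-hig", 1000), ("E-mid", 1000), ("B-low", 1000),
    ("C-hig", 1001), ("A-hig", 1001), ("A-hig", 1001),
    ("B-mid", 1002),
    ("C-low", 1003), ("C-hig", 1003), ("A-hig", 1003),
    ("D-low", 1004), ("A-mid", 1004), ("E-hig", 1004), ("D-low", 1004),
]


def support(pairs):
    """Fraction of distinct customers who purchased each item."""
    all_customers = {customer for _, customer in pairs}
    buyers = {}
    for item, customer in pairs:
        buyers.setdefault(item, set()).add(customer)
    return {item: len(c) / len(all_customers) for item, c in buyers.items()}


def prune(pairs, min_support):
    """Keep only the items whose support reaches the minimum support."""
    return {i: s for i, s in support(pairs).items() if s >= min_support}


if __name__ == "__main__":
    for item, value in sorted(support(PAIRS).items()):
        print(item, value)
    print(sorted(prune(PAIRS, 0.5)))
```

With a minimum support of 20% every item in the sample survives, because each is bought by at least one of the five customers; raising the threshold to 50% leaves only C-hig and A-hig.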

3.1.4 Create 2-sequences from 1-sequences table

We create 2-sequences from the 1-sequence data table above by pairing two items, where the first item must be purchased before the second. The support count value of each 2-sequence in the table must be greater than or equal to the minimum support, and the time space between the two items must not exceed the maximum interval value (the time constraint). For example, if we define the max-interval value as 30 days, we want to find the sequential patterns in which the time space between consecutive items is less than or equal to 30 days.


Table 3.3 The data table after transforming the quantitative values of customer id ‘1000’

Purchased Item   Customer Id   Transaction Time   Quantity
C-hig            1000          20/3/2550          10
A-hig            1000          14/4/2550          9
E-mid            1000          30/3/2550          7
B-low            1000          24/5/2550          2

Table 3.4 The 2-sequences of customer id ‘1000’

Sequence          Customer Id   Start Time   End Time    Time Space (days)
(C-hig)(A-hig)    1000          20/3/2550    14/4/2550   25
(C-hig)(E-mid)    1000          20/3/2550    30/3/2550   10
(C-hig)(B-low)    1000          20/3/2550    24/5/2550   65
(A-hig)(B-low)    1000          14/4/2550    24/5/2550   40
(E-mid)(A-hig)    1000          30/3/2550    14/4/2550   15
(E-mid)(B-low)    1000          30/3/2550    24/5/2550   55

The time spaces of (C-hig)(B-low), (A-hig)(B-low) and (E-mid)(B-low) are greater than the max-interval, which means they are uninteresting sequences.

Next, we prune away the 2-sequences whose support count value is less than the minimum support.

Table 3.5 The data table of 2-sequences

Sequence          Customer Id   Start Time    End Time
(C-hig)(A-hig)    1000          20/3/2550     14/4/2550
(C-hig)(E-mid)    1000          20/3/2550     30/3/2550
(E-mid)(A-hig)    1000          30/3/2550     14/4/2550
(C-hig)(A-hig)    1001          15/12/2550    16/12/2550
(C-hig)(A-hig)    1003          31/10/2550    16/11/2550


Note: If the start time is equal to the end time (items purchased on the same day), we separate the item names with a space inside one pair of parentheses, such as (C-hig A-mid).
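The pairing step with the time constraint can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, dates are read as Buddhist Era years (543 years ahead of the Gregorian calendar), and same-day pairs are skipped because they are written as a single itemset rather than a 2-sequence.

```python
from datetime import date
from itertools import combinations

MAX_INTERVAL_DAYS = 30  # max-interval used in the example

# (fuzzy item, customer id, transaction date) from the transformed sample data.
ROWS = [
    ("C-hig", 1000, "20/3/2550"), ("A-hig", 1000, "14/4/2550"),
    ("E-mid", 1000, "30/3/2550"), ("B-low", 1000, "24/5/2550"),
    ("C-hig", 1001, "15/12/2550"), ("A-hig", 1001, "16/12/2550"),
    ("A-hig", 1001, "22/10/2550"), ("B-mid", 1002, "16/8/2550"),
    ("C-low", 1003, "4/3/2550"), ("C-hig", 1003, "31/10/2550"),
    ("A-hig", 1003, "16/11/2550"), ("D-low", 1004, "1/1/2550"),
    ("A-mid", 1004, "2/4/2550"), ("E-hig", 1004, "27/6/2550"),
    ("D-low", 1004, "31/8/2550"),
]


def parse_be_date(text):
    """Convert 'day/month/Buddhist-Era-year' into a Gregorian date."""
    day, month, year_be = (int(part) for part in text.split("/"))
    return date(year_be - 543, month, day)


def two_sequences(rows, max_interval=MAX_INTERVAL_DAYS):
    """Pair items of the same customer whose gap is within max_interval."""
    by_customer = {}
    for item, customer, when in rows:
        by_customer.setdefault(customer, []).append((parse_be_date(when), item))
    result = []
    for customer, events in by_customer.items():
        events.sort()  # chronological order, so the first item comes first
        for (t1, first), (t2, second) in combinations(events, 2):
            gap = (t2 - t1).days
            if 0 < gap <= max_interval:
                result.append((first, second, customer, gap))
    return result


if __name__ == "__main__":
    for row in two_sequences(ROWS):
        print(row)
```

On the sample data this yields the same five 2-sequences as the 2-sequence data table; the support pruning that follows is a separate step.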

3.1.5 Design and construct the relational graph table.

We design a relational graph table with the important properties. The structure of the data table is as follows:

Table 3.6 The relational graph table structure

From_vertex   To_vertex   Edge_type   Customer Id   Start Time   End Time

Note: The data type of ‘Edge_type’ is Boolean; it is set to true if the start time is not equal to the end time.
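One possible in-memory form of a row of this table is sketched below. The class name GraphRow and the helper make_row are hypothetical; the thesis stores the table in a database, and only the column names and the Edge_type rule come from the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GraphRow:
    from_vertex: str
    to_vertex: str
    edge_type: bool   # True when start_time differs from end_time
    customer_id: int
    start_time: date
    end_time: date


def make_row(from_vertex, to_vertex, customer_id, start_time, end_time):
    """Build a row and derive Edge_type from the two times."""
    return GraphRow(from_vertex, to_vertex, start_time != end_time,
                    customer_id, start_time, end_time)


if __name__ == "__main__":
    # 20/3/2550 and 14/4/2550 in the Buddhist Era are in 2007 CE.
    print(make_row("C-hig", "A-hig", 1000, date(2007, 3, 20), date(2007, 4, 14)))
```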

3.1.6 Search the sequential patterns from the relational graph table.

We search for the sequential patterns in the relational graph table with two graph search algorithms: depth-first search (DFS) and breadth-first search (BFS).

DFS follows these rules:

Step 1: Select an unvisited node, visit it, and treat it as the current node.

Step 2: Find an unvisited neighbor of the current node by comparing the start time of the unvisited neighbor with the end time of the current node, visit it, and make it the new current node.

Step 3: If the current node has no unvisited neighbors, backtrack to its parent and make that the new current node; then repeat the above two steps until no more nodes can be visited.

Step 4: If there are still unvisited nodes, repeat from Step 1.


Algorithm:

Procedure DFS(input: graph G)
begin
  Stack S; Integer s, x;
  while (G has an unvisited node) do
    s := an unvisited node;
    visit(s);
    push(s, S);
    while (S is not empty) do
      x := top(S);
      if (x has an unvisited neighbor y) then
        visit(y);
        push(y, S);
      else
        pop(S);
      endif
    endwhile
  endwhile
end
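The DFS procedure translates almost line by line into Python. The sketch below uses an adjacency-list dictionary and a small hypothetical graph; it shows only the generic traversal and leaves out the comparison of start and end times described in Step 2.

```python
def dfs_order(graph):
    """Visit every node with an explicit stack and return the visit order."""
    visited, order = set(), []
    for s in graph:                      # pick the next unvisited node
        if s in visited:
            continue
        visited.add(s)
        order.append(s)
        stack = [s]
        while stack:
            x = stack[-1]                # top(S)
            y = next((n for n in graph[x] if n not in visited), None)
            if y is not None:            # go deeper
                visited.add(y)
                order.append(y)
                stack.append(y)
            else:                        # backtrack
                stack.pop()
    return order


if __name__ == "__main__":
    example = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
    print(dfs_order(example))
```

On the example graph the search goes from a to b to d before it backtracks to visit c.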

BFS follows these rules:

Step 1: Select an unvisited node s, visit it, and make it the root of the BFS tree being formed. Its level is called the current level.

Step 2: From each node x in the current level, in the order in which the level nodes were visited, visit all the unvisited neighbors of x. The newly visited nodes from this level form a new level that becomes the next current level.

Step 3: Repeat the previous step until no more nodes can be visited.

Step 4: If there are still unvisited nodes, repeat from Step 1.

Algorithm:

Procedure BFS(input: graph G)
begin
  Queue Q; Integer s, x;
  while (G has an unvisited node) do
    s := an unvisited node;
    visit(s);
    Enqueue(s, Q);
    while (Q is not empty) do
      x := Dequeue(Q);
      for (unvisited neighbor y of x) do
        visit(y);
        Enqueue(y, Q);
      endfor
    endwhile
  endwhile
end
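A corresponding Python sketch of the BFS procedure is given below, again on a small hypothetical adjacency-list graph and without the time comparison used in the mining step.

```python
from collections import deque


def bfs_order(graph):
    """Visit every node level by level and return the visit order."""
    visited, order = set(), []
    for s in graph:                      # pick the next unvisited node
        if s in visited:
            continue
        visited.add(s)
        order.append(s)
        queue = deque([s])
        while queue:
            x = queue.popleft()          # Dequeue(Q)
            for y in graph[x]:
                if y not in visited:
                    visited.add(y)
                    order.append(y)
                    queue.append(y)      # Enqueue(y, Q)
    return order


if __name__ == "__main__":
    example = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
    print(bfs_order(example))
```

On the example graph both neighbors of a are visited before d, which is the difference from the depth-first order.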

For example:

Table 3.7 The relational graph table

From_vertex   To_vertex   Edge_type   Customer Id   Start Time    End Time
C-hig         A-hig       T           1000          20/3/2550     14/4/2550
C-hig         E-mid       T           1000          20/3/2550     30/3/2550
E-mid         A-hig       T           1000          30/3/2550     14/4/2550
C-hig         A-hig       T           1001          15/12/2550    16/12/2550
C-hig         A-hig       T           1003          31/10/2550    16/11/2550

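[Figure 3.2: directed graph with vertices C-hig, A-hig and E-mid. Edge C-hig to A-hig is labeled 20/3/2550 > 14/4/2550, 15/12/2550 > 16/12/2550 and 31/10/2550 > 16/11/2550; edge C-hig to E-mid is labeled 20/3/2550 > 30/3/2550; edge E-mid to A-hig is labeled 30/3/2550 > 14/4/2550]

Figure 3.2 The relational graph.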

Table 3.8 The sequential patterns table

Sequential Patterns        Customer Id
C-hig > A-hig              1000
C-hig > A-hig              1001
C-hig > A-hig              1003
C-hig > E-mid > A-hig      1000

Next, we prune away the sequential patterns whose support count value is less than the minimum support.

Sequential Patterns:

1. C-hig > A-hig

2. C-hig > E-mid > A-hig
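The search over the relational graph table can be sketched as a depth-first enumeration of paths per customer. This is an illustration under assumptions, not the thesis implementation: two edges are chained when they belong to the same customer and the end time of the first equals the start time of the second, only maximal paths are reported, and every edge is assumed to have a start time earlier than its end time so that no cycles occur. The name maximal_patterns is hypothetical.

```python
# (from_vertex, to_vertex, customer id, start time, end time)
ROWS = [
    ("C-hig", "A-hig", 1000, "20/3/2550", "14/4/2550"),
    ("C-hig", "E-mid", 1000, "20/3/2550", "30/3/2550"),
    ("E-mid", "A-hig", 1000, "30/3/2550", "14/4/2550"),
    ("C-hig", "A-hig", 1001, "15/12/2550", "16/12/2550"),
    ("C-hig", "A-hig", 1003, "31/10/2550", "16/11/2550"),
]


def maximal_patterns(rows):
    """Enumerate maximal time-linked paths for each customer."""
    by_customer = {}
    for row in rows:
        by_customer.setdefault(row[2], []).append(row)

    patterns = []

    def extend(path, edges, customer):
        last = path[-1]
        # Followers start at the vertex and time where the last edge ends.
        followers = [e for e in edges if e[0] == last[1] and e[3] == last[4]]
        if not followers:
            vertices = [path[0][0]] + [e[1] for e in path]
            patterns.append((" > ".join(vertices), customer))
        for edge in followers:
            extend(path + [edge], edges, customer)  # depth-first step

    for customer, edges in by_customer.items():
        for edge in edges:
            has_predecessor = any(
                e[1] == edge[0] and e[4] == edge[3] for e in edges
            )
            if not has_predecessor:   # start only from the first purchases
                extend([edge], edges, customer)
    return patterns


if __name__ == "__main__":
    for pattern, customer in maximal_patterns(ROWS):
        print(pattern, customer)
```

On the five rows of the example table the sketch returns the same four (pattern, customer) entries as the sequential patterns table.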

3.1.7 Calculate the sequential patterns confidence values

Each pattern has a different confidence value. We calculate the confidence value [11] from

\[
\text{S-conf}(S) = \frac{\min_{1 \le m' \le m,\ 1 \le k' \le \text{length}(s_{m'})} \{\text{support}(\{x_{m'k'} \subseteq s_{m'}\})\}}{\max_{1 \le m'' \le m,\ 1 \le k'' \le \text{length}(s_{m''})} \{\text{support}(\{x_{m''k''} \subseteq s_{m''}\})\}}
\]

The supports of C-hig, A-hig and E-mid are 0.6, 0.6 and 0.2, respectively. So the confidence value of C-hig > A-hig is (0.6/0.6) = 1, and the confidence value of C-hig > E-mid > A-hig is (0.2/0.6) = 0.33.

Table 3.9 the sequential patterns table with confidence values.

Sequential Patterns Confidence

C-hig > A-hig 1

C-hig > E-mid > A-hig 0.33
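For patterns made of single items, as in this example, the confidence value reduces to the smallest item support divided by the largest. A minimal sketch, with the illustrative name s_confidence and the supports quoted in the text:

```python
SUPPORT = {"C-hig": 0.6, "A-hig": 0.6, "E-mid": 0.2}


def s_confidence(pattern, support):
    """Minimum item support divided by maximum item support."""
    values = [support[item] for item in pattern]
    return min(values) / max(values)


if __name__ == "__main__":
    print(s_confidence(["C-hig", "A-hig"], SUPPORT))
    print(round(s_confidence(["C-hig", "E-mid", "A-hig"], SUPPORT), 2))
```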

3.1.8 Evaluate the performance of algorithm.

We examine the execution time and the sequential patterns produced by the integration of fuzzy logic and graph search algorithm under different simulation environments and minimum supports.


3.2 Materials

• Hardware: The computer used for mining sequential patterns had the following specification:

• CPU Intel® Core™ 2 Duo T7300 (2.00 GHz.,

800 MHz. FSB, 4 MB L2)

• RAM 1024 MB. DDR2 667

• Hard Disk 160 GB 5400 RPM.

• Software:

• Microsoft Access

• Microsoft Visual Web Developer 2005 Express Edition

• Microsoft Word

• Advanced Data Generator


CHAPTER IV

RESULTS

4.1 Simulation Model

We undertake several experiments on a laptop computer with an Intel® Core™ 2 Duo T7300 (2.00 GHz, 800 MHz FSB, 4 MB L2), 1024 MB DDR2-667 RAM, and a 160 GB 5400 RPM HDD. We use the integration of fuzzy logic and graph search algorithm to mine sequential patterns. In the mining process, we use two graph search algorithms, depth-first search (DFS) and breadth-first search (BFS), and compare their execution times.

We generate the synthetic data with the Advanced Data Generator program. For each experiment, we use the notation C for the number of customers, I for the number of kinds of items, and D for the total number of transactions. For example, the experiment label C500.I10.D5000 represents the simulation environment with 500 customers, 10 kinds of items purchased by the customers, and 5000 total transactions.

4.2 Experimental Results

Experiment 1:

We explore the execution time of the graph search technique (GST) and the integration of fuzzy logic and graph search (FGS). In this experiment we use two graph search techniques: depth-first search (DFS) and breadth-first search (BFS). We experiment under different simulation environments and minimum supports, as shown in Fig. 4.1. Here we assume the fuzzy membership functions are low <= 3, middle = 10 and high >= 18, and the max-interval = 30 days. As a result, we found that DFS takes less execution time than BFS. Besides, the FGS algorithm takes less execution time than the GST algorithm for k-sequences with k >= 2. Even when we change the total number of customers and the total number of kinds of items, the FGS algorithm always takes less execution time than the GST algorithm.


Finally, we also found that a high minimum support requires little execution time, because the support values of the k-sequences then fall below the minimum support and those sequences are pruned.

[Figure 4.1(a): chart titled C250I10D5000; x-axis: minimum support (1, 0.75, 0.5, 0.38, 0.25, 0.2); y-axis: execution time; series: GST, DFS, BFS]

Figure 4.1(a) The environment C250.I10.D5000

[Figure 4.1(b): chart titled C500I10D5000; x-axis: minimum support (1, 0.75, 0.5, 0.38, 0.25, 0.2); y-axis: execution time; series: GST, DFS, BFS]

Figure 4.1(b) The environment C500.I10.D5000


[Figure 4.1(c): chart titled C1000I10D5000; x-axis: minimum support (1, 0.75, 0.5, 0.38, 0.25, 0.2); y-axis: execution time; series: GST, DFS, BFS]

Figure 4.1(c) The environment C1000.I10.D5000

[Figure 4.1(d): chart titled C500I15D5000; x-axis: minimum support (1, 0.75, 0.5, 0.38, 0.25, 0.2); y-axis: execution time; series: GST, DFS, BFS]

Figure 4.1(d) The environment C500.I15.D5000


[Figure 4.1(e): chart titled C500I20D5000; x-axis: minimum support (1, 0.75, 0.5, 0.38, 0.25, 0.2); y-axis: execution time; series: GST, DFS, BFS]

Figure 4.1(e) The environment C500.I20.D5000

Experiment 2:

We explore the outcomes of each algorithm, shown in Table 4.1. Here the experiment label is C250.I10.D5000, and we assume the fuzzy membership functions are low <= 3, middle = 10 and high >= 18, the max-interval = 30 days, and the minimum support = 20%. The FGS algorithm and the GST algorithm take 101.05 and 3,652.34 seconds, respectively, and the outcomes of the FGS algorithm carry more information than those of the GST algorithm: they indicate the quantity of each item in the sequential patterns. This gives more valuable outcomes for many issues such as catalog design, store layout, cross-marketing strategies, cross-inventory strategies, and the effectiveness of promotional campaigns. Besides, the FGS algorithm reduces the number of output patterns from 7860 to 6.

Table 4.1(a) The outcomes of the FGS algorithm (6 patterns in total)

Sequential Patterns           S-Confidence
B-hig -> G-hig                0.9794
I-hig -> A-hig                0.9695
F-hig -> A-hig                0.8781
B-hig -> A-hig                0.8719
F-hig -> A-hig -> A-hig       0.5737
B-hig -> G-hig -> G-hig       0.5697


Table 4.1(b) The outcomes of the GST algorithm (22 of 7860 patterns)

Sequential Patterns                                                  S-Confidence
A -> A -> A -> A -> A -> A -> A -> A -> H -> G                       0.8964
D -> A -> A -> B -> B -> B -> B -> B -> C -> A -> E                  0.8924
D -> A -> A -> C -> C -> C -> C -> A -> A -> C -> I                  0.8924
D -> A -> B -> B -> E -> G -> A -> B -> B -> J                       0.8924
E -> A -> A -> C -> C -> D -> G -> I -> F -> H -> G                  0.8725
E -> A -> B -> C -> C -> C -> A -> A -> E -> H                       0.8725
B -> A -> A -> A -> C -> C -> E -> G -> F -> G                       0.8645
B -> A -> A -> B -> B -> B -> B -> B -> C -> A -> E                  0.8645
B -> A -> C -> D -> E -> C -> A -> B -> H -> I -> I                  0.8645
B -> A -> D -> C -> A -> A -> A -> D -> D -> G -> H                  0.8645
B -> A -> F -> D -> A -> A -> C -> C                                 0.8645
B -> B -> A -> A -> A -> A -> A -> A -> A -> A -> H -> G             0.8645
B -> B -> A -> A -> C -> C -> D -> G -> I -> F -> H -> G             0.8645
E -> B -> A -> A -> A -> A -> A -> A -> A -> A -> H -> G             0.8645
C -> I -> C -> E -> E -> C -> F -> D -> G -> I                       0.8566
C -> I -> E -> D -> G -> A -> A -> J -> H -> H                       0.8566
C -> H -> D -> A -> A -> A -> A -> A -> A -> A -> A -> H -> G        0.8406
C -> H -> D -> A -> A -> B -> B                                      0.8406
C -> H -> E -> G -> D -> D -> D                                      0.8406
F -> A -> A -> C -> C -> D -> G -> I -> F -> H -> G                  0.8367
F -> A -> B -> D -> E -> C -> B -> E -> C                            0.8367
F -> B -> A -> A -> C -> C -> D -> G -> I -> F -> H -> G             0.8367

Experiment 3:

We explore the execution time of the FGS algorithm under different simulation environments and minimum supports, as shown in Fig. 4.2. Here the number of transactions is varied from 5000 to 20000 and the minimum support from 20% to 100%. When the number of transactions increases, the execution time increases as well.


[Figure 4.2: chart titled "The FGS algorithm"; x-axis: minimum support (1, 0.75, 0.5, 0.38, 0.25, 0.2); y-axis: execution time (sec); series: C500I10D5000, C500I10D10000, C500I10D20000]

Figure 4.2 the environment C500.I20.D5000

Experiment 4:

We explore the outcomes of the FGS algorithm, shown in Fig. 4.3. Here we assume the fuzzy membership functions are low <= 3, middle = 10 and high >= 18, and the max-interval = 30 days. We classify the customers by gender into male and female, and by age into four groups: 13-18, 19-25, 26-40 and 41-60. Gender and age do not affect the result. However, we found that when the number of transactions increases, the execution time increases as well.

[Figure 4.3(a): chart titled D500I10D5000; x-axis: minimum support (1, 0.75, 0.5, 0.38, 0.25, 0.2); y-axis: execution time (sec); series: All Gender, Female, Male]

Figure 4.3(a) The environment classified by gender


[Figure 4.3(b): chart titled D500I10D5000; x-axis: minimum support (1, 0.75, 0.5, 0.38, 0.25, 0.2); y-axis: execution time (sec); series: All Age, 13-18, 19-25, 26-40, 41-60]

Figure 4.3(b) The environment classified by age



CHAPTER V

DISCUSSION

5.1 Comparison between algorithms.

In comparing the algorithms, we consider the results of each algorithm in the same environment. We found that the depth-first search (DFS) algorithm takes less execution time than the breadth-first search (BFS) algorithm. Changing the total number of customers, the number of kinds of items, and the total number of transactions does not change this outcome.

When we compare the integration of fuzzy logic and graph search (FGS) with the graph search technique (GST), we find that the FGS algorithm takes less execution time than the GST algorithm. This is because the FGS algorithm can prune more items than the GST algorithm. For example, suppose the minimum support is 20%. If item C occurs in 25 transactions and 25% of customers purchased it, the GST algorithm inserts all transactions of item C into the 1-sequence data table. The FGS algorithm, however, uses fuzzy sets to classify item C into three sets: C-hig occurs in 12 transactions, C-mid in 6 transactions, and C-low in 7 transactions. Even though 25% of customers purchased item C, that does not mean all of its transactions are inserted into the 1-sequence data table. We must examine the support values of C-hig, C-mid and C-low separately. If only C-hig has a support value that reaches the minimum support, only the 12 transactions of C-hig are inserted into the 1-sequence data table. In this case, the FGS algorithm prunes the other 13 transactions of C, so it spends less execution time searching for sequential patterns than the GST algorithm. The FGS algorithm also greatly reduces the number of output patterns. This reduction may cause us to miss useful or important patterns.

We found that a higher minimum support always requires less execution time than a lower one, because a higher minimum support prunes more transactions.

Also, if we separate the transaction data by other properties such as gender and age, the outcomes are more valuable because we know who will buy the items.

Finally, we consider the quantity of each item. The outcomes are more valuable because the GST algorithm tells us only what customers will buy after they have bought certain items, whereas the FGS algorithm tells us both what and how many they will buy. This can be applied to many issues such as catalog design, store layout, cross-marketing strategies, cross-inventory strategies, and the effectiveness of promotional campaigns.


CHAPTER VI

CONCLUSION AND RECOMMENDATIONS

6.1 Conclusion of this work

We propose the integration of fuzzy logic and graph search (FGS) algorithm to find sequential patterns from a transaction database. Through the experiments, we found that the FGS algorithm is superior to the graph search technique (GST) algorithm.

The advantages of the FGS algorithm over the GST algorithm are as follows:

- The FGS algorithm takes less execution time than the GST algorithm for k-sequences with k >= 2.

- The outcomes of the FGS algorithm are more valuable than those of the GST algorithm because they take into account the quantity of each item in each transaction.

The other advantages of the FGS algorithm are as follows:

- The FGS algorithm can generate k-sequences (k >= 3) without deriving them from (k-1)-sequences.

- The outcomes of the FGS algorithm are more valuable because they take into account the time space between the items in each sequence.

The disadvantages of the FGS algorithm are as follows:

- The FGS algorithm takes more execution time than the GST algorithm for 1-sequences.

- The FGS algorithm greatly reduces the number of output patterns. This reduction may cause us to miss useful or important patterns.

6.2 Recommendations

In this work, we integrate fuzzy logic and graph search to mine sequential patterns, which gives more advantages than using a graph search algorithm alone.


In the future, we shall use another branch of fuzzy logic, the fuzzy graph, to mine sequential patterns. It would be very interesting to analyze how the results differ when fuzzy logic and graph search are merged in this way.


REFERENCES

1. Qiankun Zhao and Sourav S. Bhowmick. Sequential Pattern Mining: A Survey. Nanyang Technological University, Singapore, 2003.

2. Han, J. and Kamber, M. 2000. Data Mining: Concepts and Techniques. Morgan Kaufmann.

3. Yang, J., Wang, W., and Yu, P. S. 2001. Infominer: mining surprising periodic patterns. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 395-400.

4. Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Eleventh

International Conference on Data Engineering, P. S. Yu and A. S. P. Chen,

IEEE Computer Society Press, Taipei, Taiwan, 3-14.

5. Tzung-Pei Hong, Kuie-Ying Lin and Shyue-Liang Wang. Mining Fuzzy

Sequential Patterns from Multiple-Item Transactions. I-Shou University,

Taiwan.

6. Yin-Fu Huang and Shao-Yuan Lin. Mining Sequential Patterns Using Graph

Search Techniques. Institute of Electronic and Information Engineering

National Yunlin University of Science and Technology.

7. http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol2/jp6/article2.html

8. http://www.boost.org/libs/graph/doc/graph_theory_review.html

9. http://www.csc.umist.ac.uk/people/wolkenhauer.htm

10. Olaf Wolkenhauer. Fuzzy Mathematics. Control Systems Centre of UMIST, UK.

11. Unil Yun. Mining Sequential Support Affinity Patterns with Weight Constraints.

Electronics and Telecommunications Research Institute, Telematics & USN

Research Division, Telematics Service Convergence Research Team,Korea.

http://www.doc.ic.ac.uk/%7End/surprise_96/journal/vol2/jp6/article2.html


BIOGRAPHY

NAME Miss Sukanya Yuenyong

DATE OF BIRTH 11 November 1982

PLACE OF BIRTH Bangkok, Thailand

INSTITUTIONS ATTENDED Siam University, 2004:

Bachelor of Science

(Computer Science)

Mahidol University, 2008:

Master of Science

(Technology of Information System

Management)

HOME ADDRESS 139/1 M.6, Bhudsakhon Rd.,

Suanluang, Kratumban,

Samutsakhon, Thailand

Tel. 08-9125-5005

E-mail : [email protected]
