
Table of Contents

Chapter 1. Introduction
  1.1. What is Data Mining and Text Mining?
  1.2. Common steps in data mining and text mining
  1.3. Types of Data in Data Mining/Text Mining
  1.4. Some Data Mining Applications
  1.5. Text Mining Application
  1.6. Types of Mining Tasks
    1.6.1. Classification or Categorization: Finding the class of an object
    1.6.2. Prediction: Predicting the value for an object
    1.6.3. Clustering/Deviation Detection: Grouping data/Detecting outliers
    1.6.4. Association Analysis: Finding frequent co-occurrences
    1.6.5. Characterization and Discrimination: Describing a class or concept
    1.6.6. Meta Functionalities: Link, Outlier, and Trend/Evolutional Analysis
    1.6.7. Visualization
  1.7. Challenges in Data Mining and Text Mining
  1.8. Summary
  1.9. Historical Bibliography
Exercise

Sponsored by AIAT.or.th and KINDML, SIIT

CC: BY NC ND


Chapter 1. Introduction

Data mining, also known as knowledge discovery, is a process for uncovering hidden patterns in large-

scale data. It has attracted a great deal of attention because huge amounts of recorded data are now

widely available, creating a problem of information overload and

an urgent need to turn such data into useful information and knowledge. The volume of data

held in storage has grown to the level of giga (10^9) bytes or tera (10^12) bytes,

and is expected to reach peta (10^15) bytes in the near future. This constant data generation creates

not only managerial issues but also challenging opportunities. Because of this data explosion,

people are drowning in data yet starving for useful

knowledge. To use huge amounts of data efficiently and effectively, data warehousing

with on-line analytical processing (OLAP) and data mining/knowledge discovery from databases

have emerged to turn such large, scattered data into useful, systematic information and knowledge, exploiting

currently available cost-effective computing power. While data warehousing with OLAP focuses

on interactive, manual exploration of interesting patterns, data mining seeks a set of

autonomous methods for analyzing large amounts of data and extracting useful new knowledge

embedded in such data semi-automatically or fully automatically. Data mining is not entirely new: it adopts

techniques from several traditional fields, such as machine learning, statistical reasoning and

information theory, to gain from observation of new complex phenomena the insight necessary to

increase our knowledge.

As introductory material for the young and promising field of data mining and knowledge

discovery from data, this book aims to provide basic data mining concepts and

techniques for revealing patterns hidden in large-scale data sets, from a database

perspective. Both technical background concepts and examples are presented to help

readers learn data mining and text mining effectively. This chapter describes

what data mining and text mining are, including their approaches and applications. Readers will gain

insight into the types of input data that mining can process and the types of output information

and knowledge that can be learned. They will also learn the types of algorithms that can be used to

extract patterns from data to form pieces of information and knowledge, a number of issues

related to implementation, and future challenges in data mining and text mining.

1.1. What is Data Mining and Text Mining?

Data mining, also known as knowledge discovery in databases, refers to a set of methods to

extract or discover (mine) implicit, previously unknown, and potentially useful information or

knowledge from large amounts of data. The term “data mining” is something of a misnomer; a more

appropriate name would be “knowledge mining.” There are also several other names, such as knowledge

extraction, data analysis, pattern analysis, data archaeology, data dredging, information

harvesting, business intelligence and so forth. An aim in data mining is to invent a set of methods

that seek regularities or patterns from a large-scale database automatically. Once strong patterns

are found, they can be used as generalizations to make accurate predictions on

future data. Naturally during the mining process, a large number of patterns may be found but

only a small portion of this set is interesting and useful. The others tend to be spurious,

contingent on accidental coincidences in the particular dataset used. At this point, the important

issue is how to select those interesting patterns from a large pile of mined patterns. Moreover,

data found in real situations tend to be imperfect, with garbled parts and/or missing values.

Methods in data mining need to be robust enough to cope with these imperfect data and to

extract regularities that are interesting and useful. Several core techniques in data mining come



from statistical analysis and machine learning, which take data as input and infer whatever

structure underlies it.

In contrast to data mining, which deals with structured data, text mining handles

unstructured or semi-structured textual data (such as journal articles, news articles and online

web contents), not formalized database records. It usually involves structuring the

input text by way of syntactic (parsing), semantic, discourse, and/or pragmatic analysis,

deriving patterns within the structured data, and finally evaluating and interpreting the output.

Text mining usually aims to obtain results with high relevance, novelty, and interestingness.

One major difference between data mining and text mining is preprocessing.

While preprocessing in data mining is relatively simple, focusing on data cleansing, data

integration, data transformation, and data reduction, preprocessing in text mining

requires more effort to identify and extract representative features from texts written in a

natural language. These preprocessing operations, which are not needed in data mining, are responsible

for transforming unstructured data stored in text collections into a more explicitly structured

intermediate format. For this reason, text mining needs to exploit techniques and

methodologies from the areas of natural language processing and human language technology,

including information retrieval, information extraction, text classification, text clustering, and

corpus-based computational linguistics.

However, it is quite common to find many data mining techniques used in text mining work

and vice versa. The architectures in both areas are also very similar. For example, both data

mining and text mining rely on preprocessing steps, mining algorithms, and presentation and

visualization techniques to enhance the interpretation of discovered patterns.

1.2. Common steps in data mining and text mining

The common process of data mining (knowledge discovery) is composed of the following nine

steps as shown in Figure 1-1. In the first step, the developers and users need to discuss what the

application looks like, which prior knowledge they have or need, and what the final goal of the

application is. After we determine the application, a dataset to be mined is prepared in the

second step. The third and fourth steps are two forms of data preprocessing. In the third step, the

data-cleansing step is performed in order to eliminate the noise that may be included in the data

set. It is also possible for us, in the fourth step, to eliminate unrelated parameters (or variables) in

order to obtain an unbiased or invariant data representation. After these preprocessing steps, it

is necessary for us to choose the task of mining in the fifth step. The common types of mining

tasks are classification, regression, clustering, association analysis, outlier analysis and trend

analysis. The task type implies what we intend to obtain from the data. For each type of task,

there may be several possible algorithms or methods to achieve the tasks. For example, there

exist several classification methods, such as naïve Bayes classification, decision tree induction,

classification rule construction, instance-based classification (k-nearest neighbor), and support

vector machines. Therefore, the sixth step is to select the suitable algorithm to achieve the task.

The seventh step is the main step, where the mining process takes place. In general, the mining

process produces a set of generalized patterns, which may be very large or may include overly

specific patterns, a problem known as overfitting. To address these problems, the eighth step provides

mechanisms to evaluate the obtained patterns and filter out the trivial (uninteresting) ones, or to

visualize the patterns so that a user can view them in graphical

form and more easily find novel or interesting patterns. Finally, in the ninth step, the

discovered knowledge is consolidated and made ready for use in future applications.



1. Define the application in terms of the relevant prior knowledge and the end user's goals.

2. Prepare a target data set used for knowledge discovery or data mining.

3. Clean data, such as handling missing data fields, reducing noise in the data, accounting for time series, and transforming data to an appropriate scale or format.

4. Reduce the number of parameters (variables) and find invariant representations of data if possible.

5. Select a suitable task (classification, regression, clustering, association analysis, and so on).

6. Select a suitable data mining algorithm for mining based on the chosen task.

7. Perform data mining in order to find dominant patterns.

8. Interpret and evaluate the mined patterns with some objective function and/or perform visualization to find interesting patterns.

9. Consolidate and exploit knowledge discovered.

Figure 1-1: Steps in data mining
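Steps 3 to 8 of Figure 1-1 can be sketched on a toy data set as follows. This is only an illustrative sketch: the records, the attribute meanings, and the function names are hypothetical, and the mining algorithm (a simple one-attribute rule learner) is just one of many possible choices for a classification task. The accuracy is measured on the same records used for mining, whereas a realistic evaluation would use separate test data.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (fat level, glucose level, clinic id, class label).
# None marks a missing value; the clinic id is constant and carries no information.
RAW = [
    ("High", "Normal", "A", "Y"),
    ("High", "High", "A", "Y"),
    ("High", None, "A", "Y"),  # missing value, removed in step 3
    ("Low", "Normal", "A", "N"),
    ("Low", "High", "A", "N"),
    ("Middle", "Normal", "A", "N"),
]


def clean(records):
    """Step 3: drop records that contain missing values."""
    return [r for r in records if None not in r]


def reduce_attributes(records):
    """Step 4: drop attributes (class label excluded) that take a single value."""
    n_attrs = len(records[0]) - 1
    keep = [i for i in range(n_attrs) if len({r[i] for r in records}) > 1]
    return [tuple(r[i] for i in keep) + (r[-1],) for r in records], keep


def mine_one_rule(records):
    """Steps 5-7: the task is classification, the algorithm a one-attribute rule.

    For each attribute, predict the majority class of each attribute value and
    keep the attribute whose rule classifies the most records correctly.
    """
    best = None
    for i in range(len(records[0]) - 1):
        by_value = defaultdict(Counter)
        for r in records:
            by_value[r[i]][r[-1]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(rule[r[i]] == r[-1] for r in records)
        if best is None or correct > best[2]:
            best = (i, rule, correct)
    return best


def evaluate(records, attr, rule):
    """Step 8: fraction of records that the rule classifies correctly."""
    return sum(rule[r[attr]] == r[-1] for r in records) / len(records)


data = clean(RAW)
data, kept = reduce_attributes(data)
attr, rule, _ = mine_one_rule(data)
accuracy = evaluate(data, attr, rule)
print(kept, attr, rule, accuracy)
```

Keeping each step as a separate function mirrors the figure: any single step (for example, the algorithm in step 6) can be swapped without touching the others.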

For text mining, a number of additional steps are needed in the second step, i.e.,

preprocessing unstructured or semi-structured textual data in the form of documents. This step

is necessary to transform semi-structured or unstructured texts into a structured form

that is ready for data mining. Techniques and methodologies borrowed from the fields of

natural language processing (NLP) and human language technology (HLT) are needed for this

preprocessing. The remaining steps of text mining follow the same steps as the process of data

mining.
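One common way to turn unstructured text into a structured form is a term-count (bag-of-words) table, sketched below. The two sentences and the simple tokenizer are assumptions made for illustration; real preprocessing would typically add steps such as stop-word removal and stemming.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for unstructured documents.
docs = [
    "Data mining extracts patterns from data.",
    "Text mining extracts patterns from text documents.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Vocabulary: every distinct term, sorted so that the column order is reproducible.
vocabulary = sorted({term for doc in docs for term in tokenize(doc)})

# One row per document, one column per term, each cell a term count.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The resulting rows have a fixed set of attributes, so they can be handled like records in a relational table by standard data mining methods.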

1.3. Types of Data in Data Mining/Text Mining

This section describes a number of different data/text repositories for mining. Possible data

repositories for data mining are relational databases, transactional databases, temporal

databases, sequence databases, time-series databases, data streams, spatial databases,

spatiotemporal databases, and multimedia databases, while those for text mining are text

databases including offline and online electronic documents/texts, hypertext/hypermedia

databases, and digitized document images. The challenges and techniques of data mining and

text mining may differ for each of the repository data types.

Relational Databases

The most typical data source for mining is a relational database, which is a collection

of tables, each of which is assigned a unique name. Before obtaining the collection of relational

tables, database designers or developers may use a semantic data model, such as an entity-

relationship (ER) data model, as a tool to design the database. Here, each table consists of a set of

attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each



record (tuple) in a relational table represents an object identified by a unique key and described

by a set of attribute values. It is possible to join multiple tables into a single table, and then apply a data

mining method to the integrated table. In general, relational databases may be used for both

operational and management purposes. Often they are divided into operational

databases for the former and data warehouses for the latter, and mining can

take place at both levels. Figure 1-2 shows two example files in a relational database: (a) one

with nominal attributes (labels) and (b) the other with numeric attributes.

Case  Fat Level (F)  Protein Level (P)  Glucose Level (G)  Positive (C)
1     High           High               Normal             Y
2     High           High               High               Y
3     High           Normal             Normal             Y
4     High           Normal             High               N
5     Middle         High               High               Y
6     Middle         High               Normal             N
7     Middle         Normal             High               N
8     Middle         Normal             Normal             N
9     Low            High               Normal             N
10    Low            High               High               Y
11    Low            Normal             Normal             N
12    Low            Normal             Normal             N

(a) A data set with nominal attributes

Case  Fat Level (F)  Protein Level (P)  Glucose Level (G)  Positive (C)
1     270            18                 80                 Y
2     320            19                 140                Y
3     300            10                 90                 Y
4     285            9                  150                N
5     200            20                 160                Y
6     170            21                 100                N
7     190            12                 130                N
8     130            13                 95                 N
9     100            20                 80                 N
10    95             19                 140                Y
11    110            14                 105                N
12    90             8                  98                 N

(b) A data set with numeric attributes

Figure 1-2: Two example files in a relational database; (a) one with nominal attributes, and (b)

the other with numeric attributes. Here, the mapping condition is as follows. Fat: high (>220),

middle (125-220), low (<125); Protein: normal (8-16) and high (>16); Glucose: normal (75-

120) and high (>120).
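The mapping from the numeric table (b) to the nominal table (a) can be sketched as a small discretization routine. The function names are illustrative, and the sketch assumes that a glucose value above 120 counts as high, which is consistent with the values in the two tables.

```python
# Numeric records of Figure 1-2 (b): (case, fat, protein, glucose, positive).
numeric = [
    (1, 270, 18, 80, "Y"), (2, 320, 19, 140, "Y"), (3, 300, 10, 90, "Y"),
    (4, 285, 9, 150, "N"), (5, 200, 20, 160, "Y"), (6, 170, 21, 100, "N"),
    (7, 190, 12, 130, "N"), (8, 130, 13, 95, "N"), (9, 100, 20, 80, "N"),
    (10, 95, 19, 140, "Y"), (11, 110, 14, 105, "N"), (12, 90, 8, 98, "N"),
]


def fat_level(fat):
    """Fat: high (>220), middle (125-220), low (<125)."""
    if fat > 220:
        return "High"
    if fat >= 125:
        return "Middle"
    return "Low"


def protein_level(protein):
    """Protein: normal (8-16), high (>16)."""
    return "High" if protein > 16 else "Normal"


def glucose_level(glucose):
    """Glucose: normal (75-120), high (>120, an assumption matching the tables)."""
    return "High" if glucose > 120 else "Normal"


nominal = [
    (case, fat_level(f), protein_level(p), glucose_level(g), positive)
    for case, f, p, g, positive in numeric
]

for row in nominal:
    print(row)
```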

Transactional Databases

Besides the relational databases, a transactional database can be used to keep events in the form

of a file where each record represents a transaction. Typically, each record in a transaction

database possesses a unique transaction identity number together with a list of the items making

up the transaction (such as items purchased in a store). Although it is possible to transform data

kept in a transactional database (Figure 1-3 (a)) into the form of a relational database (Figure 1-3

(b)), sometimes it is more convenient to store data in the former format. This format is usually

used when association rule mining (or frequent pattern mining) is applied to identify frequent

itemsets (the sets of items that frequently co-occur). More details can be found in Chapter 4.



TID ITEMS

1 bowl, fish, ice, pepsi, shirt, water

2 bowl, fish, ice, pepsi, rice, sweet, water

3 fish, meat, orange, rice, shirt, sweet, water

4 ice, pepsi, rice, shirt, sweet, water

5 bowl, ice, meat, orange, pepsi, rice, water

6 fish, meat, orange, rice, shirt, sweet, water

7 meat, pepsi, rice, sweet, water

8 bowl, pepsi, rice, shirt, sweet, water

(a) Transactional database format

TID  Bowl  Fish  Ice  Meat  Orange  Pepsi  Rice  Shirt  Sweet  Water
1    1     1     1    0     0       1      0     1      0      1
2    1     1     1    0     0       1      1     0      1      1
3    0     1     0    1     1       0      1     1      1      1
4    0     0     1    0     0       1      1     1      1      1
5    1     0     1    1     1       1      1     0      0      1
6    0     1     0    1     1       0      1     1      1      1
7    0     0     0    1     0       1      1     0      1      1
8    1     0     0    0     0       1      1     1      1      1

(b) Relational database format

Figure 1-3: A part of (a) a transactional database keeping sales records of a supermarket, and (b) its corresponding relational database.
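The conversion from the transactional format (a) to the relational format (b) can be sketched as follows, using the eight transactions of Figure 1-3. Representing an absent item by 0 and the helper name `support` are choices made for this sketch; `support` simply counts the transactions that contain every item of a given itemset, which is the basic quantity used in frequent itemset mining.

```python
# Transactions of Figure 1-3 (a): transaction id -> list of items.
transactions = {
    1: ["bowl", "fish", "ice", "pepsi", "shirt", "water"],
    2: ["bowl", "fish", "ice", "pepsi", "rice", "sweet", "water"],
    3: ["fish", "meat", "orange", "rice", "shirt", "sweet", "water"],
    4: ["ice", "pepsi", "rice", "shirt", "sweet", "water"],
    5: ["bowl", "ice", "meat", "orange", "pepsi", "rice", "water"],
    6: ["fish", "meat", "orange", "rice", "shirt", "sweet", "water"],
    7: ["meat", "pepsi", "rice", "sweet", "water"],
    8: ["bowl", "pepsi", "rice", "shirt", "sweet", "water"],
}

# Column headers of the relational table: all distinct items in sorted order.
items = sorted({item for basket in transactions.values() for item in basket})

# One row per transaction: 1 if the item was bought, 0 otherwise.
table = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(wanted <= set(basket) for basket in transactions.values())


print(items)
for tid, row in table.items():
    print(tid, row)
print(support({"pepsi", "water"}))
```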

Temporal Databases, Sequence Databases, Time-Series Databases, and Data Streams

Temporal databases typically store data in a relational format together with time-related

attributes. These temporal attributes usually represent timestamps whose meaning depends on the

application, such as stock exchange, inventory control, telecommunications, and system

monitoring. More generally, a sequence database stores sequences of ordered events, with or

without a concrete notion of time, such as customer shopping sequences, web click streams, and

biological sequences. Some sequence databases, called data streams, have no time notation, such

as video streams, speech streams and sensor signal streams. Most data

streams are presented at a rather low level of abstraction, whereas what we are interested in usually

lies at higher and multiple levels of abstraction. Because of this, we need multilevel and

multidimensional analysis for mining stream data. A time-series database stores sequences of

values or events obtained from repeated measurements over time (e.g., hourly, daily, weekly), such

as data collected from the observation of natural phenomena (like temperature and wind). An

objective of mining these types of data is to find the evolutionary characteristics of an object or an

event, or to understand the trend of changes for objects or events in the database, and ultimately to

support decision making and strategy planning. Examples are mining banking data to suggest

a suitable schedule for bank tellers according to the volume of customer traffic, and mining

stock exchange data to uncover trends that help in planning investments. Moreover, it is also

possible to interpret web logs in order to understand user behavior. Figure 1-4 shows four

examples: temporal data, sequence data, DNA sequence data, and time-series data. The first

example includes records of person name, location, date-from and date-to, describing when a person stays at

a place and then changes his/her residence to another place. The second example presents a

sequential customer purchasing history with customer ID, date, time, location and purchased

items with their prices in parentheses. The third example shows a fragment of a DNA sequence,

composed of four different characters, i.e., A, C, G and T. The last example gives the time-

series records of the birth rate (per 10,000) for 23-year-old women in the U.S. from 1917 to 1975.



Name Location From To

John Nuvo BigCity 01 September 1954 12 October 1998

John Nuvo SmallCity 13 October 1998 04 November 2006

John Nuvo BeachLand 05 November 2006 08 February 2009

Susan Dango RiceField 12 September 1956 30 July 1990

Susan Dango SmallCity 31 July 1990

Richard Navado RiverRock 15 April 1950 10 May 1998

Richard Navado IceCope 11 May 1998 23 March 2004

Richard Navado PizzaHut 24 March 2004

(a) An example of a temporal database related to residence changes.

CusID Date Time Location ItemBought

C0001 10/11/2009 10:00 Central Station Apple ($10), Pepsi ($2), Gum ($1)

C0002 10/11/2009 11:00 Big C Burger ($3), Fish ($5), Bread ($3)

C0001 10/11/2009 14:00 Big C Rice ($10), Milk ($8), Bread ($3), Gum ($2)

C0004 11/11/2009 09:00 K-Mart Bread ($4), Coke ($2)

C0002 11/11/2009 12:00 Seven-Eleven Gum ($2), Sweet ($3), Potato ($3), Ham ($5)

C0001 11/11/2009 15:00 K-Mart Orange ($4), Magazine ($6)

C0002 12/11/2009 11:00 Seven-Eleven Ham ($4), Bread ($3), Apple ($5), Water ($3)

C0003 12/11/2009 14:00 Central Station Rice ($12), Potato ($5), Butter ($6)

C0001 12/11/2009 15:00 Big C Diaper ($20), Butter ($4), Shirt ($20)

(b) An example of a sequence database related to customer purchasing history.

GAATTCTCTGTAAAAGAGTGGACCATTAAAGTAGAATTAATTGAAGGAATTCTCTGTAAGAGTACCAGGGAATTA

ATTAGGGTAGACCCATTGGGAAGGAATTCTGGCTGTAAGAGTACCATTAAAGTAGCCCCTTGAAGGGAATTTCTT

TGGCCCAAAGGGAAATAAGTAGAAAATTGAGAAACCCCCGGT

(c) A DNA sequence fragment

Year  Birth    Year  Birth    Year  Birth    Year  Birth
1917  183.1    1932  131.5    1947  212.0    1962  252.8
1918  183.9    1933  125.7    1948  200.4    1963  240.0
1919  163.1    1934  129.5    1949  201.8    1964  229.1
1920  179.5    1935  129.6    1950  200.7    1965  204.8
1921  181.4    1936  129.5    1951  215.6    1966  193.3
1922  173.4    1937  132.2    1952  222.5    1967  179.0
1923  167.6    1938  134.1    1953  231.5    1968  178.1
1924  177.4    1939  132.1    1954  237.9    1969  181.1
1925  171.7    1940  137.4    1955  244.0    1970  165.6
1926  170.1    1941  148.1    1956  259.4    1971  159.8
1927  163.7    1942  174.1    1957  268.8    1972  136.1
1928  151.9    1943  174.7    1958  264.3    1973  126.3
1929  145.4    1944  156.7    1959  264.5    1974  123.3
1930  145.0    1945  143.3    1960  268.1    1975  118.5
1931  138.9    1946  189.7    1961  264.0

(d) A time-series: Births per 10,000 of 23 year old women, U.S., 1917-1975

Figure 1-4: Temporal data, sequence data, DNA sequence data and time-series data
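As a small illustration of trend analysis on time-series data, the sketch below smooths a slice of the birth-rate series in Figure 1-4 (d) with a moving average and locates its peak. The choice of a three-year window and the function name `moving_average` are assumptions made for this example; many other smoothing and trend-detection methods exist.

```python
# A slice of the series in Figure 1-4 (d): births per 10,000 23-year-old women.
births = {
    1955: 244.0, 1956: 259.4, 1957: 268.8, 1958: 264.3,
    1959: 264.5, 1960: 268.1, 1961: 264.0,
}


def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


years = sorted(births)
series = [births[year] for year in years]

# Smoothing suppresses year-to-year fluctuation so that the trend is easier to see.
smoothed = moving_average(series, 3)

# Year with the highest recorded value in this slice.
peak_year = max(births, key=births.get)

print(smoothed)
print(peak_year)
```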

Spatial Databases and Spatiotemporal Databases

Spatial databases contain information related to space, including points, lines and polygons.

Examples are geographic (map) databases, computer-aided design databases, and medical and

satellite image databases. Spatial data may be represented in the raster format (Figure 1-5 (a)),

consisting of n-dimensional bitmaps or pixel maps. They can also be represented in the vector

format (Figure 1-5 (b)), where roads, bridges, buildings, and lakes are represented as unions or

overlays of basic geometric constructs, such as points, lines, polygons, and the partitions and



networks formed by these components. Spatiotemporal databases extend spatial databases

with a time component. Applications of

spatial and spatiotemporal databases are varied and include routing and scheduling, city planning,

forestry and ecology planning, and the provision of public service information regarding the location of

basic facilities such as telephone and electric cables, pipes, and sewage systems.

(a) the raster format (b) the vector format

Figure 1-5: Spatial data represented in (a) the raster format and (b) the vector format

Multimedia Databases

Multimedia databases are repositories of image, audio, and video data, used in several

applications including picture content-based retrieval, voice-mail systems, video-on-demand

systems, and speech-based user interfaces that recognize spoken commands. In computer

systems, a multimedia database may keep one or more primary media file types such as .txt

(documents), .jpg (images), .swf (videos), .mp3 (audio) and so on. Multimedia data are loosely

classified into three main categories: (1) static media (time-independent, e.g., images and

handwriting), (2) dynamic media (time-dependent, e.g., video and sound bytes), and (3)

dimensional media (e.g., 3D games or computer-aided drafting (CAD) programs). For multimedia

data mining, storage and search techniques need to be integrated with standard data mining

methods. Some promising tools are the construction of multimedia data cubes, the extraction of

multiple features from multimedia data, and similarity-based pattern matching.

Text Databases: Offline and Online Electronic Documents/Texts

Text databases consist of information in the form of sentences or paragraphs, such as journal

articles, e-books, product descriptions and specifications, reports and notes, or other documents,

stored in either offline or online environments. Some text databases are unstructured, such as

electronic documents and plain text (without tags) on the World Wide Web (Figure 1-6). Some

text databases are partially structured or semi-structured, such as e-mails and web pages with

specified tags (Figure 1-7). Some text databases are highly structured such as library catalogues.



Such text databases with highly regular structures typically can be implemented using relational

database systems. Mining processes in text databases include summarization, categorization,

clustering, information retrieval and document association.

Data mining is the process of extracting patterns from data. As more data are gathered, with the amount of data

doubling every three years [1] data mining is becoming an increasingly important tool to transform these data into

information.

….

To avoid confusion with the other sense, the terms data dredging and data snooping are often used. Note, however,

that dredging and snooping can be (and sometimes are) used as exploratory tools when developing and clarifying

hypotheses.

Figure 1-6: An example of a plain text without tags.

Figure 1-7: An example of book records with structure (source: http://lib.siit.tu.ac.th/)

Hypertext/Hypermedia Databases

A hypertext is a text with references or links (called hyperlinks) to other text that the reader can

access promptly by following a hyperlink in the text, usually by a mouse click or key press

sequence (Figure 1-8). Hypermedia is a logical extension of hypertext in which graphics, audio,

video, plain text and hyperlinks are integrated together to create a non-linear medium of

information. Recently, hypertext and hypermedia databases have become a major way of describing

objects in both online and offline environments. Users can find information of interest by

traversing from one object via links to another. This link information provides new opportunities

and challenges for data mining. For example, with link information, we can improve the performance

of text classification/clustering and relevance ranking. Apart from mining text information,

it is also possible to interpret user access patterns from the usage of

hypertexts or hypermedia, known as web usage mining, in order to improve system design and user behavior analysis.



(a) a hypertext with links

[Source]

<b>Hypertext</b> is text displayed on a computer with references (<a href="/wiki/Hyperlinks"

title="Hyperlinks">hyperlinks</a>) to other text that the reader can immediately access, usually by a mouse

click or keypress sequence. Apart from running text, hypertext may contain tables, images and other

presentational devices. Other means of interaction may also be present, such as a bubble with text appearing

when the mouse hovers over a particular area, a video clip starting, or a form to complete and submit. The

most extensive example of hypertext today is the <a href="/wiki/World_Wide_Web" title="World Wide

Web">World Wide Web</a>.</p>

(b) the source of the hypertext

Figure 1-8: An example of a hypertext with its source

(source: http://en.wikipedia.org/wiki/Hypertext)
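Link information of the kind shown in Figure 1-8 (b) can be extracted with a standard HTML parser before any mining takes place. The sketch below uses `html.parser` from the Python standard library on an abridged copy of the source fragment; the class name `LinkExtractor` is illustrative.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag seen in the input."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Abridged version of the hypertext source in Figure 1-8 (b).
source = (
    '<b>Hypertext</b> is text displayed on a computer with references '
    '(<a href="/wiki/Hyperlinks" title="Hyperlinks">hyperlinks</a>) to other text. '
    'The most extensive example of hypertext today is the '
    '<a href="/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a>.'
)

parser = LinkExtractor()
parser.feed(source)
print(parser.links)
```

The extracted links form the edges of a document graph, which is the kind of link information that classification, clustering, and ranking methods can exploit.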

1.4. Some Data Mining Applications

Data mining can be applied in wide and diverse areas. Some major applications are listed below.

Data Mining in Retail Industry Data

Finding interesting patterns in large amounts of data on sales, customer shopping records,

goods transportation, consumption, and service provides useful information and knowledge for

managing the retail industry. Data mining in retail databases can help us identify customer shopping

behaviors, patterns, and trends. This information is useful for improving the quality of

customer service, achieving better customer retention and satisfaction, enhancing goods

consumption ratios, designing more effective goods transportation and distribution policies, and

reducing the cost of business. Recently, with the growing popularity of online business transactions

(e-commerce), the quantity of data collected continues to expand rapidly.

Applying data mining to these databases to discover purchasing patterns can help guide the design

and development of data warehouse structures, manage effective sales campaigns, and retain

customers with personalized product recommendations and targeted services.

Data Mining in Financial Data

Currently, a wide variety of financial services are offered, such as deposit, withdrawal,

loan, foreign exchange, and investment. Financial data collected from these services are relatively

reliable and of high quality. Therefore, data analysis and/or data mining (knowledge discovery) on



these data are realistic and useful. For example, data analysis enables us to view the average,

total, maximum, minimum, trend, outlier and other statistical characteristics of each financial

resource (e.g., deposit, credit, fund, stock, etc), together with their changes by time period, by

geographical region, by customer sector, and by other factors. It is possible to cluster or

categorize customers into a set of target groups for marketing purposes, such as loan payment

prediction and credit policy analysis. Data mining in financial data can also support the detection

of money laundering and other financial crimes. To detect money laundering and other financial

crimes, it is necessary to integrate information from multiple databases for finding unusual

patterns, such as large amounts of cash flow at certain periods, by certain groups of customers.
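The idea of flagging unusually large cash flows can be sketched with a simple z-score rule. This is only an illustration: the daily totals, the threshold of two standard deviations, and the function name `unusual_days` are assumptions rather than values from the text, and a realistic system would integrate information from multiple databases as described above.

```python
from statistics import mean, pstdev

# Hypothetical daily cash deposits (in thousands) for one group of customers.
daily_cash = [10, 12, 11, 9, 10, 95, 11, 10]


def unusual_days(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard
    deviations away from the mean of the series."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


print(unusual_days(daily_cash))
```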

Data Mining in Telecommunication Data

At present, the telecommunication industry generates a tremendous amount of data, falling

into three categories: call detail data, describing the calls traversing the

telecommunication networks; network data, describing the state of the hardware and

software components used in the network; and customer data, describing the

telecommunication usage of customers. Data mining can be used to uncover useful

information hidden in these three data sets to identify telecommunication fraud, identify

network faults, and improve marketing performance. Fraudulent activity costs the

telecommunication industry millions of dollars per year. With data mining techniques, we can

detect potentially fraudulent actions and usage patterns; identify attempts to gain fraudulent

entry to other customers’ accounts; and reveal unusual patterns that may harm the system itself,

such as busy-hour frustrated call attempts, as well as router/switch congestion. With data mining,

we can also find software/hardware-related problems inside the telecommunication system.

Moreover, by finding frequent sequential patterns, such as frequent calling patterns, we can

promote, for example, the sales of specific long-distance and cellular phone combinations and improve

the availability of particular services in the region.

Data Mining in Biological Data

Recently, biotechnology has become popular, with many potential uses. Its rapid advancement

triggers an explosive growth of biological data, such as those in genomics, functional genomics,

proteomics, and biomedical studies. Applications include the identification and comparative

analysis of human genomes and other species’ genomes (by discovering sequencing patterns,

gene functions, and evolution paths), the investigation of genetic networks and protein pathways,

and the development of new pharmaceuticals and advances in cancer treatment. Known as

bioinformatics, the field of biological data mining is broad, rich, and dynamic.

1.5. Text Mining Application

Text mining is a knowledge-based process in which a user interacts with a document

collection, and its tasks and applications are vast. Analogous to data mining, text mining (alternatively

referred to as text data mining) aims to extract useful information from data sources through the

identification and exploration of interesting patterns. Some typical text mining tasks are text

categorization, text clustering, concept/entity extraction, production of granular taxonomies,

sentiment analysis, document summarization, and entity relation modeling (i.e., learning

relations between named entities). A number of text mining applications are given below.

Text Mining for Biomedical Applications

With the accelerated growth of online biomedical information, a set of computational tools is required to filter public biomedical text databases and to highlight their relevant information in a well-organized and coherent manner. For example, it is possible to find large numbers of apparent correlations among thousands of genes by analyzing the gene-gene relationships stated in biomedical research articles on mRNA expression profiling experiments with cDNA microarrays and oligonucleotide chips. The large body of biomedical literature gives us a good opportunity to detect hidden relationships.

Text Mining for Security Applications

Recently, information security has evolved from focusing only on data or actions related to the network and server layers to also covering text content, such as web pages and email, transmitted through the network at the application layer. A number of powerful software packages have been developed to monitor web content, radio broadcasts, and cellular/telephone speech in order to find certain keywords. When the keywords are found, the web and/or speech results are recorded and analyzed. One of the dominant systems related to text mining for security is the classified ECHELON surveillance system, developed by the United States National Security Agency (NSA) in partnership with the UK, Canada, Australia and New Zealand. In addition, AeroText and Attensity are software products marketed for security applications; they analyze plain-text sources such as online news articles. Text categorization also enables us to understand the differences between normal and malicious user behavior from the log entries generated by an online application server.

Text Mining for Marketing Applications

Text mining can also help marketing professionals find nuggets of information that support good decisions. As a simple type of text mining, search engines (or information retrieval systems) can help us with our daily tasks. For example, one can type a set of keywords related to a topic of interest to find a set of existing related documents that may help. Moreover, text mining can help us find relationships among different keywords using concept clustering, indexing, association, feature extraction, information visualization, and summarization. This technology helps marketing professionals identify hidden information that could reveal business opportunities, using visually interactive, user-friendly tools that depict patterns and relationships between keywords.

Text Mining for Academic Applications

Many academic articles are available online as abstracts and full texts, either publicly or commercially, e.g., through CiteSeer, ACM Portal, Elsevier ScienceDirect, SpringerLink, PubMed, MS Academic Search, Google Scholar and DBLP. As its most basic function, text mining can help us properly index documents for later retrieval of similar academic articles. It also helps us group similar articles for literature review or other purposes, provide semantic cues that let machines answer specific queries, and find relations among multiple articles for deeper analysis.

Text Mining for Patent Analysis

Generally, patent documents express important research and development results. However, they are usually lengthy and rich in technical terminology, so analyzing them requires considerable human effort. From the many millions of patent documents in a patent database, one may need to find those similar to a given or intended one. An emerging application of text mining is to search patents based on similarity to assist patent engineers or decision makers in patent analysis.



1.6. Types of Mining Tasks

As stated in Section 1.3, there are various types of databases and information repositories on which we can apply data mining or text mining to find intuitive, valid, useful and novel patterns. In general, there are two classes of data mining tasks, as follows.

1. Descriptive mining aims to characterize general properties of the data in the database, without attempting to predict future events. Examples are characterization, discrimination, clustering, and association rule mining (or frequent pattern mining).

2. Predictive mining aims to build a model that is later used to predict related or prospective events by making use of information hidden in the current data. Examples are classification, numeric prediction, and pattern recognition.

Once we have a set of data to be mined, we have to determine which kinds of mining tasks to perform. In many cases, we may have no idea what kinds of interesting patterns can be found in the data, and hence we may search for several different kinds of patterns in parallel. Mining multiple kinds of patterns may also accommodate different user expectations or applications. Besides the type of patterns to be mined, in several cases we need a mechanism that guides users to search for and discover interesting patterns at various granularities (different levels of abstraction) in an interactive environment, and that allows users to track topic changes across different time periods. Moreover, since some patterns may not hold for all of the data in the database, a measure of certainty or “trustworthiness” is usually associated with each discovered pattern. The following sections describe some common types of data mining tasks.

1.6.1. Classification or Categorization: Finding the class of an object

Classification (or categorization) is a common task in human activities that involves making a decision or forecast in an unknown or future situation, using currently available information. From another point of view, classification is the process of constructing a model (or function) that describes and distinguishes different data classes or concepts, so that the model can later be used to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known). In some areas, classification is referred to as pattern recognition, discrimination, or supervised learning, in contrast with unsupervised learning or clustering, where no classes are predefined but are instead inferred from the data. Classification has been applied to solve scientific, industrial and commercial problems. Some typical classification tasks are the recognition of a letter from a character image (as in an automatic postcode reader), the assignment of a credit status to a customer on the basis of financial and other personal information, and the preliminary diagnosis of a patient’s disease while waiting for definitive test results.

A learned classification model can be expressed in various forms. Some common forms are classification (IF-THEN) rules, decision trees, mathematical formulae, and neural networks. Given the sample dataset in Figure 1-9 (a), examples of these forms are shown in Figure 1-9 (b)-(d) and Figure 1-10, respectively. A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute value, each branch represents an outcome of the test, and the tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules. A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naïve Bayesian classification, support vector machines, and k-nearest neighbor classification.
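To make the decision-tree form concrete, the following is a minimal Python sketch of the tree in Figure 1-9 (c) as it can be read from the figure: Outlook at the root, Humidity tested under sunny, Windy tested under rainy. The function name `classify` and the tuple layout are illustrative assumptions, not part of the original text. The tuples are the 14 rows of Figure 1-9 (a).

```python
# Play-Tennis tuples from Figure 1-9 (a): (outlook, temp, humidity, windy, play)
DATA = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def classify(outlook, humidity, windy):
    """Follow the decision tree from the root (Outlook) down to a leaf."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    # Remaining branch: outlook == "rainy", which tests Windy.
    return "no" if windy else "yes"


# Temp. is skipped because this tree never tests it.
correct = sum(classify(o, h, w) == play for o, _temp, h, w, play in DATA)
print(f"{correct} of {len(DATA)} training tuples classified correctly")
```

Each root-to-leaf path of the function corresponds to one IF-THEN rule, which is why a decision tree can be converted to classification rules so easily.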



Outlook    Temp.  Humidity  Windy  Play
sunny      hot    high      false  no
sunny      hot    high      true   no
overcast   hot    high      false  yes
rainy      mild   high      false  yes
rainy      cool   normal    false  yes
rainy      cool   normal    true   no
overcast   cool   normal    true   yes
sunny      mild   high      false  no
sunny      cool   normal    false  yes
rainy      mild   normal    false  yes
sunny      mild   normal    true   yes
overcast   mild   high      true   yes
overcast   hot    normal    false  yes
rainy      mild   high      true   no

(a) A sample data set (the Play-Tennis data set)

1. If (Outlook = "Overcast") then Play = "Yes"
2. If (Humidity = "Normal" and Windy = "False") then Play = "Yes"
3. If (Temp = "Mild" and Humidity = "Normal") then Play = "Yes"
4. If (Outlook = "Rainy" and Windy = "False") then Play = "Yes"

(b) Classification rules

Outlook
  sunny    -> Humidity
                high   -> No
                normal -> Yes
  overcast -> Yes
  rainy    -> Windy
                false  -> Yes
                true   -> No

(c) Decision Tree

Play(yes) = 0.6 * outlook(sunny) + 1.0 * outlook(overcast) + 0.2 * outlook(rainy)
          + 0.1 * temp(hot) + 0.2 * temp(mild) + 0.2 * temp(cool)
          + 0.5 * humidity(high) + 0.8 * humidity(normal)
          + 0.6 * windy(false) + 0.3 * windy(true)

Play(no)  = 0.3 * outlook(sunny) + 0.1 * outlook(overcast) + 0.7 * outlook(rainy)
          + 0.2 * temp(hot) + 0.1 * temp(mild) + 0.3 * temp(cool)
          + 0.7 * humidity(high) + 0.1 * humidity(normal)
          + 0.3 * windy(false) + 0.8 * windy(true)

(d) Linear regression equations

Figure 1-9: Three classification models: (a) a sample classification dataset, (b) classification (IF-THEN) rules, (c) a decision tree, and (d) mathematical formulae (linear equations).



Figure 1-10: An example of an artificial neural network.

1.6.2. Prediction: Predicting the value for an object

While classification predicts categorical (discrete, unordered) labels, prediction forecasts continuous-valued functions. Also called numeric prediction, it is applied to estimate missing or unavailable numerical data values rather than class labels. Intuitively, the term ‘prediction’ covers both numeric prediction and class label prediction. However, in much of the literature it refers to numeric prediction, while label prediction is called classification. Prediction also encompasses the identification of distribution trends based on the available data. For example, it is possible to predict the potential sales amount of a product given its price, or the performance of a computer given its components.

Regression analysis, a statistical methodology developed by Sir Francis Galton (1822–1911), is most often used for numeric prediction. Although other methods for numeric prediction exist, scientists and researchers often use the terms “regression” and “numeric prediction” synonymously. It is also possible to apply classification techniques (such as artificial neural networks, decision trees, support vector machines, and k-nearest-neighbor classifiers) to prediction and, conversely, numeric prediction techniques (e.g., regression analysis) to classification. Regression analysis aims to model the relationship between one or more independent or predictor variables, which are discrete- or continuous-valued, and a dependent or response variable, which is continuous-valued. In data mining or knowledge discovery, the predictor variables are the attributes of interest describing each training tuple (example), possibly in the form of an attribute vector. In general, the values of the predictor variables are known. Even if some of them are missing, statistical techniques can be applied to recover and handle such cases. The response variable, on the other hand, is the target value to be predicted, also called the predicted attribute. Some common models for prediction are linear regression, regression trees and model trees, as shown in Figure 1-11 (b)-(d) for the sample dataset in Figure 1-11 (a), where ‘Play’ is the target value and the four preceding variables (‘Outlook’, ‘Temp’, ‘Humidity’ and ‘Windy’) are predictor variables.
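As an illustration of the regression-tree form, the sketch below routes the 14 real-valued tuples of Figure 1-11 (a) through the splits readable from Figure 1-11 (c) (Outlook at the root, then Humidity or Windy) and averages the ‘Play’ values that reach each leaf. It assumes, as is usual for regression trees, that a leaf predicts the mean target of its training tuples; the helper name `leaf_of` and the leaf labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Tuples from Figure 1-11 (a): (outlook, temp, humidity, windy, play)
DATA = [
    (90, 40, 80, 10, 5), (95, 32, 85, 80, 10), (50, 35, 90, 20, 80),
    (10, 24, 80, 5, 95), (15, 10, 50, 15, 85), (20, 12, 55, 90, 15),
    (55, 9, 45, 95, 80), (85, 22, 95, 25, 10), (95, 7, 50, 5, 100),
    (5, 26, 45, 10, 85), (80, 25, 40, 80, 95), (45, 24, 85, 85, 90),
    (40, 37, 60, 15, 75), (25, 23, 90, 95, 20),
]


def leaf_of(outlook, humidity, windy):
    """Return a label for the leaf that a tuple reaches (temp is never tested)."""
    if outlook > 75:
        return "outlook > 75, humidity >= 65" if humidity >= 65 else "outlook > 75, humidity < 65"
    if outlook < 35:
        return "outlook < 35, windy < 50" if windy < 50 else "outlook < 35, windy >= 50"
    return "outlook 35-75"


# Collect the target values of the training tuples that fall into each leaf.
targets = defaultdict(list)
for outlook, _temp, humidity, windy, play in DATA:
    targets[leaf_of(outlook, humidity, windy)].append(play)

# Assumption: a regression-tree leaf predicts the average target of its tuples.
leaf_values = {leaf: round(mean(plays), 2) for leaf, plays in targets.items()}
for leaf, value in leaf_values.items():
    print(f"{leaf}: {value}")
```

The five averages coincide with the leaf values printed in Figure 1-11 (c). A model tree has the same structure but stores a linear equation (L1 to L5) in each leaf instead of a constant.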

[Figure 1-10 unit labels. Output units: play=yes, play=no. Input units: outlook=sunny, outlook=overcast, outlook=rainy, temperature=hot, temperature=mild, temperature=cool, humidity=high, humidity=normal, windy=true, windy=false.]



Outlook  Temp.  Humidity  Windy  Play
90       40     80        10     5
95       32     85        80     10
50       35     90        20     80
10       24     80        5      95
15       10     50        15     85
20       12     55        90     15
55       9      45        95     80
85       22     95        25     10
95       7      50        5      100
5        26     45        10     85
80       25     40        80     95
45       24     85        85     90
40       37     60        15     75
25       23     90        95     20

(a) A sample data set (the real-valued Play-Tennis data set)

Play = 149.3 - 0.171*outlook - 0.273*temp - 0.298*humidity - 0.359*windy

(b) Linear regression equation

Outlook
  > 75  -> Humidity
             >= 65 -> 8.33
             < 65  -> 97.5
  35-75 -> 81.25
  < 35  -> Windy
             < 50  -> 88.33
             >= 50 -> 17.5

(c) Regression Tree

Outlook
  > 75  -> Humidity
             >= 65 -> L1
             < 65  -> L2
  35-75 -> L3
  < 35  -> Windy
             < 50  -> L4
             >= 50 -> L5

L1: Play = 14.2 - 0.24*temp + 0.04*windy
L2: Play = 100.3 - 0.07*windy
L3: Play = 53 + 0.29*humidity + 0.15*windy
L4: Play = 69.3 + 0.9*temp + 0.29*humidity
L5: Play = 7.1 + 0.14*humidity

(d) Model Tree

Figure 1-11: Four common forms of numeric prediction models: (a) a sample prediction data set, (b) linear regression equation, (c) regression tree, and (d) model tree.

Both classification and prediction may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or prediction process. These attributes can then be excluded.

1.6.3. Clustering/Deviation Detection: Grouping data/Detecting outliers

Another important activity in the learning process of our daily life is cluster analysis. From childhood, we learn to distinguish between different types of objects, say cats vs. dogs or cars vs. buses, by continuously developing subconscious clustering schemes. Conceptually, we identify dense and sparse regions in object space and then discover overall distribution patterns and interesting correlations among data attributes. Unlike classification and prediction, where each data object (for example, each row in the tables of Figure 1-9 (a) and Figure 1-11 (a)) is assigned a class label or a value (the target variable, which is the rightmost column in the tables), clustering processes data objects without considering a class label. Also called data segmentation in some areas, clustering partitions large data sets into groups according to their similarity. For clustering, class labels do not exist in the training dataset. The aim of clustering is to group data into a set of clusters (groups) and then to give a unique label to each cluster. Normally, the input objects are clustered based on the criterion of maximizing the intraclass similarity and minimizing the interclass similarity. In other words, clusters are generated so that the objects within a cluster are highly similar to one another but highly dissimilar to the objects in other clusters. Semantically, each cluster is recognized as a class of objects. As a by-product, it is possible to detect outliers and their deviation from the norm.
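The grouping criterion can be sketched with a small one-dimensional example. The code below uses k-means, chosen here only as one familiar procedure (the text above does not prescribe an algorithm), to cluster the Outlook column of Figure 1-11 (a) without looking at the ‘Play’ column. The function names and the initial centers 5, 50 and 95 are illustrative assumptions.

```python
# Outlook column of Figure 1-11 (a); the Play column is deliberately ignored.
OUTLOOK = [90, 95, 50, 10, 15, 20, 55, 85, 95, 5, 80, 45, 40, 25]


def assign(values, centers):
    """Put every value into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, centers, max_iter=100):
    """Plain k-means on numbers: assign to the nearest center, then re-average."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = assign(values, centers)
        # An empty cluster keeps its previous center.
        updated = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:  # nothing moved, so the grouping is stable
            break
        centers = updated
    return centers, assign(values, centers)


centers, clusters = kmeans_1d(OUTLOOK, centers=[5, 50, 95])
for center, members in zip(centers, clusters):
    print(center, sorted(members))
```

With these starting centers the procedure settles on three groups whose members lie below 35, between 35 and 75, and above 75. Values inside a group are close to one another and far from the other groups, which is the intraclass/interclass criterion described above.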

At present, cluster analysis is widely used in applications including concept formation, pattern recognition, market research, biological data analysis, land usage analysis, speech processing, document processing and image processing. In concept formation, clustering can be applied to develop concepts based on the common properties of objects, events, or qualities through abstraction and generalization. In pattern recognition, clustering can be used to identify different areas of interest. In market research, clustering can help discover distinct groups among existing customers and then characterize these customer groups based on their purchasing activities or patterns. In biological data analysis, clustering may help biologists derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. In land usage analysis, clustering can be applied to identify areas of similar land usage in an earth observation database or to detect groups of houses in a city according to house type, value, and geographic location, as well as to find groups of automobile insurance policy holders with a high average claim cost. In speech processing, clustering can be used to group similar features (feature vectors) to reduce the variety of features and thereby improve the recognition rate. In document processing, clustering can help classify documents on the WWW for information discovery. In image processing, clustering may help segment an image into groups for further exploration.

Clustering can also be used to detect outliers, which are unusual objects deviating from the norm of any cluster. These outliers are sometimes more interesting than common cases. Some typical applications of outlier detection are the monitoring of criminal activities on the Internet and the detection of credit card fraud in electronic commerce. Alternatively, clustering is often used as a preprocessing step for other algorithms, such as classification, association analysis (or frequent pattern mining), characterization, and attribute subset selection, which then operate on the detected clusters and the selected attributes or features. Major advantages of this clustering-based preprocessing are its adaptability to changes and its help in finding useful features that distinguish different groups.

Figure 1-12 shows various types of cluster representation: (a) the conceptual representation of clustering (three clusters with some outliers), (b) non-overlapping clusters (three clusters), (c) overlapping clusters (three clusters), (d) hierarchical clustering, and (e) probabilistic clustering. Intuitively, clusters usually have no overlap, as shown in Figure 1-12 (b), but in some cases clusters may overlap with one another, as shown in Figure 1-12 (c). Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together, as shown in Figure 1-12 (d). It is also possible to allow objects to be in more than one cluster, or even in all clusters, with some probabilities, as shown in Figure 1-12 (e).
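The probabilistic representation of Figure 1-12 (e) can be read as one membership distribution per object. The sketch below copies that table, checks that each row sums to 1, and lists the most probable cluster(s) for every object. The helper name `most_probable` and the decision to report ties rather than break them are assumptions made for illustration.

```python
# Membership probabilities from Figure 1-12 (e): object -> (cluster 1, 2, 3)
MEMBERSHIP = {
    "a": (0.8, 0.1, 0.1), "b": (0.4, 0.4, 0.2), "c": (0.1, 0.5, 0.4),
    "d": (0.1, 0.8, 0.1), "e": (0.5, 0.3, 0.2), "f": (0.4, 0.5, 0.1),
    "g": (0.4, 0.2, 0.4), "h": (0.2, 0.1, 0.7), "i": (0.7, 0.1, 0.2),
    "j": (0.1, 0.1, 0.8), "k": (0.2, 0.7, 0.1),
}


def most_probable(probs):
    """Return every cluster number (1-based) that shares the highest probability."""
    best = max(probs)
    return [i + 1 for i, p in enumerate(probs) if p == best]


# Each object's probabilities should form a distribution over the clusters.
rows_sum_to_one = all(abs(sum(p) - 1.0) < 1e-9 for p in MEMBERSHIP.values())

hard = {obj: most_probable(p) for obj, p in MEMBERSHIP.items()}
for obj, best_clusters in hard.items():
    print(obj, best_clusters)
```

Objects b and g are tied between two clusters, which shows why a probabilistic representation carries more information than a single hard assignment.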



(a) Conceptual representation: three clusters with some outliers

(b) An example: three non-overlapping clusters with their elements

(c) Three overlapping clusters

(d) Hierarchical Clustering

[Panels (a)-(d) are drawings of the eleven objects a-k grouped into Cluster 1, Cluster 2 and Cluster 3; the dendrogram in (d) lists the objects in the order d, k, b, a, e, i, g, f, c, j, h.]

Object  Cluster 1  Cluster 2  Cluster 3
a       0.8        0.1        0.1
b       0.4        0.4        0.2
c       0.1        0.5        0.4
d       0.1        0.8        0.1
e       0.5        0.3        0.2
f       0.4        0.5        0.1
g       0.4        0.2        0.4
h       0.2        0.1        0.7
i       0.7        0.1        0.2
j       0.1        0.1        0.8
k       0.2        0.7        0.1

(e) Probabilistic Clustering

Figure 1-12: Various types of cluster representation

1.6.4. Association Analysis: Finding frequent co-occurrences

Co-occurrence analysis aims to find frequent patterns, that is, patterns that occur frequently in data. There are several kinds of frequent patterns, including itemsets, subsequences, and substructures. Typically, a frequent itemset refers to a set of items that frequently appear together in a transactional database, such as coffee and sugar. A frequent subsequence is a sequence of items that often occur in the same order, such as the pattern that customers tend to purchase a notebook first, followed by a digital camera, and then a memory card. A frequent substructure refers to structural forms, such as graphs, trees, or lattices, that occur often. A substructure may be a combination of itemsets or subsequences. Mining frequent patterns leads to the discovery of interesting associations and correlations within data. For example, a rule found in the sales data of a supermarket might indicate that if a customer buys onions and potatoes together, he or she is likely to buy beef. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placement.

In addition to the above example from market basket analysis on retailing databases, association analysis is employed today in many application areas, including web usage mining, intrusion detection and bioinformatics. Web usage mining is the application of association analysis to discover and analyze interesting patterns in users’ usage data from web logs, both incoming logs (web server logs) and outgoing logs (proxy logs). Since the Internet became a powerful tool for communication and collaboration, many organizations have run web servers that generate and collect large volumes of data in server logs. These logs keep track of users’ behavior as they browse or make transactions on the web site. By analyzing frequent access patterns, it is possible to provide better services, including targeted or cross marketing, customer relationship management, and pricing and promotional campaigns, according to the needs of users or web-based applications. The most naïve application is a web analysis tool that simply reports user activity as recorded in the servers, such as the number of accesses to the server, the times or time intervals of visits, and the domain names and URLs of users of the web server. Beyond such simple applications, more sophisticated pattern discovery and pattern analysis tools have emerged to discover and analyze interesting patterns. As another potential application of association analysis, a network intrusion detection system (IDS) can provide a security protection mechanism complementary to the firewall. Since it can discern and respond to hostile behavior against computer and network resources, this topic has recently become a hot area in network security.

Recently, a great amount of research has applied data mining to biology. Bioinformatics, also called biocomputing or computational biology, is a promising young field that uses computers and information technology in molecular biology and develops algorithms and methods to manage and analyze huge volumes of biological data of any type, from individual molecules to organisms to overall ecology. Major applications include sequence analysis, genomics, and proteomics. Sequence analysis studies molecular sequence data to infer the function, interactions, evolution, and perhaps structure of biological molecules. It is important to develop effective methods to compare and align biological sequences and discover biosequence patterns. Genomics involves the analysis of the context of genes or complete genomes (the total DNA content of an organism) within the same genome and/or across different genomes. Proteomics is the sub-field of genomics concerned with analyzing the complete protein complement, i.e., the proteome, of organisms, both within and between different organisms. Mining semantics from DNA and protein sequences, which are long linear chains of chemical components, is a challenging issue. An automatic alignment of DNA or protein sequences lines up the sequences to achieve a maximal level of identity, or the highest degree of similarity, between them. Two sequences are homologous if they share a common ancestor. The degree of similarity obtained by sequence alignment can be useful in determining the possibility of homology between two sequences. Such an alignment also helps determine the relative positions of multiple species in an evolution tree, which is called a phylogenetic tree.

Figure 1-13 shows an example of association rule mining in a toy transactional database from the retail industry. The example consists of six sales transactions, shown in two formats: transaction format (a) and relational format (b). Given a minimum support of 50% and a minimum confidence of 66.67%, the mined frequent itemsets and association rules are shown in (c) and (d), respectively. As another example, Figure 1-14 displays a temporal sales summary of three products over seven time points (t0 to t6). The original format (a) can be transformed to the transactional database format (b). Given a minimum support of 50% and a minimum confidence of 66.67%, the mined frequent itemsets and association rules are shown in (c) and (d), respectively.
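The support and confidence figures of Figure 1-13 can be recomputed with a few lines of Python. The sketch below takes the six transactions of Figure 1-13 (a), enumerates every itemset by brute force (a deliberately naive stand-in, not an efficient mining algorithm such as Apriori), and keeps those that meet the 50% minimum support. The function names are illustrative assumptions.

```python
from itertools import combinations

# The six sales transactions of Figure 1-13 (a).
TRANSACTIONS = [
    {"coke", "ice", "paper", "shoes", "water"},
    {"ice", "orange", "shirt", "water"},
    {"paper", "shirt", "water"},
    {"coke", "orange", "paper", "shirt", "water"},
    {"ice", "orange", "shirt", "shoes", "water"},
    {"paper", "shirt", "water"},
]
MIN_SUPPORT = 0.5


def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(set(itemset) <= t for t in TRANSACTIONS)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(TRANSACTIONS)


def confidence(left, right):
    """Of the transactions containing `left`, the fraction that also contain `right`."""
    return count(set(left) | set(right)) / count(left)


items = sorted(set().union(*TRANSACTIONS))
frequent = [
    combo
    for size in range(1, len(items) + 1)
    for combo in combinations(items, size)
    if support(combo) >= MIN_SUPPORT
]

print(len(frequent), "frequent itemsets")
print("ice -> water:", support({"ice", "water"}), confidence({"ice"}, {"water"}))
print("paper -> shirt:", support({"paper", "shirt"}), confidence({"paper"}, {"shirt"}))
```

For the rule ice -> water this gives support 0.5 and confidence 1.0, and for paper -> shirt support 0.5 and confidence 0.75, the same values as in Figure 1-13 (d). A rule is reported only if its confidence also reaches the chosen minimum confidence.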



Transaction ID  Items
1               coke, ice, paper, shoes, water
2               ice, orange, shirt, water
3               paper, shirt, water
4               coke, orange, paper, shirt, water
5               ice, orange, shirt, shoes, water
6               paper, shirt, water

(a) A simple transactional database for retailing

Transaction ID  coke  ice  orange  paper  shirt  shoes  water
1               1     1    0       1      0      1      1
2               0     1    1       0      1      0      1
3               0     0    0       1      1      0      1
4               1     0    1       1      1      0      1
5               0     1    1       0      1      1      1
6               0     0    0       1      1      0      1

(b) The relational table corresponding to the above transactional database

Itemset               Trans        Freq.
coke                  1,4          2
ice                   1,2,5        3
orange                2,4,5        3
paper                 1,3,4,6      4
shirt                 2,3,4,5,6    5
shoes                 1,5          2
water                 1,2,3,4,5,6  6
ice, orange           2,5          2
ice, paper            1            1
ice, shirt            2,5          2
ice, water            1,2,5        3
orange, paper         4            1
orange, shirt         2,4,5        3
orange, water         2,4,5        3
paper, shirt          3,4,6        3
paper, water          1,3,4,6      4
shirt, water          2,3,4,5,6    5
orange, shirt, water  2,4,5        3
paper, shirt, water   3,4,6        3

(c) Frequent itemsets (minimum support = 3 (i.e., 0.50 or 50%))

Left Items (L)  Right Items (R)  support     confidence
ice             water            3/6 = 0.50  3/3 = 1.00
water           ice              3/6 = 0.50  3/6 = 0.50
orange          shirt            3/6 = 0.50  3/3 = 1.00
shirt           orange           3/6 = 0.50  3/5 = 0.60
orange          water            3/6 = 0.50  3/3 = 1.00
water           orange           3/6 = 0.50  3/6 = 0.50
paper           shirt            3/6 = 0.50  3/4 = 0.75
shirt           paper            3/6 = 0.50  3/5 = 0.60
paper           water            4/6 = 0.67  4/4 = 1.00
water           paper            4/6 = 0.67  4/6 = 0.67
shirt           water            5/6 = 0.83  5/5 = 1.00
water           shirt            5/6 = 0.83  5/6 = 0.83
orange, shirt   water            3/6 = 0.50  3/3 = 1.00
orange, water   shirt            3/6 = 0.50  3/3 = 1.00
shirt, water    orange           3/6 = 0.50  3/5 = 0.60
water           orange, shirt    3/6 = 0.50  3/6 = 0.50
shirt           orange, water    3/6 = 0.50  3/5 = 0.60
orange          shirt, water     3/6 = 0.50  3/3 = 1.00
paper, shirt    water            3/6 = 0.50  3/3 = 1.00
paper, water    shirt            3/6 = 0.50  3/4 = 0.75
shirt, water    paper            3/6 = 0.50  3/5 = 0.60
water           paper, shirt     3/6 = 0.50  3/6 = 0.50
shirt           paper, water     3/6 = 0.50  3/5 = 0.60
paper           shirt, water     3/6 = 0.50  3/4 = 0.75

(d) Association rules (minimum confidence = 0.67 or 66.67%)

Figure 1-13: Association Rules with Minimum Support = 0.5 and Minimum Confidence = 0.67



        t0    t1    t2    t3    t4    t5    t6
pepsi   up    -     up    -     up    -     -
shirt   down  down  up    down  up    -     up
water   -     -     down  down  -     down  -

(a) A toy temporal database (t: time)

Transaction    Items
t0(L) + t2(R)  pepsiL-up, shirtL-down, pepsiR-up, shirtR-up, waterR-down
t1(L) + t3(R)  shirtL-down, shirtR-down, waterR-down
t2(L) + t4(R)  pepsiL-up, shirtL-up, waterL-down, pepsiR-up, shirtR-up
t3(L) + t5(R)  shirtL-down, waterL-down, waterR-down
t4(L) + t6(R)  pepsiL-up, shirtL-up, shirtR-up

(b) The corresponding transactional database when time gap is fixed to 2

Itemset                   Freq.
pepsiL-up                 3
shirtL-down               3
shirtL-up                 2
waterL-down               2
pepsiR-up                 2
shirtR-down               1
shirtR-up                 3
waterR-down               3
pepsiL-up, shirtR-up      3
pepsiL-up, waterR-down    1
shirtL-down, shirtR-up    1
shirtL-down, waterR-down  3

(c) Temporal frequent itemsets (minimum support = 2.5 (i.e., 0.50 or 50%))

Left Items   Right Items  support     confidence
pepsiL-up    shirtR-up    3/5 = 0.60  3/3 = 1.00
shirtL-down  waterR-down  3/5 = 0.60  3/3 = 1.00

(d) Temporal association rules (minimum confidence = 0.50 or 50%)

Figure 1-14: Temporal Association rules
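The transformation from the temporal table in Figure 1-14 (a) to the transactions in Figure 1-14 (b) can be sketched as follows. The code pairs each time point t (left side, L) with time point t + 2 (right side, R) and skips "-" entries, which it treats as "no item"; that reading of "-", like the function names, is an assumption made for this sketch.

```python
# Trend table from Figure 1-14 (a), one list per product for t0..t6.
TRENDS = {
    "pepsi": ["up", "-", "up", "-", "up", "-", "-"],
    "shirt": ["down", "down", "up", "down", "up", "-", "up"],
    "water": ["-", "-", "down", "down", "-", "down", "-"],
}


def to_transactions(trends, gap):
    """Pair every time t (left, L) with time t + gap (right, R)."""
    length = len(next(iter(trends.values())))
    transactions = []
    for t in range(length - gap):
        left = [f"{name}L-{moves[t]}" for name, moves in trends.items() if moves[t] != "-"]
        right = [
            f"{name}R-{moves[t + gap]}"
            for name, moves in trends.items()
            if moves[t + gap] != "-"
        ]
        transactions.append(left + right)
    return transactions


transactions = to_transactions(TRENDS, gap=2)


def count(*items):
    """Number of transactions that contain all of the given items."""
    return sum(all(item in t for item in items) for t in transactions)


# Rule pepsiL-up -> shirtR-up: shirt sales rise two steps after pepsi sales rise.
support = count("pepsiL-up", "shirtR-up") / len(transactions)
confidence = count("pepsiL-up", "shirtR-up") / count("pepsiL-up")
print(transactions[0])
print(support, confidence)
```

Once the table is in this form, any ordinary frequent-itemset miner can be applied unchanged; the L and R suffixes are what carry the temporal order into the rules.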

1.6.5. Characterization and Discrimination: Describing a class or concept

While classification, prediction, clustering and association mining share the purpose of automating knowledge discovery, characterization and discrimination instead aim to support users in finding knowledge semi-automatically by providing summarized or contrasting information about classes or concepts. It is useful to describe individual classes and concepts in summarized, concise, and yet precise terms. These descriptions can be obtained by data characterization, which summarizes the data of the class in general terms; by data discrimination, which compares the target class with one or a set of comparative classes; or by a combination of the two.

Data characterization refers to a summarization of the general characteristics or features of a target class of data. For example, to study the characteristics of food products whose sales decreased by 10% in the last year, the data related to such products can be collected by executing an SQL query. One effective method for data characterization (summarization) is the cube-based OLAP (OnLine Analytical Processing) roll-up operation, which performs user-controlled data summarization along a specified dimension. The output of data characterization can be presented in various forms, such as line graphs or curves, pie charts, bar charts, multidimensional data cubes, and multidimensional tables with crosstabs. As more generalized forms, the resulting descriptions can also be presented as generalized relations or in rule form.
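A roll-up can be sketched as summing a small data cube over the dimensions that are dropped. The example below uses the January to March 2011 sales of Product A and Product B by city (BKK, TKO) from the crosstab in Figure 1-16 (b); the dictionary layout and the function name `roll_up` are illustrative assumptions and do not represent an actual OLAP engine.

```python
from collections import defaultdict

DIMENSIONS = ("product", "month", "city")

# First-quarter cells copied from Figure 1-16 (b): (product, month, city) -> sales ($)
CUBE = {
    ("A", "Jan", "BKK"): 2000, ("A", "Jan", "TKO"): 2400,
    ("A", "Feb", "BKK"): 3000, ("A", "Feb", "TKO"): 2000,
    ("A", "Mar", "BKK"): 1500, ("A", "Mar", "TKO"): 3000,
    ("B", "Jan", "BKK"): 1200, ("B", "Jan", "TKO"): 3200,
    ("B", "Feb", "BKK"): 2000, ("B", "Feb", "TKO"): 2800,
    ("B", "Mar", "BKK"): 2000, ("B", "Mar", "TKO"): 2400,
}


def roll_up(cube, keep):
    """Sum the cube over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for cell, amount in cube.items():
        totals[tuple(cell[i] for i in positions)] += amount
    return dict(totals)


by_product_month = roll_up(CUBE, keep=("product", "month"))  # city rolled up
by_product = roll_up(CUBE, keep=("product",))                # city and month rolled up
print(by_product_month)
print(by_product)
```

Rolling up the city dimension reproduces the TOTAL columns of the crosstab, for instance 4400 for Product A in January. Rolling up the month dimension as well leaves one summary figure per product.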



In contrast, data discrimination is a comparison of the general features of target class data

objects with the general features of objects from one or a set of contrasting classes. The target

and contrasting classes can be specified by the user, and the corresponding data objects retrieved

through database queries for comparison. For example, the user may like to compare the general

features of food products whose sales decreased by 10% in the last year with those whose sales

increased by at least 30% during the same period. The methods used for data discrimination are

similar to those used for data characterization. While the forms of output presentation for

discrimination descriptions may be similar to those for characteristic descriptions, the

discrimination descriptions should include comparative measures that help distinguish between

the target and contrasting classes. Figure 1-15 shows (a) a sample set of sales data, (b) line graph

and (c) pie graph. In addition, Figure 1-16 displays its (a) bar graph and (b) multidimensional

tables with crosstabs.

Product Sales ($)

                Product A  Product B  Product C  Product D
January 2011    2000       1200       1800       2200
February 2011   3000       2000       2500       4000
March 2011      1500       2000       3200       4400
April 2011      2400       3000       1200       3700
May 2011        2500       1500       800        2000
June 2011       3100       3200       1100       4000
July 2011       2400       2800       2400       4200
August 2011     1200       800        1000       1800
September 2011  3500       2500       2000       3200
October 2011    4000       2000       1400       3500
November 2011   2000       3000       2400       4500
December 2011   2400       2100       1800       3200

(a) A sample set of sales data

(b) Line graph [vertical axis: sales from 0 to 4500; series: Product A, Product B, Product C, Product D]

(c) Pie graph [January 2011: Product A 2000, Product B 1200, Product C 1800, Product D 2200]

Figure 1-15: Data characterization and discrimination (1): (a) a sample set of sales data, (b) line graph and (c) pie graph.



(a) bar graph

Product Sales ($)

                Product A            Product B
                BKK    TKO    TOTAL  BKK    TKO    TOTAL
January 2011    2000   2400   4400   1200   3200   4400
February 2011   3000   2000   5000   2000   2800   4800
March 2011      1500   3000   4500   2000   2400   4400
April 2011      2400   2200   4600   3000   2000   5000
May 2011        2500   2600   5100   1500   2500   4000
June 2011       3100   3500   6600   3200   2800   6000
July 2011       2400   2200   4600   2800   2500   5300
August 2011     1200   1000   2200   800    3000   3800
September 2011  3500   3000   6500   2500   2800   5300
October 2011    4000   4500   8500   2000   3400   5400
November 2011   2000   2500   4500   3000   3200   6200
December 2011   2400   2600   5000   2100   2800   4900

(b) multidimensional tables with crosstabs

Figure 1-16: Data characterization and discrimination (2): (a) bar graph, and (b) multidimensional tables with crosstabs.

1.6.6. Meta Functionalities: Link, Outlier, and Trend/Evolutional Analysis

The classical functions, i.e., classification, prediction, clustering, association mining,

characterization, and discrimination, can be applied in more application-oriented tasks, including

link analysis, outlier analysis and trend/evolutional analysis. As an instance of multi-relational

data mining, link mining encompasses a range of tasks in both descriptive and predictive

modeling. It may use classification, clustering and association to mine knowledge from linked

relational domains. With the introduction of links, there are potential new tasks, such as inferring

the existence of a link, predicting the numbers of links, predicting the type of link between two

objects, finding co-references, and discovering subgraph patterns. A more recent application of

link analysis is mining in social networks, where relationships between people are represented

as links in a graph. Link-based object classification predicts the category of an object based not

only on its attributes, but also on its links, and on the attributes of linked objects. Web page

classification is a well-recognized example of link-based classification. It predicts the category of


a web page based on word occurrence (words that occur on the page) and anchor text (the

hyperlink words, that is, the words you click on when you click on a link), both of which serve as

attributes. A similar approach can also be applied to classify articles in journal publication

databases using bibliography information. A classification task is to predict the topic of a paper

based on word occurrence, citations (other papers that cite the paper), and co-citations (other

papers that are cited within the paper), where the citations act as links. In many cases, a database

may contain data objects that do not comply with the common behavior or model of the data.

These data objects are known as outliers. Such deviating data are often treated as noise or

exceptions. However, in some applications, such as fraud detection, such rare events can be more

interesting than the regular ones. Outlier mining typically applies statistical tests to find

objects that diverge substantially from the norm of every cluster and then treats them as

outliers. Besides statistical or distance measures, deviation-based methods offer an alternative:

they identify outliers by examining differences in the main characteristics of objects

in a group. Trend and evolution analysis describes and models regularities or trends for objects

whose behavior changes over time. Although this may include characterization, discrimination,

association and correlation analysis, classification, prediction, or clustering of time-related data,

distinct features of such an analysis include time-series data analysis, sequence or periodicity

pattern matching, and similarity-based data analysis.
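The statistical approach to outlier mining described above can be sketched with a simple z-score test. This is only one of many possible tests; the transaction amounts, the function name `zscore_outliers`, and the threshold of 2.0 are all assumptions made for illustration.

```python
from statistics import mean, stdev

# Hypothetical transaction amounts; 950 is the deliberately unusual value.
AMOUNTS = [120, 135, 110, 128, 140, 125, 132, 118, 122, 950]


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` sample standard deviations
    from the mean. Requires at least two values (a limitation of stdev)."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


print(zscore_outliers(AMOUNTS))
```

Because an extreme value also inflates the mean and standard deviation it is compared against, a z-score test on a small sample can miss outliers at stricter thresholds; tests based on the median are a common alternative.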

1.6.7. Visualization

Visualization is an important step in using the results of data mining or knowledge

discovery efficiently. It is a set of techniques for constructing visual objects,

including images, diagrams, and animations, to communicate abstract or concrete ideas, concepts,

thoughts, and findings to a specific audience or to the public. Currently,

visualization has ever-expanding applications in science, education, engineering (e.g., product

visualization), interactive multimedia, medicine, chemical informatics, bioinformatics, and the field

of computer graphics. Recently, the development of animation has enabled more advanced

visualization, especially for producing interactive and dynamic content.

Two levels of visualization in the field of data mining/knowledge discovery are information

visualization and knowledge visualization. Information visualization is the use of

computer-aided tools to explore large amounts of abstract data so that we can grasp the

shape of the information more efficiently and effectively. This area was originally put forward by the User

Interface Research Group at Xerox PARC. Major steps in information visualization are selecting,

transforming, and representing abstract data in a form that facilitates human interaction for

exploration and understanding. Two main aspects of information visualization are the dynamics of

the visual representation and its interactivity. More advanced techniques allow users to modify the

visualization format or pattern in real time, giving them multiple views of the

patterns and structural relations in the abstract data in question.

Analogous to information visualization, knowledge visualization uses visual aids and tools, together with

complementary computer-based and non-computer-based visualization methods, to convey or

communicate knowledge among people and thereby improve the transfer of knowledge.

Visual formats include sketches, diagrams, images, objects, interactive

visualizations, etc. While information visualization makes use of computer-supported tools to

derive new insights from facts or data, knowledge visualization aims to transfer insights and

create new knowledge in groups, including patterns, rules, prediction results, experiences,

attitudes, values, perspectives, opinions and expectations by using various complementary

visualizations.


(a) Mosaic plot (1 dimension): Sex (Male, Female)

(b) Mosaic plot (2 dimensions): Sex (Male, Female) by Nationality (Oversea, Local)

(c) Mosaic plot (3 dimensions): Sex (Male, Female) and age group (Young, Mid, Aged) by Nationality (Oversea, Local)

Figure 1-17: Mosaic plots for representing association mining results

(a) 3D Bar Graph

Left-Hand-Side   Right-Hand-Side   Support   Confidence
orange           fish              0.012     0.75
banana           shoes, shirt      0.014     0.85
fish, meat       television        0.015     0.67
milk             rice              0.102     0.56
shirt            pork              0.105     0.25
apple orange     water             0.043     0.91
television       oven, radio       0.025     0.67

(b) Simple representation of association rules

Figure 1-18: 3D bar graph and simple table representation of association mining results

As examples, knowledge in the form of association mining results can be visualized

through mosaic plots, as shown in Figure 1-17, or through a 3D bar graph or a simple table, as

shown in Figure 1-18. Mosaic plots display association rules in a 2D plane in which each box

shows a co-occurrence frequency. They directly illustrate interestingness measures, showing

differences in confidence. An association rule can also be expressed in a 3D bar graph where the x

axis indicates the rule’s left-hand-side, the y axis displays the rule’s right-hand-side, the z axis

shows the rule’s support level, and a bar in this 3D space indicates the rule’s support by height

and the rule’s confidence by color. A naïve representation for an association rule is in the form of a

table, where the first column indicates the rule’s left-hand-side, the second column shows the

rule’s right-hand-side, and the next two columns illustrate the rule’s support and confidence,

respectively.
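The support and confidence columns of such a rule table can be computed directly from transactions. The sketch below assumes the usual definitions (support as the fraction of transactions containing both sides, confidence as that count divided by the count of transactions containing the left-hand side); the transactions and function names are hypothetical.

```python
# Hypothetical market-basket transactions, one set of items per transaction.
TRANSACTIONS = [
    {"milk", "rice"},
    {"milk", "rice", "fish"},
    {"milk", "bread"},
    {"rice", "fish"},
    {"milk", "rice", "bread"},
]


def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return count(transactions, set(lhs) | set(rhs)) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions containing lhs, the fraction also containing rhs."""
    lhs_count = count(transactions, lhs)
    if lhs_count == 0:
        return 0.0
    return count(transactions, set(lhs) | set(rhs)) / lhs_count


# One row of the table form: left-hand-side, right-hand-side, support, confidence.
lhs, rhs = ["milk"], ["rice"]
print(lhs, rhs, support(TRANSACTIONS, lhs, rhs), confidence(TRANSACTIONS, lhs, rhs))
```

Item sets are passed as lists or sets rather than bare strings so that an item name is never split into characters.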

Besides visualization of mining results, a more general functionality is visual data mining,

which integrates data mining and data visualization in order to discover implicit and useful

knowledge from large data sets. It includes data visualization, mining result visualization, mining

process visualization, and interactive visual data mining. Audio data mining uses audio signals to

indicate data patterns or features of data mining results. Image data mining displays images as

data patterns or mined results.



1.7. Challenges in Data Mining and Text Mining

In this section, we describe some major considerations in data mining and text mining related to

data and knowledge heterogeneity, mining techniques, user interaction, and mining performance.

Several of them are current and future research topics.

Data and Knowledge Heterogeneity

Nowadays electronic data collection technology and database technology produce enormous

amounts of data stored in several types of databases, data warehouses, and other data

repositories. While some generated data, such as those in relational tables or transaction

tables, are simple, some data are complex data objects, e.g., hypertext and multimedia data,

spatial data, or temporal data. Although this abundance enables us to track events over a wide area, it also

raises the important issue of how to handle massive data that may arrive

from various sources, in different forms, at different times, and at different granularities. Mining

information and knowledge from such heterogeneous databases

and/or global information systems connected by both local- and wide-area computer networks

(typically the Internet) is a very challenging topic. The discovery of knowledge from different sources of structured,

semi-structured, or unstructured data with diverse data semantics poses great challenges to data

mining. Data mining and knowledge discovery may help reveal high-level data

regularities/patterns in multiple heterogeneous databases (such as text, audio, visual,

multimedia, geographical, and web databases) that are usually hard to detect with simple

query systems, and may thereby improve information exchange and interoperability among users.

Moreover, mining from web contents, web structures, web usages, and web dynamics has become

one of the most challenging and fast-evolving fields in data mining and knowledge discovery. Besides

heterogeneity in input data, the discovered output, i.e., knowledge or patterns, can take various

primitive forms and often should be in a mixed format. In general, the output format depends on

the type of data analysis and knowledge discovery tasks, e.g., data characterization, data

discrimination, classification, prediction, clustering, association and correlation analysis, outlier

analysis, and evolution analysis. Based on the type of the tasks, numerous different data mining

techniques have been developed. Clearly, it is unrealistic to expect a single system to mine all

kinds of data with diverse data types and different mining goals. Therefore, specific data mining

systems are required for mining specific kinds of data. However, it is necessary to consider

appropriate methods to combine or integrate results from different data mining systems to

achieve higher-level goals that are usually complicated.

Interactive Mining and Visualization

Besides traditional static mining, where single-pass mining is often assumed,

interactive mining of knowledge is needed in several situations, including mining at multiple

levels of abstraction, mining at different time periods or granularities (e.g., day, week, or year), and

mining from different points of view. Since it is difficult to know exactly what can be discovered

within a database, it is practical to carry out the data mining process in an interactive

manner. For large-scale databases, sampling techniques may be used to facilitate

interactive data exploration and to reduce computational complexity. Such interactive

mining allows users to narrow down the search to find more specific patterns or to refine data mining results

based on newly provided user requests. Novel, useful, and previously unknown knowledge can typically be found

by drilling down, rolling up, slicing and dicing, and pivoting through both the data space and

the knowledge space interactively, as OLAP (online analytical processing) does on

data cubes. According to this concept, the user can interact with the data mining system to view



data and discovered patterns at multiple granularities and from different angles. Interactive

data mining may also need a query language for ad hoc queries. Analogous to

SQL (structured query language) used for querying relational databases, a high-level data

mining query language needs to be developed to enable users to describe

ad hoc data mining tasks. The language will be used to represent the specification of the relevant sets of

data for analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions

and constraints to be enforced on the discovered patterns. Moreover, it should be incorporated

into a database query language and optimized for efficient and flexible interactive mining.
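The OLAP-style operations mentioned above (rolling up and slicing) can be sketched on a tiny data cube held in a dictionary. The cell values are the January to April 2011 figures for Product A and Product B from Figure 1-15(a); the month codes and the helper names `quarter_of`, `roll_up`, and `slice_by_product` are assumptions made for this example, not part of any particular OLAP system.

```python
from collections import defaultdict

# (month, product) -> sales ($); January to April 2011 from Figure 1-15(a).
SALES = {
    ("2011-01", "Product A"): 2000, ("2011-01", "Product B"): 1200,
    ("2011-02", "Product A"): 3000, ("2011-02", "Product B"): 2000,
    ("2011-03", "Product A"): 1500, ("2011-03", "Product B"): 2000,
    ("2011-04", "Product A"): 2400, ("2011-04", "Product B"): 3000,
}


def quarter_of(month):
    """Map 'YYYY-MM' to 'YYYY-Qn', the coarser time granularity."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"


def roll_up(cells):
    """Roll up: aggregate month-level cells into quarter-level cells."""
    out = defaultdict(int)
    for (month, product), value in cells.items():
        out[(quarter_of(month), product)] += value
    return dict(out)


def slice_by_product(cells, product):
    """Slice: fix the product dimension and keep only the time dimension."""
    return {time: value for (time, p), value in cells.items() if p == product}


by_quarter = roll_up(SALES)
print(by_quarter)
print(slice_by_product(by_quarter, "Product A"))
```

Drilling down is the reverse direction: returning from `by_quarter` to the month-level cells in `SALES`.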

For efficient interactive mining, the presentation and visualization of input data, queries, and

mining results, in place of purely text-based input data, queries, and results,

is an essential component. Representing input data graphically helps us perceive the input

data directly and improves its interpretation. Visualizing a query

simplifies the process by which users query the system to discover what they intend to find.

With this support, even novice computer users can utilize the system. Moreover,

discovered knowledge should be expressed in visual representations or other expressive forms

so that the knowledge can be easily understood and directly usable by humans. This requires the

system to adopt expressive knowledge representation techniques, such as trees, tables, crosstabs,

matrices, rules, graphs, charts, or curves. For text mining, it may be in the form of document

graphs, phrase graphs, named entity graphs and so on.

Background Knowledge Exploitation, Data Preprocessing and Pattern Evaluation

Background knowledge, information, or data regarding the domain in focus may be required in

order to guide the discovery process towards what we are looking for, and/or to allow discovered patterns

to be expressed in concise terms and at different levels of abstraction. Domain knowledge related

to databases, such as integrity constraints and deduction rules, can help us focus and speed up a

data mining process, or judge the interestingness of discovered patterns. For example, in a client

information system, we might specify beforehand that rules connecting a client’s address and

telephone number are not interesting since they are trivial. This property relates to the normalization

process in relational databases. Relational candidate keys can be background knowledge in this

case. As another case, in a market basket database, we might define the interestingness of a

rule as proportional to the occurrence frequency of the rule multiplied by a function

of the prices of the items. This assigns higher interestingness to frequent rules

that involve expensive items. Generally, there are several methods to introduce background

knowledge to the system. A practical system should provide a function that lets users apply such

knowledge (application-dependent criteria) to distinguish interesting patterns from trivial ones.
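The market basket example above, where interestingness is the rule's occurrence frequency multiplied by a function of item prices, might be sketched as follows. The text does not specify the price function, so the average price used here is an assumption, as are the prices, the rules, and the function names.

```python
# Hypothetical unit prices and rules (left-hand-side, right-hand-side, frequency).
PRICES = {"milk": 2.0, "rice": 3.0, "television": 500.0, "oven": 200.0}
RULES = [
    (("milk",), ("rice",), 0.102),
    (("television",), ("oven",), 0.025),
]


def average_price(items, prices):
    """One possible price function: the mean price of the items in the rule."""
    return sum(prices[i] for i in items) / len(items)


def interestingness(frequency, items, prices, price_fn=average_price):
    """Frequency multiplied by a user-chosen function of the item prices."""
    return frequency * price_fn(items, prices)


# Rank rules so that frequent rules involving expensive items come first.
ranked = sorted(
    RULES,
    key=lambda rule: interestingness(rule[2], rule[0] + rule[1], PRICES),
    reverse=True,
)
for lhs, rhs, freq in ranked:
    print(lhs, "->", rhs, round(interestingness(freq, lhs + rhs, PRICES), 3))
```

Passing the price function as a parameter keeps the application-dependent criterion in the user's hands, which is the role the text assigns to background knowledge.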

As a specific application, it is possible to exploit background knowledge to handle noisy or

incomplete data. While the data stored in a database may reflect noise, exceptional cases, or

incomplete data objects, knowledge related to the regularity of data can help us to clean noisy

data, to handle exceptional data, or to complement incomplete data. By this preprocessing, when

mining data regularities, we can avoid overfitting of the data and then improve the quality of the

discovered patterns. Data cleaning methods and data analysis methods can handle noise, while

outlier mining methods can cope with the discovery and analysis of exceptional cases.

With the help of background knowledge, it is possible to evaluate patterns to find the

interesting ones. Normally a data mining system can discover thousands (or up to hundreds of

thousands) of patterns. Naturally, many of them are not interesting to the given user, either

because they represent common knowledge or because they lack novelty. Several challenges focus on

developing techniques to assess interestingness of discovered patterns, with background

knowledge as subjective measures representing the value of patterns based on user expectations



or beliefs. The background knowledge in the form of user-specified constraints or interestingness

expectation can be used to guide the discovery process and then reduce the search space.

Overfitting

In the fields of statistics and machine learning, overfitting occurs when a statistical or learned model

describes random error or noise instead of the underlying relationship or knowledge. Overfitting

generally occurs when the model is overtrained and becomes excessively complex, such as

having too many parameters relative to the number of observations. At the conceptual level,

overfitting arises because the criterion used for training the model is not the same as the

criterion used to judge the effectiveness of the model. In particular, a model is trained by

maximizing its performance on some set of training data. However, its effectiveness is

determined not by its performance on the training data but by its capability to perform well on

unseen data. Overfitting occurs when the model begins to

memorize the training data rather than learning to generalize from the underlying trend. As an extreme

example, a simple model can learn to perfectly predict the training data simply by memorizing

the training data in its entirety, but this model will typically fail drastically on unseen data, as it

has not learned to generalize at all. We can mitigate overfitting by stopping

the training of a model at an early stage, before the model becomes too complex and fits

the data used for learning too closely.
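The extreme case described above, a model that memorizes its training data, can be contrasted with a much simpler model in a few lines. Everything here is a hypothetical toy: the one-dimensional data (with one deliberately noisy training point), the lookup-table "memorizer", and the nearest-class-mean classifier standing in for a model that generalizes.

```python
from collections import Counter

# Hypothetical data: label 1 for "large" x; (65, 0) is a noisy training point.
TRAIN = [(10, 0), (20, 0), (30, 0), (35, 0), (40, 0),
         (60, 1), (65, 0), (70, 1), (80, 1), (90, 1)]
TEST = [(5, 0), (15, 0), (25, 0), (45, 0), (55, 1), (75, 1), (85, 1), (95, 1)]


def fit_memorizer(train):
    """Store every training pair; unseen inputs get the majority training label."""
    table = dict(train)
    default = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: table.get(x, default)


def fit_class_means(train):
    """Keep one number per class (its mean) and predict the nearer class mean."""
    means = {}
    for label in {lab for _, lab in train}:
        xs = [x for x, lab in train if lab == label]
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def accuracy(model, data):
    """Fraction of (x, label) pairs the model predicts correctly."""
    return sum(model(x) == label for x, label in data) / len(data)


memorizer = fit_memorizer(TRAIN)
simple = fit_class_means(TRAIN)
print("memorizer:", accuracy(memorizer, TRAIN), accuracy(memorizer, TEST))
print("class means:", accuracy(simple, TRAIN), accuracy(simple, TEST))
```

The memorizer also reproduces the noisy point exactly, so its training accuracy is perfect, while the simpler model gives up that one training point and does better on the unseen data.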

Performance and Scalability

Another important concern in data mining and text mining is performance and scalability of

mining algorithms. Normally, towards effective information extraction from a huge amount of

data in databases, data and text mining algorithms must be efficient and scalable. That is, the

execution time of those algorithms has to be acceptable, even for a large-scale database. Towards

this, a solution may come from development of a more efficient mining algorithm, parallelization

of mining algorithms, incremental mining or constraint-based mining. In many cases, it is

possible to modify original algorithms or introduce new algorithms to speed up or scale up

mining on a large-scale database, using appropriate indexing and other optimization techniques,

such as FP-Tree (Han et al., 2000), Prefix-tree (Grahne and Zhu, 2003), OP (Liu et al., 2002),

Array-based approach (Grahne and Zhu, 2003), and PrefixSpan (Pei et al., 2001; Pei et al., 2004).

Moreover, the increasing size of databases, the wide distribution of data, and the computational

complexity of data mining techniques motivate the development of parallel and distributed data

mining algorithms. Normally a parallel algorithm divides the data into partitions and then

processes them in parallel. The results obtained for the partitions are merged into the final

solution. This parallelism speeds up the mining process. Another approach to avoiding the

high cost of data mining is incremental data mining, where the mining process is

performed only on the newly arriving data, without having to mine the entire data again “from

scratch.” At lower cost, such algorithms modify knowledge incrementally to

revise and strengthen the previously discovered patterns or knowledge.
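The partition-and-merge idea and the incremental idea can both be sketched with simple item counting, the basic step behind frequent-pattern mining. In this sketch the partitions are processed one after another in a single process, so it only simulates the structure of a parallel algorithm; the transactions and function names are hypothetical.

```python
from collections import Counter

# Hypothetical transactions already mined, and a newly arriving batch.
TRANSACTIONS = [["milk", "rice"], ["milk", "bread"], ["rice", "fish"],
                ["milk", "rice", "fish"], ["bread"]]
NEW_BATCH = [["milk", "fish"], ["rice"]]


def count_items(transactions):
    """Count in how many transactions each item occurs."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return counts


def partitioned_count(transactions, n_partitions):
    """Split the data, count each partition independently, then merge."""
    if not transactions:
        return Counter()
    size = -(-len(transactions) // n_partitions)  # ceiling division
    partitions = [transactions[i:i + size]
                  for i in range(0, len(transactions), size)]
    # Each call below is independent and could run on a separate worker.
    partial = [count_items(p) for p in partitions]
    merged = Counter()
    for c in partial:
        merged.update(c)
    return merged


def incremental_update(counts, new_transactions):
    """Revise existing counts using only the new data, not the full data set."""
    updated = Counter(counts)
    updated.update(count_items(new_transactions))
    return updated


counts = partitioned_count(TRANSACTIONS, 2)
print(counts)
print(incremental_update(counts, NEW_BATCH))
```

Counting merges cleanly because counts are additive across partitions; pattern types whose results are not additive need a more careful merge step.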

The issues described in this section are the main challenges for the future of data mining

technology. Furthermore, some additional issues relate to applications, privacy, and the social

impacts of data mining.

1.8. Summary

With the development of database technology, including query and transaction processing,

database management systems have replaced primitive file processing. Further



progress has led to the increasing demand for efficient and effective advanced data analysis tools,

as a result of the explosive growth in data collected from applications, including business and

management, government administration, science and engineering, and environmental control.

Data mining is the task of discovering interesting patterns from large amounts of data. It is still a

young interdisciplinary field, drawing from several areas such as database systems, data

warehousing, statistics, machine learning, data visualization, information retrieval, and high-

performance computing. Other contributing areas include neural networks, pattern recognition,

spatial data analysis, image processing, signal processing, and many application fields, such as

business, economics, and bioinformatics.

A knowledge discovery process includes application definition, dataset preparation, data

cleaning and transformation, task selection, algorithm selection, data transformation, data

mining, pattern interpretation and evaluation, and knowledge consolidation and exploitation.

Interesting data patterns can be extracted from relational databases, transactional databases, as

well as other types of information repositories, such as spatial, time-series, sequence, text,

multimedia, and legacy databases, data streams, speech, movie, and the World Wide Web. Data

mining tasks include the discovery of concept/class descriptions, associations and correlations,

classification, prediction, clustering, trend analysis, outlier and deviation analysis, and similarity

analysis. The mined patterns should represent knowledge which is easily understood by humans;

valid on test data with some degree of certainty; and potentially useful and novel. To guide the

discovery process, measures of pattern interestingness can be either objective or subjective.

Objective measures include support, confidence, conviction, and lift, while subjective measures

are those specified manually by humans, such as novelty, interestingness, and creativeness.
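For reference, the four objective measures named above are commonly defined as follows for a rule $X \Rightarrow Y$ over a set of transactions $T$. The notation is an assumption introduced here, not taken from the text.

```latex
\begin{align*}
\mathrm{supp}(X \Rightarrow Y) &= \frac{|\{\, t \in T : X \cup Y \subseteq t \,\}|}{|T|} \\
\mathrm{conf}(X \Rightarrow Y) &= \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)} \\
\mathrm{conv}(X \Rightarrow Y) &= \frac{1 - \mathrm{supp}(Y)}{1 - \mathrm{conf}(X \Rightarrow Y)}
\end{align*}
```

Here $\mathrm{supp}(X)$ denotes the fraction of transactions containing every item of $X$. A lift of 1 indicates that $X$ and $Y$ occur independently, and conviction is undefined (taken as infinite) when the confidence is 1.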

Challenges in data mining and text mining include data and knowledge heterogeneity, interactive mining

and visualization, background knowledge exploitation, data preprocessing and pattern

evaluation, overfitting, and mining performance and scalability.

1.9. Historical Bibliography

The early collection of research works on knowledge discovery in databases or data mining was

edited by Piatetsky-Shapiro and Frawley (1991) and later by Fayyad et al. (1996). Since then,

many data mining books have been published, such as Predictive Data Mining by

Weiss and Indurkhya (1998). Hastie, Tibshirani, and Friedman (2001) wrote

a statistically oriented book, The Elements of Statistical Learning.

Han and Kamber (2004) published Data Mining: Concepts and Techniques. As

a more practical book with Java implementations, Witten and Frank (2003) illustrated

how to use the Weka system; the book is currently in its third

edition (Witten et al., 2011). For automatic association analysis, Piatetsky-Shapiro

(1989) describes analyzing and presenting strong rules discovered in databases using different

measures of interestingness. Based on the concept of strong rules, Agrawal et al. (1993)

introduced association rules for discovering regularities between products in large-scale

transaction data recorded by point-of-sale (POS) systems in supermarkets. Link mining is a

newly emerging research area that is at the intersection of the work in traditional link analysis

(Jensen and Goldberg, 1998), hypertext and web mining (Chakrabarti, 2002), relational learning

and inductive logic programming (Dzeroski and Lavrac, 2001), and graph mining (Cook and

Holder, 2000). Last, Kandel and Bunke (2004) provide a good introduction to mining techniques

in time series.



Exercise

1. Describe the differences between data mining and text mining.

2. Describe the steps of the data mining process.

3. Describe the major additional characteristics of temporal databases, compared to

conventional relational databases or transactional databases.

4. List some examples of data mining applications on medical databases.

5. Explain what interestingness is, how it is different from criteria to discover knowledge and

why we need to consider the interestingness in data mining.

6. Explain how to apply data mining to find interesting patterns from the Web. Here, three possible

mining sources on the web are (1) Web Log, (2) Web Content, and (3) Web Structure.

7. Explain how text mining helps in finding useful information from electronic documents, such

as Wikipedia.

8. Explain how classification and clustering differ from each other.

9. Explain which is harder, numeric prediction or classification, and why.

10. Give some examples of supervised learning, unsupervised learning and semi-supervised

learning.

11. Explain the effects of outliers on mining results.

12. Plot a line graph, a bar graph, and a pie graph for the following data related to the floods in Thailand

during 2011.

Water Level (cm)

Month            Sukhothai  Ayutthaya  Pathumthani  Bangkok
January 2011           100         50           20       40
February 2011          200        100           50       60
March 2011             300        150          100       80
April 2011             500        200          150       60
May 2011               600        300          200       70
June 2011              500        400          300       80
July 2011              450        500          400      100
August 2011            400        650          550      150
September 2011         300        800          600      200
October 2011           250        700          700      400
November 2011          300        600          850      500
December 2011          200        550          750      550