
AI versus Statistics: Some Common Topics

Karina Gibert (1), Jorge Rodas (2), Javier Gramajo (2)
[email protected] [email protected] [email protected]

(1) Technical University of Catalonia. Statistics and Operations Research Department.

Pau Gargallo 5. 08028 Barcelona, Spain.

(2) Technical University of Catalonia. Informatic Systems and Languages Department.

Jordi Girona Salgado 1-3, M. C6, D. 201. 08034 Barcelona, Spain.

January 9, 2008

Abstract

This document addresses some topics traditionally studied by both Statistics and Artificial Intelligence and tries to establish the resemblances and differences between the techniques proposed in each field. Finally, the possibilities of those techniques in the modern information society are discussed.

Keywords: Clustering, Classification, Artificial Intelligence, Statistics, Data Mining, Knowledge Discovery.


1 Introduction

The document focuses on some formal problems which have been studied both from a Statistical point of view and from an Artificial Intelligence point of view.

Both sciences have proposed different techniques to solve the same situations, and a discussion of those solutions is presented here.

The authors do not pretend to be exhaustive, but rather to suggest a reflection on some application aspects of those techniques.

The structure of the document is the following: first of all, a very brief history of both Statistics and Artificial Intelligence is introduced in sections §2 and §3, which allows the reader to understand the following sections. An introduction to Machine Learning is given in section §4 and an overview of Inductive Learning and Conceptual Clustering in section §5. Some solutions proposed by Statistics and Artificial Intelligence are discussed, and their differences and resemblances analyzed, in sections §6 and §7. Finally, a brief summary of the history of Knowledge Discovery in Databases and Data Mining, new trends related to the information society and new technologies, and some expectations about the future are introduced in section §8.

2 Statistics Essentials

Statistics is a very ancient science and covers a very broad field, from pure probability theory to the modern theories of exact statistics. However, in this section we present only a few historical references, mainly those which are relevant for this document. For a detailed history of Statistics see Droesbeke [15] and Everitt [18].

The term Statistics derives from the Latin word status, which refers to the political and social situation, to the State. The origins of Statistics can be set in the first counting-oriented works of the ancient empires: China, Egypt, Greece and others soon experienced the need to count inhabitants, houses, families, the quantity of fruit picked in a certain year, etc.

So, Statistics began as the science of collecting the economic and demographic data which were relevant to the State. It evolved over the years and today can be defined as the science concerned with the collection and analysis of data, in order to extract the information contained in them and to present it in an understandable and synthetic way.

At the beginning it held only a descriptive purpose. Later, sampling techniques were used and knowledge about the whole population was inferred from the analysis of a sample. Sampling theory and mathematical Statistics are concerned with those goals.

The end of the 19th century constituted a fertile scientific period. It was Darwin's time; in that period Galton presented (1877) the first works on regression analysis, and Pearson presented, among other works, a preliminary version of principal component analysis in 1901. His main disciple was Fisher (1890-1962), whose works can be considered the establishment of modern statistics.

Around 1930, Psychometry set out new problems which triggered the development of multivariate data analysis. The measurement of latent factors which could not be directly observed, like intelligence or memory, motivated the factor analysis due to Spearman (1904). Hotelling (1933) generalized these ideas and Pearson's in the multivariate technique of principal component analysis. Those techniques are mainly descriptive and try to show a perspective of the relationships among the variables as a whole.

Along this line, Fisher and also Mahalanobis presented, in 1936, the first works on discriminant analysis, in which there exists a response variable telling the class of every object; the best linear combination of all the variables for distinguishing the classes is found.

On the other hand, the formation of and distinction between different classes of objects (clustering) has been in use for a very long time. Statistics faced this problem, and in 1757 Adanson introduced the principle of using distances between objects to group the most similar ones into classes, iterating the process with the classes to obtain a hierarchy. Clustering techniques require lots of calculations, and they regained relevance when computers became powerful: in 1963 Sokal and Sneath presented numerical taxonomy [68], which can be considered the first modern formulation of clustering. Modifications of these basic algorithms have been presented up to today, but the basic ideas still remain.

3 About Artificial Intelligence

Artificial Intelligence (AI) is also a broad field, but much younger than Statistics. AI as a formal discipline has been around for a little over thirty years. We can consider that the establishment of AI as a discipline dates from the famous Dartmouth Conference of 1956, when McCarthy coined the term.

Artificial Intelligence is concerned with getting computers to do tasks that require human intelligence.

A reasonable characterization of the general field is that it is intended to make computers do things that, when done by people, are described as requiring intelligence [5].

We believe that AI techniques aim to build software and/or hardware that does what humans and animals do. Basically, AI wants machines to be able to learn in the way that humans or animals learn, to communicate, to reason, to classify, to make decisions, etc.

However, there are many tasks which we might reasonably think require intelligence, such as complex arithmetic, which computers can do very easily. Conversely, there are many tasks that people do without even thinking, such as recognizing a face, which are extremely complex to automate. AI is concerned with these difficult tasks, which seem to require complex and sophisticated reasoning processes and knowledge.

The history of Artificial Intelligence can be graphically represented by a nice metaphor: a simple plant, such as the Nopal [58]. The Nopal (Opuntia spp.), also known as the prickly pear cactus, is a member of the cactus family (see fig. 1).

AI, like any plant, has:

• roots: parallelism and von Neumann;

• stems: micro-distributed AI, evolutionary AI, etc.;

• fruits: natural language processing techniques, neural networks, etc.;

• nutrients: neurophysiology, linguistics, etc.; and

• supports: robot building and simulation, computational techniques and cognitive science.

The beginnings of AI were disorganized, but today it is a systematized discipline. At first, AI developed under the von Neumann paradigm and sequential computational techniques; it was naturally supported by the mathematics and logic disciplines developed at that moment. Then AI was based on the symbol idea, introduced by Newell and Simon in 1956. In their work on the Logic Theorist [59], a program that proved logic theorems by searching a tree of subgoals was presented. The program made extensive use of heuristics to prune its search in a symbolic space of solutions. With this success, the idea of heuristic search soon became dominant within the still tiny AI community. In 1961, Minsky broke AI down into five key topics: search, pattern recognition, learning, planning and induction [54]. Most of the serious work in AI according to this breakdown was concerned with search.

Eventually, after much experimentation, search methods became well understood, formalized and analyzed, and became celebrated as the primary method of Artificial Intelligence.

At the end of the era of establishment, in 1963, Minsky generated an exhaustive annotated bibliography [55] of the literature directly concerned with the construction of artificial problem-solving systems. There are two main points of interest here: an introduction to artificial problem-solving systems, and an AI bibliography. This bibliography includes many items on cybernetics, neuroscience, bionics, information and communication theory, and first-generation connectionism.

Then advances in AI were explosive, especially when the first successes of its application to diagnosis-oriented problem solving (e.g. MYCIN, 1976, which diagnosed infections) and other techniques such as expert systems, knowledge representation, machine learning (see §4), reasoning, natural language processing, etc. were publicized. However, symbolic representations showed serious limitations when facing complex, big, real problems, mainly because almost all AI problems are NP-complete. As a consequence, towards the late 70s the production of these techniques began to decrease, and other paradigms were explored to find new solutions to the typical AI problems. What the "new" techniques actually do is change the tools used to solve problems, but the objectives remain the same.

Considering the parallel paradigm (the computer's parallel architecture), something called micro-distributed AI appeared around the 70s, and some authors called it, by its implicit metaphor, artificial neural networks (ANN). ANN are architectures of networks of simple operators that imitate the structure of natural neural networks (NNN). ANN were born in 1943, when McCulloch and Pitts described them for the first time. These architectures were introduced in the AI context and were used to solve the typical AI problems. ANN are based on a non-representational paradigm, where the concept of programming is substituted by the concept of training. At the beginning, using these techniques gave some successes (e.g. pattern recognition problems). Then Minsky and Papert published Perceptrons (1969), where they showed that ANN had extremely limited representation ability. That is why many people deserted ANN, and only a few researchers continued despite Minsky and Papert's criticism, most notably Teuvo Kohonen, Stephen Grossberg, James Anderson, Kunihiko Fukushima, Rumelhart, and McClelland. Interest in ANN re-emerged only after some of their important theoretical results were attained in the early eighties (1982).

More recent approaches were inspired by the learning algorithm known as back propagation.

Back propagation has some problems: it is slow to learn in general, and there is a learning rate which needs to be tuned by hand in most cases. These problems combine to make back propagation, which is the cornerstone of modern neural network research, inconvenient for use in embodied or situated systems.
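To make the hand-tuned learning rate concrete, here is a minimal sketch in Python (an illustration, not any historical implementation): a tiny network trained by back propagation on XOR, where the constant LR is exactly the knob that must be tuned by hand. The layer sizes, seed and epoch count are illustrative assumptions.

```python
# A minimal back-propagation sketch: a 2-4-1 network learning XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # hidden -> output
LR = 0.5                                         # learning rate, tuned by hand

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(20000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: gradients of the squared error through the sigmoids
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent updates; too large an LR diverges, too small crawls
    W2 -= LR * h.T @ d_out;  b2 -= LR * d_out.sum(axis=0)
    W1 -= LR * X.T @ d_h;    b1 -= LR * d_h.sum(axis=0)

print(out.round(2))  # should approach [0, 1, 1, 0]; may need another seed/LR
```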

In between the parallel paradigm and the symbol paradigm, evolutionary AI and macro-distributed AI appeared. The first is characterized by genetic algorithms, and the second by multiagent systems and other techniques.

Evolutionary AI, with a parallel heuristic search inside a big solution space, searches for optimum solutions based on a reward-and-punishment tactic. Macro-distributed AI shares with micro-distributed AI the idea of a network of operators, but now it is not a simple network, since it is composed of cooperative computational agents of great complexity themselves. Multiagent systems are at the basis of reactive robots, which is the hard area of AI.

4 Machine Learning

Machine Learning is a set of techniques that involves searching a very large space of possible hypotheses to determine the one that best fits the observed data and any prior knowledge held by the learner. The goal of Machine Learning is to create systems that improve their own task performance through the acquisition of experience from data.

Machine Learning draws on ideas from a diverse set of disciplines: artificial intelligence, probability, statistics, computational complexity, information theory, and philosophy [56].

Data analysis techniques traditionally used for such tasks include regression analysis, cluster analysis, numerical taxonomy, multidimensional analysis, other multivariate statistical methods, stochastic models, time series analysis, nonlinear estimation techniques, and others (e.g., Daniel and Wood, 1980; Tukey, 1986; Morganthaler and Tukey, 1989; Diday, 1989; Sharma, 1996). These techniques are used to solve many practical problems. However, they are oriented to the extraction of quantitative and statistical characteristics of the data.

For example, a statistical analysis can determine the covariances and correlations between variables in the data. However, it cannot characterize dependencies at an abstract, conceptual level. To perform this task, a data analysis system has to be equipped with a substantial amount of background knowledge, and be able to perform symbolic reasoning tasks involving that knowledge and the data.
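As a minimal sketch of that purely quantitative characterization, the following Python fragment computes the covariance and correlation matrices of a small data table (the variables, height and weight, are illustrative assumptions); note that it says how strongly variables co-vary, but nothing about why.

```python
# Covariance and correlation: quantitative summaries with no conceptual level.
import numpy as np

data = np.array([[170, 65],
                 [180, 80],
                 [160, 55],
                 [175, 72]], dtype=float)   # rows: objects, columns: variables

cov = np.cov(data, rowvar=False)            # pairwise covariances
corr = np.corrcoef(data, rowvar=False)      # pairwise correlations in [-1, 1]
print(cov)
print(corr)   # says *how strongly* the variables co-vary, not *why*
```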

In order to overcome some of these limitations, researchers looked for ideas and methods developed in the Machine Learning field, which is a natural source of ideas for this purpose. The essence of Machine Learning is to develop computational models for acquiring knowledge from facts and background knowledge. These efforts have been gathered in a new research area, known as Data Mining or Knowledge Discovery in Databases (e.g. Michalski, Baskin and Spackman, 1982; Zhuravlev and Gurevitch, 1989; Michalski et al., 1992; Van Mechelen et al., 1993; Fayyad et al., 1996; Evangelos and Han, 1996).

5 Inductive Learning and Conceptual Clustering Overview

We can consider two main learning tasks in Machine Learning:

• Inductive Learning: Human beings create patterns in an attempt to understand their environment. This process is called inductive learning [34].

In the learning process, humans and animals (cognitive systems) observe their environment and recognize similarities between objects and events in it. They group similar objects into classes and make rules that predict the behavior of yet unclassified items of each class.

• Conceptual Clustering: It is a Machine Learning task defined by Michalski in 1980. A conceptual clustering system accepts a set of object descriptions (events, observations, facts) and produces a classification scheme over the observations. These systems do not require a "teacher" to preclassify objects, but use an evaluation function to discover classes with a "good" conceptual description.

The automation of the Inductive Learning and Conceptual Clustering processes has been extensively researched in Machine Learning (see [56]), an Artificial Intelligence research area, as noted above.

Having defined Inductive Learning and Conceptual Clustering, we can now introduce two learning system techniques: Supervised Learning and Unsupervised Learning, related to Inductive Learning and Conceptual Clustering respectively.

Supervised Learning. Techniques in supervised learning look for definitions of the classes given by the teacher. They summarize the training set as a set of descriptions, either of the teacher's classes or of newly discovered classes.

There are several forms of representing the patterns that can be discovered by machine learning, and each one has techniques that can be used to infer the output structure from the data. These structures can take the form of decision trees and classification rules, which are the basic knowledge representation styles that many machine learning methods use. Other examples of representations are: complex varieties of rules, special forms of trees, and instance-based representations. Finally, some learning schemes generate clusters of instances.

Unsupervised Learning. In unsupervised learning, or learning from observation and discovery, the system has to find its own classes in a set of states, without any help from a teacher. In practice, the system has to find some clustering of the set of states S. The data mining system is supplied with objects, as in supervised learning, but now no classes are defined. The system has to observe the examples and recognize patterns (e.g. class descriptions) by itself. Hence, this learning form is also called learning by observation and discovery (see [49]). The result of an unsupervised learning process is a set of class descriptions, one for each discovered class, that together cover all objects in the environment. These descriptions form a high-level summary of the objects in the environment.

Holsheimer and Siebes [35] believe that unsupervised learning is actually no different from supervised learning when only positive examples of a single class are provided. Thus, we search for a description that describes all objects in the environment. If different classes exist in the environment, this description will be composed of the descriptions of these newly discovered classes.

6 Clustering

In apprehending the world, men constantly employ three methods of organization, which pervade all of their thinking:

• the differentiation of experience into particular objects and their attributes;

• the distinction between whole objects and their parts; and

• the formation and distinction of different classes of objects.

Classically, the third task is defined as a clustering problem, i.e. the problem of identifying the natural distinguishable groups of similar objects in a set.

As said before, humans have been doing this almost from the beginning. And at the beginning, finding the clusters of a group of objects was almost an art, with an important dose of common sense from the person doing it.

In the 18th century clustering problems became important in the context of Biology and Zoology, in order to organize the species of living beings. The most famous work along this line was the clustering of living beings made by Linnaeus, still valid today.

In 1757, for the first time, Adanson established the principles of an objective and systematic way of making clusters in a set of objects. The method was based on measuring distances between objects; the most similar objects were clustered together and the process was iterated on the classes. Finally a hierarchy among the classes is produced. The most used representation of a clustering process is the hierarchical tree, also called a dendrogram in the statistical context.

However, using these techniques was tedious, because of the high number of calculations required.

When computers became powerful, clustering techniques received a major boost. In 1963, Sokal and Sneath presented the first modern formulation of numerical taxonomy [68]. The basic algorithm was that of Adanson, but many ways of measuring the distance between objects (Euclidean, χ², inertias, ...) and of creating the classes were studied. Other families of clustering methods appeared:

• partition methods, in which no dendrogram is formed (dynamic clouds, Diday [12]; k-means, MacQueen [48]; agglomerative methods, Volle [70]; ...), a k-means sketch is given after this list;

• pyramidal clustering, in which the tree is not binary anymore (Diday and Brito [13]);

• additive trees, based on graphs (Roux [66]);

but all of them shared some common characteristics: the use of distances between individuals, with the metric structure this implies, and the use of numerical variables to describe the objects.
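As an illustration of the partition family, here is a minimal Lloyd-style k-means sketch in Python (MacQueen's original formulation updates the means incrementally; the value of k, the seed and the toy points are assumptions):

```python
# A partition method: k-means, with no dendrogram produced.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each object to its nearest center
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centers[None, :], axis=2), axis=1)
        # move each center to the mean of its class
        # (empty classes are not handled in this sketch)
        new = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
labels, centers = kmeans(pts, k=2)
print(labels, centers)
```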

However, it is clear that in certain applications some non-numerical variables are really relevant. The introduction of categorical variables for clustering required the use of special distances (χ², Benzecri [3], ...) and aroused discussion on the interpretation of the results.

On the other hand, from the AI point of view, research in learning also included the design of algorithms for identifying the clusters of a given set of objects, as one of the intelligent human tasks to be performed by machines. In that context statistical techniques were considered poor: on the one hand because qualitative variables (categorical ones in statistical terminology) were mainly used in that field; on the other hand, because the main goal for AI was not only to identify the classes, but to be able to understand the process used for doing so. In that sense, statistical methods, based on mathematical concepts (i.e. metrics), were not understandable enough. Using these arguments, Michalski presented conceptual clustering in 1983, in which logic concepts were associated with classes and generalized (on the basis of logic operations) in order to include the maximum number of observed entities (or objects) and the minimum number of non-observed ones.

Conceptual clustering is a completely new approach to the clustering problem. The basis is not mathematics anymore, but logic. The elements to deal with are not measures on the objects anymore, but qualitative values and logic concepts. The goal is not only the discovery of classes, but being able to understand the clustering process which produced them, and also being able to obtain conceptual descriptions of the produced classes.

On the basis of this algorithm, some other methods were proposed: COBWEB (Fisher [21], 1987) performs an incremental conceptual clustering based on a utility measure related to a classical information measure (Gluck and Corter's category utility).
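Category utility, as commonly stated for COBWEB (our transcription of Gluck and Corter's definition, not taken from this document), rewards partitions whose classes increase the number of attribute values that can be correctly guessed:

```latex
% Category utility of a partition {C_1, ..., C_K} over attributes A_i
% with values V_ij: the expected gain in correctly guessable attribute
% values, averaged over the K classes.
CU(\{C_1,\dots,C_K\}) = \frac{1}{K}\sum_{k=1}^{K} P(C_k)
  \left[ \sum_{i}\sum_{j} P(A_i = V_{ij} \mid C_k)^2
       - \sum_{i}\sum_{j} P(A_i = V_{ij})^2 \right]
```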

Autoclass (Cheeseman and Stutz, 1995) is a comprehensive Bayesian clustering scheme that uses the finite mixture model, with prior distributions on all parameters.

As a matter of fact, two main families of algorithms can be distinguished: statistical ones were originally designed for numerical variables, and they are based on distances between objects and the properties of metric spaces (or similar); AI ones were originally designed for symbolic management (categorical variables), and they are based on logic conceptual descriptions, generalization of concepts, and measures of quality related to information gain.

Both deal with the same problem; both are used to identify the natural classes of a given group of objects when no prior knowledge is available. But as can be seen, the nature of those methods is really different.

7 Classifying

Classifying is the problem of having a set of well-known classes and assigning the corresponding class label to a given unclassified object. Obviously, this is different from a clustering problem, since in clustering the structure of the target domain is unknown and must be discovered. In classification the structure of the target domain is known: the existing classes are known and the goal is to characterize them in order to decide the class of a new object.

Classification typically involves two steps: first the system is trained on a set of data, and then it is used to classify a new set of unclassified cases.


From a Statistical point of view, this is solved using discriminant analysis techniques, introduced by Fisher and Mahalanobis in 1936. The input of the algorithm is a set of examples (i.e. objects whose class is already known). The class of each object is considered the response variable. The problem is reduced to finding the linear combination of variables which best fits the response variable, that is, to finding the linear discriminant function. Once this function is found, and given the values of a new object, identifying its class reduces to using the discriminant function to determine the class number. The discriminant function is found by maximizing the inter-class inertia and minimizing the intra-class inertia. Of course, the solution is found using the algebraic paradigm, by finding the eigenvectors of certain matrices. The main problem of these techniques is facing domains in which a non-linear discriminant function is required.
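A minimal sketch of this algebraic recipe follows, with toy data as an illustrative assumption: build the within-class and between-class scatter matrices, take the leading eigenvector of S_w^{-1} S_b as the discriminant direction (Fisher's linear discriminant), and assign a new object to the class whose projected mean is closest.

```python
# Fisher's linear discriminant via eigenvectors of inv(S_w) @ S_b.
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # class 0
              [4.0, 4.5], [4.2, 4.0], [3.8, 4.8]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

mean_all = X.mean(axis=0)
S_w = np.zeros((2, 2))   # within-class (intra-class) scatter
S_b = np.zeros((2, 2))   # between-class (inter-class) scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_w += (Xc - mc).T @ (Xc - mc)
    d = (mc - mean_all).reshape(-1, 1)
    S_b += len(Xc) * (d @ d.T)

# leading eigenvector of inv(S_w) S_b gives the discriminant direction
vals, vecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w = vecs[:, np.argmax(vals.real)].real

# classify a new object by which projected class mean is closer
x_new = np.array([2.0, 2.5])
m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
label = 0 if abs(x_new @ w - m0 @ w) < abs(x_new @ w - m1 @ w) else 1
print("class", label)
```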

Neural nets seem to be able to find non-linear discriminant functions using a big set of objects described exclusively by numerical variables. However, they act as black boxes and do not provide clear conceptual interpretations of the classes.

From the AI point of view, classifying problems are included in supervised learning, since there is a training set where the classes are known in advance, and it is used to learn how to label new objects.

Decision trees approach the problem of learning from a set of instances. They were introduced by Quinlan with the ID3 algorithm in 1986 [61]. The main idea is to build a decision tree: the nodes in a decision tree involve testing (either comparing a particular attribute with a constant, comparing a pair of attributes, or evaluating some function of several attributes); leaf nodes represent classes which include all instances that reach the leaf, or represent a probability distribution over all possible classifications. To find the decision tree, ID3 searches from simple to complex hypotheses to test in a node until one consistent with the data is found; consistency is evaluated by a measure related to entropy. To classify a new instance, it is routed down the tree according to the values tested in successive nodes, and when a leaf is reached the instance is classified into the class represented by the leaf. The great advantage of decision trees is that the meaning of the classes is clear from the sequence of tests evaluated from the root of the tree to the leaf which represents the class. The main problem is that when a large number of variables is considered, the tree gets too big and heuristic criteria are needed to prune it and to guarantee enough objects in the leaves so as to represent real classes.
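The entropy-related test selection at the heart of ID3 can be sketched as follows (the toy weather-style data and attribute names are illustrative assumptions); the attribute with the largest information gain would be placed at the current node, and the procedure would recurse on each branch.

```python
# Entropy and information gain: ID3's criterion for choosing node tests.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

rows = [{"outlook": "sunny", "windy": False},
        {"outlook": "sunny", "windy": True},
        {"outlook": "rain",  "windy": False},
        {"outlook": "rain",  "windy": True}]
labels = ["no", "no", "yes", "yes"]

for attr in ("outlook", "windy"):
    print(attr, round(information_gain(rows, labels, attr), 3))
# "outlook" has gain 1.0, "windy" has gain 0.0: ID3 tests "outlook" first.
```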


Sometimes the growth of a decision tree is too big; this is an important disadvantage. There is some research on methods for converting decision trees into other representations (see [62]).

C4.5, a modification of ID3, appeared in 1993 [64]. It starts with large sets of cases belonging to known classes. The cases, described by any mixture of categorical and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The J48 classifier, from 1999, is a decision tree implementation with ten-fold cross-validation estimates of its performance. The ID3 family of algorithms (ID3, C4.5, J48) infers decision trees by growing them from the root downward, greedily selecting the next best attribute for each new decision branch added to the tree.

Bayesian methods provide one of the oldest ways to perform supervised classification. A Bayesian classifier is trained by estimating the conditional probability distributions of each attribute, given the class label, from the database.
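A minimal sketch of this training step follows, under the common naive Bayes independence assumption and with toy records as assumptions; note that a real implementation would smooth the zero counts (e.g. with a Laplace correction).

```python
# Naive Bayes: estimate P(class) and P(attribute=value | class) by counting.
from collections import Counter, defaultdict

records = [({"outlook": "sunny", "windy": "no"},  "play"),
           ({"outlook": "sunny", "windy": "yes"}, "stay"),
           ({"outlook": "rain",  "windy": "no"},  "play"),
           ({"outlook": "rain",  "windy": "yes"}, "stay")]

class_counts = Counter(label for _, label in records)
# cond[(attr, value, label)] = number of records with that triple
cond = defaultdict(int)
for attrs, label in records:
    for attr, value in attrs.items():
        cond[(attr, value, label)] += 1

def classify(attrs):
    n = len(records)
    best = None
    for label, nc in class_counts.items():
        p = nc / n                                    # prior P(class)
        for attr, value in attrs.items():
            p *= cond[(attr, value, label)] / nc      # P(attr=value | class)
        if best is None or p > best[0]:               # unsmoothed: zeros kill p
            best = (p, label)
    return best[1]

print(classify({"outlook": "rain", "windy": "yes"}))  # -> "stay"
```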

Unfortunately the learning efficiency, so precious to this success, is lost when the database is not complete, that is, when it records some entries as unknown. In this case the exact estimation of each conditional probability distribution required to define the classifier is a mixture of the estimations that can be computed in each of the databases generated by the combination of all possible values of the missing data.

The Robust Bayesian Classifier (RoC) gives several measures; one of them is the coverage measure, which is the proportion of cases that the classifier is able to assign to a class, computed as the ratio between the number of classified cases and the total number of cases in the database.

8 Knowledge Discovery in Databases and Data Mining

What should the AI and Machine Learning community call the field: Knowledge Discovery in Databases or Data Mining? The name data mining, which was already used in the database community, seemed unsexy, and besides, statisticians used data mining as a pejorative term to criticize the activity. Mining is unglamorous and gives no indication of what we are mining for. Knowledge mining and knowledge extraction did not seem much better, and database mining™ was trademarked by HNC for their Database Mining Workstation™. So we have Knowledge Discovery in Databases, which emphasizes the discovery aspect and its focus on knowledge. The term "Knowledge Discovery in Databases" (KDD for short) became popular in the AI and Machine Learning community. However, the term data mining became much more popular in the business press. As of November 1999, a search on www.altavista.com gave about 100,000 pages for data mining, compared to 18,000 for knowledge discovery. Currently, both terms are used essentially as synonyms, as in the name of the main journal of the field, Data Mining and Knowledge Discovery (Kluwer). Sometimes knowledge discovery process is used to describe the overall process, including all the data preparation and postprocessing, while data mining refers to the step of applying the algorithms to the clean data [19].

Around 1989, some topics such as fuzzy rules, learning from relational (structured) data, integrated systems, privacy, etc. attracted the attention of the AI and Machine Learning community. Expert database systems and the discovery of fuzzy rules, for example, which seemed like good ideas at the time, have disappeared from the current research lexicon because they were examples of technology without a clear application. Some important areas turned out to be much harder than the Machine Learning community thought in 1989. Learning from structured data is still very difficult, and the current best methods from the Inductive Logic Programming community [http://www.cs.bris.ac.uk/~ILPnet2/] are still too slow to be used on large databases. Interestingness of discovered patterns is still a hard problem, and it still requires a significant amount of domain knowledge. CYC [43], which held a lot of promise in 1989, did not produce the expected results. On the other hand, we now have the Internet, which is the largest repository of general knowledge, although still with a very imperfect query system. However, great progress was achieved in faster hardware and bigger disks, enabling data miners to deal with much larger problems. Of course, the major development in computers over the last 10 years is the revolution brought by the Internet. It has shifted the attention of data miners to problems of e-commerce and Internet personalization. It also brought much more attention to Text Mining, and resulted in a number of good text mining systems. Another major advance was a holistic understanding of the entire Knowledge Discovery Process [4], which encompasses many steps, from data acquisition, cleaning and preprocessing, to the discovery step, to post-processing of the results and their integration into operational systems.

8.1 From General Tools to Domain-Specific Solutions

In 1989 there were only a few data mining tools, produced by researchers to solve a single task, such as the C4.5 decision tree [64], the SNNS neural network, or parallel-coordinate visualization [36]. These tools were difficult to use and required significant data preparation.

The second generation of data mining systems, called suites, was developed by data mining vendors starting around 1995. These tools were driven by the realization that the knowledge discovery process requires multiple types of data analysis, and that most of the effort is spent in data cleaning and preprocessing. Suites such as SPSS Clementine, SGI MineSet, IBM Intelligent Miner, or SAS Enterprise Miner allowed the user to perform several discovery tasks (usually classification, clustering, and visualization) and also supported data transformation and visualization. An important advance, pioneered by Clementine, was a GUI which allowed users to build their knowledge discovery process visually.

By 1999 there were over 200 tools available for many different tasks (see http://www.kdnuggets.com/software/). However, even the best data mining tools addressed only a part of the overall business problem. Data still had to be extracted from legacy databases, cleaned and preprocessed, and model results had to be delivered to the right channels and, most importantly, integrated with the specific application or business logic. Successful development of such applications in areas like direct marketing, telecom, and fraud detection led to the emergence of data-mining-based vertical solutions. Examples of such systems include HNC Falcon for credit card fraud detection, IBM Advanced Scout for basketball game analysis, and the NASD Regulation Advanced-Detection System [37].

8.2 New Trends and Some Expectations for the Future

It is clear that nowadays new technologies have significantly increased our capabilities for producing, collecting and storing data. Enormous quantities of data are available to be analyzed in short times. This is an important challenge for Statistics, Artificial Intelligence, Information Systems, Data Visualization and related fields.

In this document, it has been shown that Statistical-like techniques are mainly useful for numerical variables and produce results based on algebra, while AI-like ones are mainly useful for qualitative variables and produce results based on logic.

Describing the structure of, or obtaining knowledge from, big sets of data is known to be a difficult task. The combination of data analysis techniques (clustering among them), inductive learning (knowledge-based systems), database management and multidimensional graphical representation should produce benefits along this line.

Several software packages are starting to provide data mining tools to support these situations (Clementine and Intelligent Miner are among the most famous nowadays). They mainly present a juxtaposition of existing techniques, allowing comparison of results and selection of the best method for each case. An interesting didactic effort is WEKA (Waikato Environment for Knowledge Analysis). This system provides a uniform interface, written in Java, to many different learning algorithms. It was introduced by Witten and Frank in their book Data Mining in 1999 (see [72]).

However, in real applications it is usual to work with very complex domains [29], such as mental disorders, sea sponges [32], thyroid dysfunctions [27], etc., where databases with both qualitative and quantitative variables appear, and where the expert(s) have some prior knowledge (usually partial) of the structure of the domain, which is hardly taken into account by clustering methods and which is difficult to include in a complete Knowledge Base.

Facing the automated knowledge discovery of ill-structured domains raises problems from both the machine learning and the clustering points of view. For example, a Knowledge Base cannot be constructed, since the domains are too complex and large quantities of implicit knowledge are managed. Classical learning algorithms tend to be NP-complete and cannot work with such big data matrices. Statistical clustering algorithms need artificial transformations of the data to manage numerical and categorical variables simultaneously, and this produces problems with the interpretation of the results. In the real world, mixed situations between supervised and unsupervised learning are found, and no algorithms can deal with them properly.
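One way to sidestep that artificial transformation is a mixed distance that treats each kind of variable natively. The following Gower-style sketch is an illustrative choice on our part, not the method of [26]; the variable names and ranges are assumptions.

```python
# A Gower-style mixed distance: numerical variables compared by normalized
# absolute difference, categorical variables by simple mismatch.
def mixed_distance(a, b, ranges):
    """a, b: dicts of variable -> value; ranges: numeric variable -> span."""
    total = 0.0
    for var in a:
        if var in ranges:                        # numerical variable
            total += abs(a[var] - b[var]) / ranges[var]
        else:                                    # categorical variable
            total += 0.0 if a[var] == b[var] else 1.0
    return total / len(a)

x = {"age": 30, "sex": "F", "diagnosis": "A"}
y = {"age": 50, "sex": "M", "diagnosis": "A"}
print(mixed_distance(x, y, ranges={"age": 60}))  # (20/60 + 1 + 0) / 3
```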

Real cooperation between techniques is urgently needed to improve the existing methods and to explore new possibilities for facing such difficult analyses. Some work has already been done along this line. As an example, clustering based on rules is a methodology developed in [26] with the aim of finding the structure of ill-structured domains. In our proposal, a cooperative combination of clustering and inductive learning is focused on the problem of finding and interpreting special patterns (or concepts) in large databases, in order to extract useful knowledge to represent real-world domains. It gives better performance than traditional clustering algorithms or the knowledge-based systems approach in analyzing ill-structured domains. See [26] for details.

The main idea is to take advantage of the partial prior knowledge that the expert can make explicit and to use a statistical clustering algorithm to analyze those parts of the domain for which no prior knowledge is provided. In the end a unique dendrogram is built with all the objects, and the results tend to guarantee the meaningfulness of the resulting classes to the expert's eyes.

This is only one example, but in our opinion this should be the trend. Juxtaposition of different kinds of methods will not help us to face new situations. Cooperative combination of those methods is likely to be much more fruitful for building a new generation of real Knowledge Discovery and Data Mining techniques, ready to face the increasingly big and complex databases produced in the Information Society.

Finally, here are some expectations about the future, provided by the AI and Machine Learning community and by ourselves.

Over the next 10 years, we expect:

• to see continuing progress in faster CPUs and bigger disks, along with ubiquitous and wireless net connectivity;

• standards to appear for different parts of the knowledge discovery process, greatly facilitating industry growth; there are already proposed standards like CRISP-DM for the data mining process, PMML for predictive model exchange, and Microsoft OLE DB for Data Mining;

• significant applications will appear in e-commerce, especially with real-time personalization; there will be significant use of intelligent agents;

• great progress in pharmaceuticals and new drugs enabled by knowledge discovery and bioinformatics;


• there will be tighter integration of knowledge discovery modules with database systems, and most database systems will include a set of discovery operations;

• and also that the data mining industry will overcome the hype stage and merge with the database industry.

References

[1] Adriaans P. and Zantinge D.: Data Mining. Addison-Wesley, 1996.

[2] Aha D., Kibler D. and Albert M.: Instance-based learning algorithms. Machine Learning, 6, 1991, 37-66.

[3] Benzecri J.P.: L'analyse des données. Tome 1: La Taxinomie; Tome 2: L'analyse des correspondances. 1st ed. 1973; Dunod, Paris, France, 1980.

[4] Brachman R. and Anand T.: The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In Advances in Knowledge Discovery and Data Mining, ed. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, AAAI/MIT Press, 1996.

[5] Brooks R.A.: Intelligence without Reason. IJCAI-91, 1991.

[6] Cawsey A.: Databases and Artificial Intelligence 3. Artificial Intelligence Segment, 1994. http://www.cee.hw.ac.uk/~alison/ai3notes/all.html

[7] Clark P. and Boswell R.: Rule induction with CN2: some recent improvements. In Proc. Fifth European Working Session on Learning, Springer, 1991, 151-163.

[8] Clark P. and Niblett T.: The CN2 induction algorithm. Machine Learning, 3(4), 1989, 261-283.


[9] Cover T.M. and Hart P.E.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 1968, 21-27.

[10] Dasarathy B.V.: Nearest Neighbor (NN) Norms: NN pattern classification techniques. IEEE Computer Society Press, Los Alamitos, CA, US, 1990.

[11] De Raedt L. and Dehaspe L.: Clausal Discovery. Machine Learning, 26, 1997, 99-146.

[12] Diday E.: La méthode des nuées dynamiques. Stat. App., 19(2), 19-34.

[13] Diday E., Brito P. and Mfoumou M.: Modelling probabilistic data by conceptual pyramidal clustering. Proc. of 4th Int'l Workshop on AI & Stats, Florida, US, 1993, 213-218.

[14] Dietterich T.G. and Michalski R.S.: A comparative review of selected methods for learning from examples. In Michalski et al. (see [50]), 41-81.

[15] Droesbeke J.J.: Histoire de la Statistique. Université Libre de Bruxelles / Presses Universitaires de France, Paris, France, 1990.

[16] Dudani S.A.: The distance-weighted k-nearest neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, 6(4), 1975, 325-327.

[17] Dzeroski S. and Flach P.: Network of Excellence in Inductive Logic Programming ILPnet2. Funded by the European Commission under contract INCO 977102, 1997. http://www.cs.bris.ac.uk/~ILPnet2/

[18] Everitt B.: Cluster Analysis. Heinemann Educational Books Ltd., London, 1981.

[19] Fayyad U., Piatetsky-Shapiro G. and Smyth P.: From Data Mining to Knowledge Discovery in Databases (a survey). AI Magazine, 17(3), Fall 1996, 37-54.


[20] Fayyad U.: From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996. ISBN 0-262-56097-6.

[21] Fisher D.: Knowledge Acquisition Via Incremental Conceptual Clustering. Machine Learning, 2, Kluwer Academic Publishers, Boston, 1987, 139-172.

[22] Fisher D. and Pazzani M.: Concept Formation: Knowledge and Experience in Unsupervised Learning. Morgan Kaufmann, San Mateo, CA, US, 1991.

[23] Fisher D.H. and Schlimmer J.C.: Models of Incremental Concept Learning: A coupled research proposal. Computer Science Department, Vanderbilt University, Nashville, TN, USA, 1997. http://cswww.vuse.vanderbilt.edu/~dfisher/

[24] Fix E. and Hodges J.L.: Discriminatory analysis: nonparametric discrimination, consistency properties. Technical Report 4, US Air Force School of Aviation Medicine, TX, US, 1957.

[25] French S.: Decision Theory. Ellis Horwood, Chichester, 1986.

[26] Gibert K.: L'ús de la Informació Simbòlica en l'Automatització del Tractament Estadístic de Dominis poc Estructurats. PhD thesis, Statistics and Operations Research, UPC, Barcelona, Spain, 1994.

[27] Gibert K. and Sonicki Z.: Classification Based on Rules and Thyroids Dysfunctions. Applied Stochastic Models in Business and Industry, 15, 1999, 319-324. John Wiley & Sons.

[28] Gibert K., Aluja T. and Cortes C.U.: Knowledge Discovery with Clustering Based on Rules: Interpreting Results. In Principles of Data Mining and Knowledge Discovery, J.M. Quafafou (ed.), Lecture Notes in Artificial Intelligence 1510, Springer-Verlag, Berlin, 1998, 83-92.

[29] Gibert K. and Cortes C.U.: Clustering based on rules and knowledge discovery in ill-structured domains. Computación y Sistemas, Mexico, 1998.

[30] Gibert K. and Cortes C.U.: Weighing quantitative and qualitative variables in clustering methods. Mathware and Soft Computing, IV(3), 1997, 251-266. Secció de Matemàtiques i Informàtica, Escola Tècnica Superior d'Arquitectura, Universitat Politècnica de Catalunya.

[31] Gibert K. and Cortes C.U.: Combining a knowledge-based system and a clustering method for the construction of models in ill-structured domains. In P. Cheeseman and R.W. Oldford (eds.), Selecting Models from Data: Artificial Intelligence and Statistics IV, Lecture Notes in Statistics 89, Springer-Verlag, New York, NY, US, 1994, 351-360.

[32] Gibert K. and Cortes C.U.: KLASS, una herramienta estadística para la creación de prototipos en dominios poco estructurados. Proc. IBERAMIA-92, Noriega Eds., Mexico, 1992, 483-497.

[33] Hanson S.: Conceptual Clustering and Categorization. In Machine Learning: An Artificial Intelligence Approach, Volume III, Morgan Kaufmann, San Mateo, 1990, 235-268.

[34] Holland J.H., Holyoak K.J., Nisbett R.E. and Thagard P.R.: Induction: Processes of Inference, Learning and Discovery. Computational Models of Cognition and Perception, MIT Press, Cambridge, MA, US, 1986.

[35] Holsheimer M. and Siebes A.P.J.M.: Data Mining: the search for knowledge in databases. Department of Algorithmics and Architecture, Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands. Report CS-R9406, ISSN 0169-118X, 1994. http://www.cwi.nl/static/publications/reports/reports.html


[36] Inselberg A.: The plane with parallel coordinates. The Visual Computer, 1, 1985, 69-91.

[37] Kirkland J.: The NASD Regulation Advanced-Detection System (ADS). AI Magazine, 20(1), Spring 1999, 55-67.

[38] Kononenko I.: Semi-naive Bayesian classifier. In Kodratoff Y. (ed.), Proc. European Working Session on Learning 91, Porto, Springer, 1991, 206-219.

[39] Kononenko I.: Inductive and Bayesian learning in medical diagnosis. Applied Artificial Intelligence, 7, 1993, 317-337.

[40] Krose B. and Van der Smagt P.: An Introduction to Neural Networks. Fifth edn., University of Amsterdam, 1993, 13, 57-73. ftp://ftp.fwi.uva.nl//pub/computer-systems/...neuro-intro.ps.gz

[41] Lavrac N. and Dzeroski S.: Inductive Logic Programming: Techniques and Applications. Ellis Horwood, Chichester, 1994.

[42] Lebart L.: Traitement statistique des données. Dunod, Paris.

[43] Lenat D.B.: Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM, 38(11), 1995.

[44] Lopez de Mantaras R.: A distance-based attribute selection measure for decision tree induction. Machine Learning, 6, 81-92.

[45] Lopez de Mantaras R.: Trends in Knowledge Engineering. Proc. TECCOMP-91, Mexico, 1991, 20-21.

[46] Lopez de Mantaras R. and Crespo J.J.: El problema de la selección de atributos en el aprendizaje inductivo: nueva propuesta y estudio experimental. Proc. IBERAMIA-90, Limusa-Noriega, Mexico, 1990, 259-271.


[47] Lucas P.J.F.: Logic engineering in medicine. The Knowledge Engineering Review, 10(2), Cambridge University Press, 1995, 153-179.

[48] MacQueen J.: Some Methods for Classification and Analysis of Multivariate Observations. Proc. 5th Berkeley Symposium, Berkeley, CA, US, 1965, 281-297.

[49] Michalski R.S.: A theory and methodology of inductive learning. In Michalski et al. (see [50]), 83-134.

[50] Michalski R.S., Carbonell J.G. and Mitchell T.M.: Machine Learning: An Artificial Intelligence Approach, Vol. 2. Morgan Kaufmann, San Mateo, CA, US, 1986.

[51] Michalski R.S. and Kaufman K.A.: Data Mining and Knowledge Discovery: A Review of Issues and a Multistrategy Approach. Chapter 2 of Machine Learning and Data Mining: Methods and Applications, John Wiley and Sons, 1997.

[52] Michalski R. and Stepp R.: Learning from Observation: Conceptual Clustering. In Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Mateo, 1983, 331-363.

[53] Michie D., Spiegelhalter D.J. and Taylor C.C.: Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.

[54] Minsky M.: Steps Toward Artificial Intelligence. Proc. IRE, 49, Jan. 1961, 8-30.

[55] Minsky M.: A Selected Descriptor-Indexed Bibliography to the Literature on Artificial Intelligence. In Feigenbaum and Feldman (eds.), Computers and Thought, McGraw-Hill, New York, NY, 1963, 453-523.

[56] Mitchell T.M.: Machine Learning. McGraw-Hill, 1997. Chapter 1.


[57] Muggleton S.: Inverse entailment and Progol. New Generation Computing, special issue on inductive logic programming, 13(3-4), 1995, 245-286.

[58] Negrete J.: El Nopal de la Inteligencia Artificial. Makatzina: Lecturas en IA, revista aperiódica, Número III, Veracruz, Mexico, 1994. http://www.mia.uv.mx/~jnegrete/pub/NuevosMakatzina/Mk3.html

[59] Newell A., Shaw J.C. and Simon H.: Empirical Explorations with the Logic Theory Machine. Western Joint Computer Conference, 15, 1957, 218-329.

[60] Pearl J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, US, 1988.

[61] Quinlan J.R.: Induction of Decision Trees. Machine Learning, 1, 1986, 81-106.

[62] Quinlan J.R.: Generating production rules from decision trees. In Proceedings of the 10th International Joint Conference on Artificial Intelligence, Milan, Italy, 1987, 304-307.

[63] Quinlan J.R.: Learning logical definitions from relations. Machine Learning, 5(3), 1990, 239-266.

[64] Quinlan J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, US, 1993.

[65] Richeldi M. and Rossotto M.: Class-driven statistical discretization of continuous attributes. In Lavrac N., Wrobel S. (eds.), Machine Learning: Proc. ECML-95, Springer, 1995, 335-342.

[66] Roux M.: Algorithmes de classification. Masson, Paris, France, 1985.

[67] Rumelhart D.E. and McClelland J.L.: Parallel Distributed Processing, Vol. 1: Foundations. MIT Press, Cambridge, MA, US, 1986.


[68] Sokal R.R. and Sneath P.H.A.: Principles of Numerical Taxonomy. W.H. Freeman & Co., San Francisco, CA, US, 1963.

[69] Tirri H., Kontkanen P. and Myllymäki P.: A Bayesian Framework for Case-Based Reasoning. Proceedings of the 3rd European Workshop on Case-Based Reasoning, Switzerland, 1996.

[70] Volle M.: Analyse des données. Economica, Paris, France, 1985.

[71] Weiss S.M. and Kulikowski C.A.: Computer Systems that Learn. Morgan Kaufmann, San Mateo, CA, US, 1991.

[72] Witten I.H. and Frank E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 1999, 265.
