data mining tool reviews march 2011.pdf

Advanced Review

Data mining toolsRalf Mikut∗ and Markus Reischl

The development and application of data mining algorithms requires the use ofpowerful software tools. As the number of available tools continues to grow, thechoice of the most suitable tool becomes increasingly difficult. This paper attemptsto support the decision-making process by discussing the historical developmentand presenting a range of existing state-of-the-art data mining and related tools.Furthermore, we propose criteria for the tool categorization based on differentuser groups, data structures, data mining tasks and methods, visualization andinteraction styles, import and export options for data and models, platforms, andlicense policies. These criteria are then used to classify data mining tools intonine different types. The typical characteristics of these types are explained anda selection of the most important tools is categorized. This paper is organizedas follows: the first section Historical Development and State-of-the-Art highlightsthe historical development of data mining software until present; the criteria tocompare data mining software are explained in the second section Criteria forComparing Data Mining Software. The last section Categorization of Data MiningSoftware into Different Types proposes a categorization of data mining softwareand introduces typical software tools for the different types. C© 2011 John Wiley & Sons,Inc. WIREs Data Mining Knowl Discov 2011 00 1–13 DOI: 10.1002/widm.24

HISTORICAL DEVELOPMENTAND STATE-OF-THE-ART

D ata mining has a long history, with strongroots in statistics, artificial intelligence, machine

learning, and database research.1, 2 The word ‘datamining’ can be found relatively early, as in the articleof Lovell,3 published in the 1980s. Advancements inthis field were accompanied by development of relatedsoftware tools, starting with mainframe programs forstatistical analysis in the early 1950s, and leading toa large variety of stand alone, client/server, and web-based software as today’s service solution.

Following the original definition given in Ref 1,data mining is a step in the knowledge discoveryfrom databases (KDD) process that consists of ap-plying data analysis and discovery algorithms toproduce a particular enumeration of patterns (ormodels) across the data. In that same article, KDD isdefined as the nontrivial process of identifying valid,novel, potentially useful, and ultimately understand-able patterns in data. Sometimes, the wider KDD def-inition is used synonymously for data mining. Thiswider interpretation is especially popular in the con-text of software tools because most such tools sup-

∗Correspondence to: ralf.mikut.kit.edu

Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz1, 76344 Eggenstein-Leopoldshafen, GERMANY

DOI: 10.1002/widm.24

port the complete KDD process and not just a singlestep.

Today, a large number of standard data min-ing methods are available, (see Refs 4 and 5 fordetailed descriptions). From a historical perspective,these methods have different roots. One early groupof methods was adopted from classical statistics: thefocus was changed from the proof of known hypothe-ses to the generation of new hypotheses. Examplesinclude methods from Bayesian decision theory, re-gression theory, and principal component analysis.Another group of methods stemmed from artificial in-telligence - like decision trees, rule-based systems, andothers. The term ‘machine learning’ includes methodssuch as support vector machines and artificial neu-ral networks. There are several different and some-times overlapping categorizations; for example, fuzzylogic, artificial neural networks, and evolutionary al-gorithms, which are summarized as computationalintelligence.6

The typical life cycle of new data mining meth-ods begins with theoretical papers based on in-house software prototypes, followed by public oron-demand software distribution of successful algo-rithms as research prototypes. Then, either specialcommercial or open source packages containing afamily of similar algorithms are developed or the al-gorithms are integrated into existing open source orcommercial packages. Many companies have tried topromote their own stand alone packages, but only

Volume 00, January /February 2011 1c© 2011 John Wi ley & Sons , Inc .

Advanced Review wires.wiley.com/widm

few have reached notable market shares. The life cy-cle of some data mining tools is remarkably short.Typical reasons include internal marketing decisionsand acquisitions of specialized companies by largerones, leading to a renaming and integration of prod-uct lines.

The largest commercial success stories resultedfrom the step-wise integration of data mining methodsinto established commercial statistical tools. Compa-nies such as SPSS, founded in 1975 with precursorsfrom 1968, or SAS, founded in 1976, have been of-fering statistical tools for mainframe computers sincethe 1970s. These tools were later adapted to personalcomputers and client/server solutions for larger cus-tomers. With the increasing popularity of data min-ing, algorithms such as artificial neural networks ordecision trees were integrated in the main productsand specialized data mining companies such as Inte-grated Solutions Ltd. (acquired in 1998 by SPSS) wereacquired to obtain access to data mining tools such asClementine. During these periods, renaming of toolsand company mergers played an important role inhistory; for example, the tool Clementine (SPSS) wasrenamed as PASW Modeler, and is now available asIBM SPSS Modeler after the acquisition of SPSS byIBM in 2009. In general, tools of this statistical branchare now very popular for the user groups in businessapplication and applied research.

Concurrently, many companies offering busi-ness intelligence products have integrated data miningsolutions into their database products; one exampleis Oracle Data Mining (established in 2002). Manyof these products are also a product of the acquisitionand integration of specialized data mining companies.

In 2008, the worldwide market for business in-telligence (i.e., software and maintenance fees) was7.8 billion USD, including 1.5 billion USD in so-called ‘advanced analytics’, containing data miningand statistics.7 This sector has grown 12.1% be-tween 2007 and 2008, with large players includingcompanies such as SAS (33.2%, tool: SAS EnterpriseMiner), SPSS (14.3%, since 2009, an IBM company;tool: IBM SPSS Modeler), Microsoft (1.7%, tool: SQLServer Analysis Services), Teradata (1.5%, tool: Tera-data Database, former name TeraMiner), and TIBCO(1.4%, tool: TIBCO Spotfire).

Open-source libraries have also become verypopular since the 1990s. The most prominent exam-ple is Waikato Environment for Knowledge Analy-sis (WEKA), see Ref 8. WEKA started in 1994 asa C++ library, with its first public release in 1996.In 1999, it was completely rebuilt as a JAVA pack-age; since that time, it has been regularly updated. Inaddition, WEKA components have been integrated

in many other open-source tools such as Pentaho,RapidMiner, and KNIME.

A large group of research prototypes are basedon script-oriented mathematical programs such asMATLAB (commercial) and R (open source). Suchmathematical programs were not originally focusedon data mining, but contain many useful mathemati-cal and visualization functions that support the im-plementation of data mining algorithms. Recently,graphical user interfaces such as those utilized for R(e.g., Rattle) and Matlab (e.g., Gait-CAD, Establishedin 2006) can be used as integration packages (INT)for many single, open-source algorithms.

As the number of available tools continues togrow, the choice of one special tool becomes increas-ingly difficult for each potential user. This decision-making process can be supported by criteria for thecategorization of data mining tools. Different catego-rizations of tools were proposed in Refs 9–12. The lasttwo comprehensive criteria-based surveys date backto 1999, covering 43 software packages in Ref 9, and2003, with 33 tools in Ref 12 (a regularly updatedExcel table is available on request from the same au-thor with 63 tools in 2009). In addition, smaller re-views have been published, containing 12 open-sourcetools,13 eight noncommercial tools,14 nine commer-cial tools,10 and five commercial tools using bench-mark datasets.15

In the past 10–15 years, data mining has be-come a technology in its own right, is well establishedalso in business intelligence (BI), and continues to ex-hibit steadily increasing importance in technology andlife sciences sectors. For example, data mining was akey factor supporting methodological breakthroughsin genetics.16 It is a promising technology for fu-ture fields such as text mining and semantic searchengines,17 learning in autonomous systems—as withhumanoid robots18 and cars, chemoinformatics19 andothers.

Various standardization initiatives have been in-troduced for data mining processes, data and modelinterfaces—as with Cross Industry Standard Pro-cess for Data Mining for industrial data mining,20

and approaches focused on clinical and biologicalapplications.21 A survey of such initiatives is pro-vided in Ref 22, and a large variety of standard datamining methods are described in comprehensive stan-dard text books;4, 5 however, new methods, especiallyfor data streams,23 extremely large datasets, graphmining,24, 25 text mining,17 and others have been pro-posed in the last few years. In the near future, meth-ods for high-dimensional problems such as imageretrieval26 and video mining27 will also be optimizedand embedded into powerful tools.

2 Volume 00, January /February 2011c© 2011 John Wi ley & Sons , Inc .

WIREs Data Mining and Knowledge Discovery Data mining tools

TABLE 1 Maximum Dimensions of Datasets for Different Types of Problems

Data Dim. Structure for Each of the N Examples

Feature table 2 s features (e.g., age and income)Texts 2 frequency of words or n-grams (vector-space approach)Time series 3 s time series with K time samplesSequences 3 s sequences of length L (e.g., mass spectrograms and genes)Images 4 s images with pixelsGraphs 4 s graphs with adjacency matrixes3D images 5 s images with pixels and slicesVideos 5 s videos containing images with pixels and K time samples3D videos 6 like videos, but with additional slices

Dim., maximum dimensionality; s, number of features; N, number of examples; K, number of samples in a time series. Lower dimensionsof the dataset can occur for problems with only one feature s = 1 resp. one example (N = 1).

CRITERIA FOR COMPARING DATAMINING SOFTWARE

SurveyIn the following, different criteria for comparison ofdata mining software are introduced. These criteriaare based on user groups, data structures, data min-ing tasks and methods, import and export options,and license models. A detailed overview about thedifferent tools is given later in this paper and as anExcel table in the additional material; however, somespecific information about tools is discussed if a spe-cific tool is unique to some aspects of the proposedcriteria. The complete list of tools is provided towardthe end of this paper.

User GroupsThere are many different data mining tools available,which fit the needs of quite different user groups:

• Business applications: This group uses datamining as a tool for solving commerciallyrelevant business applications such as cus-tomer relationship management, fraud detec-tion, and so on. This field is mainly covered bya variety of commercial tools providing sup-port for databases with large datasets, anddeep integration in the company’s workflow.

• Applied research: A user group that appliesdata mining to research problems, for ex-ample, technology and life sciences. Here,users are mainly interested in tools with well-proven methods, a graphical user interface(GUI), and interfaces to domain-related dataformats or databases.

• Algorithm development: Develops new datamining algorithms, and requires tools to both

integrate its own methods and compare thesewith existing methods. The necessary toolsshould contain many concurrent algorithms.

• Education: For education at universities, datamining tools should be very intuitive, witha comfortable interactive user interface, andinexpensive. In addition, they should allowthe integration of in-house methods duringprogramming seminars.

Data StructuresAn important criterion is the dimensionality of the un-derlying raw data in the processed dataset (Table 1).The first data mining applications were focused onhandling datasets represented as two-dimensional fea-ture tables. In this classical format, a dataset consistsof a set of N examples (e.g., clients of an insurancecompany) with s features containing real values orusually integer-coded classes or symbols (e.g., income,age, number of contracts, and alike). This format issupported by nearly all existing tools. In some cases,the dataset can be sparse, with only a few nonzerofeatures such as a list of s shopping items for N differ-ent customers. The computational and memory effortcan be reduced if a tool exploits this sparse structure.

Some structured datasets are characterized bythe same dimensionality. As an example, sample doc-uments in most text mining problems are representedby the frequency of words or so-called n-grams (agroup of n subsequent characters in a document).28

The most prominent format having a higher di-mensionality contains time series as elements, leadingto dataset dimensions between one (i.e., only one ex-ample of a time series with K samples) and three (i.e.,N different examples of s-dimensional vector time se-ries with K samples). Typical tasks are forecasting of



future values, finding typical patterns in a time se-ries or finding similar time series by clustering. Theanalysis of time series plays an import role in manydifferent applications, including prediction of stockmarkets, forecasting of energy consumption and othermarkets, and quality supervision in production, andis also supported by most data mining tools.

With a similar dimensionality, different kinds ofstructured data exist such as gene sequences (spatialstructure), spectrograms or mass spectrograms (struc-tured by frequencies or masses), and others. Only afew tools support these types of structured data ex-plicitly, but some tools for time series analysis can berearranged to cope with these problems.

A more recent trend is the application of datamining methods for images and videos.26, 27 The mainchallenge is the handling of extremely large rawdatasets, up to gigabytes and terabytes, caused by thehigh dimensionality of the examples. Typical applica-tions are microscopic images in biology and medicine,camera-based sensors in quality control and robotics,biometrics, and security. Such datasets must be splitinto metadata—with links to image and video fileshandled in a main dataset and files—which containthe main part of the data. Until now, these problemswere normally solved using a combination of tools:the initial tool (e.g., ImageJ and ITK) would pro-cess the images or videos, resulting in segmented im-ages and extracted features describing the segments;a second tool would solve data mining problems han-dling the extracted features as a classical table or timeseries.

Another format leading to image-like dimen-sions includes graphs that can be represented asadjacency matrices, describing the connection be-tween different nodes of a graph. Graph mininghas powerful applications,24, 25 such as characteriz-ing social networks and chemical structures; however,only a few such tools exist, including Pegasus andProximity.

Tasks and MethodsThe most important tasks in data mining are

• supervised learning, with a known outputvariable in the dataset, including

(a) classification: class prediction, withthe variable typically coded as an in-teger output;

(b) fuzzy classification: with gradualmemberships with values in-between0 and 1 applied to the differentclasses;

(c) regression: prediction of a real-valuedoutput variable, including specialcases of predicting future values ina time series out of recent or pastvalues;

• unsupervised learning, without a known out-put variable in the dataset, including

(a) clustering: finds and describes groupsof similar examples in the data usingcrisp of fuzzy clustering algorithms;

(b) association learning: finds typicalgroups of items that occur frequentlytogether in examples;

• semisupervised learning, whereby the outputvariable is known only for some examples.

Each of these tasks consists of a chain of low-level tasks. Furthermore, some low-level tasks can actas stand-alone tasks; for example, by identifying ina large dataset elements that possess a high similar-ity to a given example. Examples of such low-leveltasks are:

• data cleaning (e.g., outlier detection);

• data filtering (e.g., smoothing of time series);

• feature extraction from time series, images,videos, and graphs (e.g., consisting of seg-mentation and segment description for im-ages, characteristic values such as communitystructures in graphs);

• feature transformation (e.g., mathematicaloperations, including logarithms, dimensionreduction by linear or nonlinear combina-tions by a principal component analysis,factor analysis or independent componentanalysis);

• feature evaluation and selection (e.g., by filteror wrapper methods);

• computation of similarities and detection ofthe most similar elements in terms of exam-ples or features (e.g., by k-nearest-neighbor-methods and correlation analysis);

• model validation (cross validation, bootstrap-ping, statistical relevance tests and complexitymeasures);

• model fusion (mixture of experts); and

• model optimization (e.g., by evolutionary al-gorithms).

For almost all of these tasks, a large varietyof classical statistical methods—including classifiers



using estimated probability density functions, fac-tor analysis and others, and newer machine learn-ing methods—such as artificial neural networks, fuzzymodels, rough sets, support vector machines, decisiontrees, and random forests, are available. In addition,optimization models such as evolutionary algorithmscan assist with the identification of model structuresand parameters. The related methods are described insurvey articles29 or textbooks4, 5 and are not summa-rized in this paper.

Not all of the data mining methods are availablein all software tools. The following list contains a sub-jective evaluation of the frequency with which specificmethods are incorporated in the different tools:

• Frequent: classifiers using estimated probabil-ity density functions, correlation analysis, sta-tistical feature selection, and relevance tests;

• In many tools: decision trees, clustering, re-gression, data cleaning, data filtering, featureextraction, principal component analysis, fac-tor analysis, advanced feature evaluation andselection, computation of similarities, artifi-cial neural networks, model cross validation,and statistical relevance tests;

• In some tools: fuzzy classification, associa-tion learning and mining frequent item sets,independent component analysis, bootstrap-ping, complexity measures, model fusion,support vector machines, k-nearest-neighbor-methods, Bayesian networks, and learning ofcrisp rules;

• Infrequent: random forests30 (contained inWaffles, Random Forests, WEKA, and allof its derivatives), learning of fuzzy systems(contained in KnowledgeMiner, See5, andGait-CAD), rough sets31 (in ROSETTA, andRseslibs), and model optimization by evolu-tionary algorithms14 (in KEEL, ADaM, andD2K).

Interaction and VisualizationThere are three main types of interaction between auser and a data mining tool:

• pure textual interface using a programminglanguage—difficult to handle, but easily au-tomated;

• graphical interface with a menu structure—easy to handle, but not so easily automated;and

• graphical user interface where the user selects‘function blocks’ or algorithms from a paletteof choices, defines parameters, places themin a work area, and connects them to createcomplete data mining streams or workflows;a good compromise, but difficult to handle forlarge workflows.

Mixtures of these forms arise if macros of menuitems can be recorded for workflows or if additionalblocks in a workflow can be implemented using aprogramming language. Automation (scripting) is ex-tremely important for routine tasks, especially withlarge datasets, because the workload of the user isreduced. Almost all tools provide powerful visualiza-tion techniques for the presentation of data miningresults; particularly tools for business application andapplied research, which are able to generate completereports containing the most important results in areadable form for users lacking explicit data miningskills. Interactive methods can support an explorativedata analysis. An example is a method called brush-ing that enables the user to select specific data pointsin a figure or subsets of data (e.g., nodes of a decisiontree) and highlight these data points in other plots.

Import and Export of Data and ModelsThe ease with which data and models can be importedand exported among different software tools plays acrucial role in the functionality of data mining tools.First, the data are normally generated and hostedfrom different sources such as databases or softwareassociated with measurement devices. In business ap-plications, interfaces to databases such as Oracle orany database supporting the Structured Query Lan-guage (SQL) standard are the most common meansof importing data. Because almost all other nondatamining tools support export as text or excel files,formats such as CSV (comma separated values) arefrequently used to import formats with data miningtools. In addition, almost all software have propri-etary binary or textual files, and exchanges formatsfor data and models, e.g., Attribute-Relation File For-mat in WEKA (WEKA standard).

In order to import and export developed mod-els as components in other processes and systems,the XML-based standard PMML32 was developed bythe Data Mining Group (http://www.dmg.org) andis supported by many companies such as IBM andSAS. Another standard initiative is the Object Link-ing and Embedding Database (OLEDB, sometimeswritten as OLEDB or OLE-DB) for data mining, anAPI (Application Programming Interface) designed



by Microsoft to access different types of data storedin a uniform manner (http://msdn.microsoft.com/en-us/library/ms146608.aspx). OLEDB is a set of in-terfaces implemented using the Component ObjectModel (COM). For data exchange among differ-ent tools, another initiative deals with Java Specifi-cation Requests for data mining: versions 1.0 (JSR73, final release in 2004: http://www.jcp.org/en/jsr/detail?id=73) and 2.0 (JSR 247, public review aslast activity in 2006: http://www.jcp.org/en/jsr/detail?id=247) define an extensible Java API for data miningsystems. The consortium includes many related com-panies, such as Oracle, SAS, SPSS (now IBM), SAP,and others; recent overviews can be found in Refs 33and 34. Another interesting feature is the export ofan executable runtime version of developed models.Often, they do not require a more expensive develop-ment license and can be run free of charge, or at leastwith a cheaper runtime license.

PlatformsData mining tools can be subdivided into stand-alone and client/server solutions. Client/server solu-tions dominate, especially in products designed forbusiness users. They are available for different plat-forms, including Windows, MAC OS, Linux, or spe-cial mainframe supercomputers. There is a growingnumber of JAVA-based systems that are platform-independent for users in research and appliedresearch.

Further expected trends are an increasing num-ber of web interfaces providing data mining as SAAS(software as a service, with tools like Data Applied)and a stronger support of client/server-based datamining solutions on grids (tool ADaM, e.g., see, stepsto a standardization in Ref 35); however, both trendshave the potential risk of hurting privacy policies be-cause the protection of data is difficult and many com-panies are very careful with sensitive data.

LicensesThere exists a wide variety of data mining tools withcommercial and open-source licenses. This is partic-ularly true in the business application user group,where commercial software is very attractive dueto high software stability, good coupling with othercommercial tools for data warehouses, included soft-ware maintenance, and the possibility of user train-ing for sophisticated topics. For all other user groups,there is a strong trend toward open-source software,but different types of licenses exist for this (e.g., see,survey in Ref 36). The main advantages of open-

source software are faster bug fixes and method-ological improvements, potential for integration withother tools, the existence of developer and user com-munities, faster adoption of methods to other inno-vative applications, and the fair comparison of newdata mining algorithms with alternative ones. Theseadvantages attract mainly users of applied research,development, and education; however, open-sourcetools are beginning to migrate even into business usergroups,37 particularly when additional commercialservices such as training or maintenance are offered(e.g., Pentaho).

The most popular type of open-source licenses isthe GNU General Public License of the Free SoftwareFoundation (GNU-GPL or GPL: http://www.fsf.org).It permits free redistribution, integration in otherpackages, and modification of the software as longas all subsequent users receive the same level of free-dom (so-called ‘copy left’). This restriction guaranteesthat all software containing GNU-GPL componentsmust be licensed under GNU-GPL. Weaker forms arelicenses that are free for academic use, but not forbusiness users.

Mixed forms of licenses occur especially if open-source software is used to expand commercial toolssuch as Matlab.

The Excel table (see, Section Supplementary In-formation) lists 195 recent tools (119 commercialtools, 67 open source tools, and nine tools with mixedlicense models).

CATEGORIZATION OF DATA MININGSOFTWARE INTO DIFFERENT TYPES

Following the criteria from the previous section, dif-ferent types of similar data mining tools can be found.The typical characteristics of these types are explainedin this section. Matching of the different types anduser groups and the number of recent tools are sum-marized in Table 2. In addition, for commercial datamining tools, related tools and their group member-ship are summarized in different tables for commer-cial (Tables 3 and 4), free, and open-source data min-ing tools (Table 5). In these tables, very popular toolsare marked in bold. The popularity was measured by

• the 20 most frequently used tools for realprojects from ‘Data Mining/Analytic ToolsUsed Poll 2010’ of KDnuggets with 912 voters(http://www.kdnuggets.com/polls/2010/data-mining-analytics-tools.html); [top 10 toolswere RapidMiner, R, Excel (here ignored),KNIME, Pentaho/WEKA, SAS, MATLAB,



TAB

LE2

Mat

chin

gBe

twee

nDi

ffere

ntU

ser

Gro

ups

and

Tool

Type

sw

ithN

umbe

rof

Rece

ntTo

ols

inth

eEx

celT

able

(see

,Sec

tion

Supp

lem

enta

ryIn

form

atio

n,to

ols

belo

ngin

gto

two

cate

gorie

sar

eco

unte

dtw

ice)

Type

sDa

taM

inin

gSu

ites

Busi

ness

Inte

llige

nce

Pack

ages

Mat

hem

atic

alPa

ckag

esIn

tegr

atio

nPa

ckag

esEx

tens

ions

Data

Min

ing

Libr

arie

sSp

ecia

litie

sRe

sear

chPr

otot

ypes

Solu

tions

Num

bero

fRec

entT

ools

4616

58

1020

5617

19

Busi

ness

appl

icat

ions

++

−0

0−

0−

0Ap

plie

dre

sear

ch+

−+

+0

00

0+

Algo

rithm

deve

lopm

ent

−−

++

−+

0+

−Ed

ucat

ion

+−

00

+−

+−

0

Eval

uatio

n,+:

espe

cial

lyus

eful

,0:l

ess

usef

ul,−

:not

usef

ul.

IBM SPSS Statistics, IBM SPSS Modeler, andMicrosoft SQL Server];

• all main products of vendors with more than1% market share in the section ‘AdvancedAnalytics Tools’ from Ref 7; and

• the most popular image processing tools (ITKand ImageJ) from the author’s own experi-ence to cover this field.

In this paper, the following nine types are pro-posed:

• Data mining suites (DMS) focus largely ondata mining and include numerous meth-ods. They support feature tables and time se-ries, while additional tools for text miningare sometime available. The application fo-cus is wide and not restricted to a specialapplication field, such as business applica-tions; however, coupling to business solu-tions, import and export of models, report-ing, and a variety of different platforms arenonetheless supported. In addition, the pro-ducers provide services for adaptation of thetools to the workflows and data structures ofthe customer. DMS is mostly commercial andrather expensive, but some open-source toolssuch as RapidMiner exist. Typical examplesinclude IBM SPSS Modeler, SAS EnterpriseMiner, Alice d’Isoft, DataEngine, DataDetec-tive, GhostMiner, Knowledge Studio, KXEN,NAG Data Mining Components, PartekDiscovery Suite, STATISTICA, and TIBCOSpotfire.

• Business intelligence packages (BIs) have nospecial focus to data mining, but include basicdata mining functionality, especially for sta-tistical methods in business applications. BIsare often restricted to feature tables and timeseries, large feature tables are supported. Theyhave a highly developed reporting function-ality and good support for education, han-dling, and adaptation to the workflows of thecustomer. They are characterized by a strongfocus on database coupling, and are imple-mented via a client/server architecture. MostBI softwares are commercial (IBM Cognos 8BI, Oracle Data Mining, SAP Netweaver Busi-ness Warehouse, Teradata Database, DB2Data Warehouse from IBM, and PolyVista),but a few open-source solutions exist (Pen-taho).

• Mathematical packages (MATs) have nospecial focus on data mining, but provide a



TABLE 3 List of Commercial Tools (Part 1)

Tool Type Link

ADAPA (Zementis) DMS www.zementis.comAlice (d’Isoft) DMS www.alice-soft.comBayesia Lab SPEC www.bayesia.comC5.0 SPEC www.rulequest.comCART SPEC www.salford-systems.comData Applied DMS data-applied.comDataDetective DMS www.sentient.nl/?ddenDataEngine DMS www.dataengine.deDatascope DMS www.cygron.huDB2 Data Warehouse BI www.ibm.com/software/data/infosphere/warehouseDeltaMaster BI www.bissantz.com/deltamasterForecaster XL EXT www.alyuda.comGhostMiner DMS www.fqs.pl/business intelligence/products/ghostminerIBM Cognos 8 BI BI www.ibm.com/software/data/cognos/data-mining-tools.htmlIBM SPSS Modeler DMS www.spss.com/software/modeling/modelerIBM SPSS Statistics MAT www.spss.com/software/statisticsiModel DMS www.biocompsystems.com/products/imodelInfoSphere Warehouse BI www.ibm.com/software/data/infosphere/warehouseJMP DMS www.jmpdiscovery.comKnowledgeMiner SPEC www.knowledgeminer.netKnowledgeStudio DMS www.angoss.comKXEN DMS www.kxen.comMagnum Opus SPEC www.giwebb.comMATLAB MAT www.mathworks.comMATLAB Neural Network Toolbox EXT www.mathworks.comModel Builder DMS www.fico.comModelMAX SOL www.asacorp.com/products/mmxover.jsp

Very popular tools are marked in bold letters.

large and extendable set of algorithms andvisualization routines. They support featuretables, time series, and have at least importformats for images. The user interaction of-ten requires programming skills in a scriptinglanguage. MATs are attractive to usersin algorithm development and applied re-search because data mining algorithms canbe rapidly implemented, mostly in the formof extensions (EXT) and research prototypes(RES). MAT packages exist as commercial(MATLAB and R-PLUS) or open-source tools(R, Kepler). In principle, table calculationsoftware such as Excel may also be catego-rized here, but it is not included in this pa-per. Most tools are available for differentplatforms but have weaknesses in databasecoupling.

• Integration packages (INTs) are extendablebundles of many different open-source algo-rithms, either as stand-alone software (mostly

based on Java; as KNIME, the GUI-version ofWEKA, KEEL, and TANAGRA) or as a kindof larger extension package for tools fromthe MAT type (such as Gait-CAD, PRToolsfor MATLAB, and RWEKA for R). Importand export support standard formats, butdatabase support is quite weak. Most toolsare available for different platforms and in-clude a GUI. Mixtures of license models oc-cur if open-source integration packages arebased on commercial tools from the MATtype. With these characteristics, the tools areattractive to algorithm developers and usersin applied research due to expandability andrapid comparison with alternative tools, anddue to easy integration of application-specificmethods and import options.

• EXT are smaller add-ons for other tools suchas Excel, Matlab, R, and so forth, with limitedbut quite useful functionality. Here, only afew data mining algorithms are implemented



TABLE 4 List of Commercial Tools (Part 2)

Tool Type Link

Molegro Data Modeler SOL www.molegro.comNAG Data Mining Components LIB www.nag.co.uk/numeric/DR/DRdescription.aspNeuralWorks Predict SPEC www.neuralware.com/products.jspNeurofusion LIB www.alyuda.comNeuroshell SPEC www.neuroshell.comOracle Data Mining (ODM) DMS www.oracle.com/technology/products/bi/odm/index.htmlPartek Discovery Suite DMS www.partek.com/softwarePartek Genomics Suite SOL www.partek.com/softwarePolyAnalyst DMS www.megaputer.com/polyanalyst.phpPolyVista BI www.polyvista.comRandom Forests SPEC www.salford-systems.comRapAnalyst SPEC www.raptorinternational.com/rapanalyst.htmlR-PLUS MAT www.experience-rplus.comSAP Netweaver Business Warehouse (BW) BI www.sap.com/platform/netweaver/components/businesswarehouseSAS Enterprise Miner DMS www.sas.com/products/minerSee5 SPEC www.rulequest.comSPAD Data Mining DMS eng.spadsoft.comSQL Server Analysis Services DMS www.microsoft.com/sqlSTATISTICA DMS www.statsoft.com/products/data-mining-solutions/G259SuperQuery DMS www.azmy.comTeradata Database BI www.teradata.comThink Enterprise Data Miner (EDM) DMS www.thinkanalytics.comTIBCO Spotfire DMS spotfire.tibco.comUnica PredictiveInsight DMS www.unica.comWizRule and WizWhy SPEC www.wizsoft.comXAffinity SPEC www.exclusiveore.com

Very popular tools are marked in bold letters.

such as artificial neural networks for Excel(Forecaster XL and XLMiner) or MATLAB(Matlab Neural Networks Toolbox). Thereare commercial or open-source versions, butlicenses for the basic tools must also be avail-able. The user interaction is the same as for thebasic tool, for example, by using a program-ming language (MATLAB) or by embeddingthe extension in the menu (Excel).

• Data mining libraries (LIBs) implement datamining methods as a bundle of functions.These functions can be embedded in othersoftware tools using an Application Program-ming Interface (API) for the interaction be-tween the software tool and the data miningfunctions. A graphical user interface is miss-ing, but some functions can support the in-tegration of specific visualization tools. Theyare often written in JAVA or C++ and thesolutions are platform independent. Opensource examples are WEKA (Java-based),MLC++ (C++ based), JAVA Data Min-

ing Package, and LibSVM (C++ and JAVA-based) for support vector machines. A com-mercial example is Neurofusion for C++,whereas XELOPES (Java, C++, and C�) usesdifferent license models. LIB tools are mainlyattractive to users in algorithm developmentand applied research, for embedding datamining software into larger data mining soft-ware tools or specific solutions for narrowapplications.

• Specialties (SPECs) are similar to DMS tools,but implement only one special family ofmethods such as artificial neural networks.They contain many elaborate visualizationtechniques for such methods. SPECs arerather simple to handle as compared withother tools, which eases the use of such toolsin education. Examples are CART for deci-sion trees, Bayesia Lab for Bayesian networks,C5.0, WizRule, Rule Discovery System forrule-based systems, MagnumOpus for asso-ciation analysis, and JavaNNS, Neuroshell,



TABLE 5 List of Free and Open-Source Tools

Tool Type Link

ADaM∗ LIB datamining.itsc.uah.edu/adamCellProfilerAnalyst SOL www.cellprofiler.org/index.htmD2K∗ DMS alg.ncsa.uiuc.eduGait-CAD INT sourceforge.net/projects/gait-cadGATE SOL gate.ac.uk/downloadGIFT RES www.gnu.org/software/giftGnome Data Mine Tools DMS www.togaware.com/datamining/gdatamineHimalaya RES himalaya-tools.sourceforge.netImageJ SOL rsbweb.nih.gov/ijITK SOL www.itk.orgJAVA Data Mining Package LIB sourceforge.net/projects/jdmpJavaNNS SPEC www.ra.cs.uni-tuebingen.de/software/JavaNNS/welcome e.htmlKEEL INT www.keel.esKepler MAT kepler-project.orgKNIME INT www.knime.orgLibSVM LIB www.csie.ntu.edu.tw/ cjlin/libsvmMEGA SOL www.megasoftware.net/m distance.htmlMLC++ LIB www.sgi.com/tech/mlcOrange LIB www.ailab.si/orangePegasus RES www.cs.cmu.edu/ pegasusPentaho BI sourceforge.net/projects/pentahoProximity SPEC kdl.cs.umass.edu/proximity/index.htmlPRTools EXT www.prtools.orgR MAT www.r-project.orgRapidMiner DMS www.rapidminer.comRattle INT rattle.togaware.comROOT LIB root.cern.ch/rootROSETTA SPEC www.lcb.uu.se/tools/rosetta/index.phpRseslibs RES logic.mimuw.edu.pl/ rsesRule Discovery System∗ SPEC www.compumine.comRWEKA INT cran.r-project.org/web/packages/RWeka/index.htmlTANAGRA INT eric.univ-lyon2.fr/ ricco/tanagra/en/tanagra.htmlWaffles LIB waffles.sourceforge.netWEKA DMS, LIB sourceforge.net/projects/wekaXELOPES Library∗ LIB www.prudsys.de/en/technology/xelopesXLMiner∗ EXT www.resample.com/xlminer

Very popular tools are marked in bold letters.∗, Commercial tools with free licenses for academic use.

NeuralWorks Predict, RapAnalyst for artifi-cial neural networks.

• RES are usually the first—and not alwaysstable—implementations of new and innova-tive algorithms. They contain only one or afew algorithms with restricted graphical sup-port and without automation support. Importand export functionality is rather restrictedand database coupling is missing or weak.RES tools are mostly opensource. They aremainly attractive to users in algorithm devel-opment and applied research, specifically in

very innovative fields. Examples are GIFT forcontent-based image retrieval, Himalaya formining maximal frequent item sets, sequentialpattern mining and scalable linear regressiontrees, Rseslibs for rough sets, and Pegasus forgraph mining. Early versions of today’s pop-ular tools such as WEKA and RapidMinerstarted in this category and shifted later toother categories as DMS.

• Solutions (SOLs) describe a group of toolsthat are customized to narrow applicationfields such as text mining (GATE), image



processing (ITK, ImageJ), drug discovery(Molegro Data Modeler), image analysis inmicroscopy (CellProfilerAnalyst), or mininggene expression profiles (Partek GenomicsSuite, MEGA). The advantage of these so-lutions is the excellent support of domain-specific feature extraction techniques, eval-uation measures, visualizations, and importformats. The level of data mining methodsranges from rather weak support (particularlyin image processing) to highly developed al-gorithms. In some cases, more general toolsfrom types DMS or INT also support spe-cific domains (KNIME, Gait-CAD for peptidechemoinformatics). There are many commer-cial and open-source solutions.

A large variety of tools actually requires a fuzzy cat-egorization with gradual memberships to differenttypes. Examples are tools including a set of differ-ent algorithms (LIB) with an additional GUI acting asan INT, DMS, including special methods for narrowapplication fields and others. In these cases, a maintype was assigned and the other fuzzy membershipsare discussed in the Excel table in the additional ma-terial section.

The following kinds of tools were not includedin the comparison:

• nonavailable software (e.g., owing to com-pany mergers or stopped developments) isonly listed in the Excel table in the additionalmaterial,

• software for the handling of data warehouseswithout explicit focus on data mining,

• software for the manual design and applica-tion of rule-based systems,

• software for table calculation with a focus tooffice users, and

• customized solutions for very narrowfields.

CONCLUSION

Many advanced tools for data mining are availableeither as open-source or commercial software. Theycover a wide range of software products, from com-fortable problem-independent data mining suites, tobusiness-centered data warehouses with integrateddata mining capabilities, to early research prototypesfor newly developed methods. In this paper, nine dif-

ferent types of tools are presented: DMS, BIs, MATs,INT, EXT, SPECs, RES, LIBs, and SOLs. They vary inmany different characteristics, such as intended usergroups, possible data structures, implemented tasksand methods, interaction styles, import and exportcapabilities, platforms and license policies are vari-able. Recent tools are able to handle large datasetswith single features, time series, and even unstruc-tured data-like texts; however, there is a lack of pow-erful and generalized mining tools for multidimen-sional datasets such as images and videos.

SUPPLEMENTARY INFORMATION

An additional Excel table contains a list of 269 tools(195 recent and 74 historical tools, version from July22, 2010). For each tool, the following informationis available:

• toolbox name,

• company or group (with the term ‘various’for open-source projects without an explicitdeveloper),

• categorization into types with abbreviationsfor Research Prototypes (RES), Data Min-ing Libraries (LIB), Business Intelligence Pack-ages (BI), Data Mining Software (DMS),Specialties (SPEC), Mathematical Packages(MAT), Extensions (EXT), Integration Pack-ages (INT), Solutions (SOL),

• Giraud-Carrier: marking the covering by theExcel table in Ref 12 (Stand: February 3,2010) with the values 1 (included in a de-tailed categorization), −1 (excluded), emptyfield: not mentioned,

• remarks,

• web link,

• activity: 1 (relevant tool, included in the com-parison), 0 (less relevant), −1 (not available).

• license: OS, open source; CO, commercial;CO/OS, different versions available.

There are a number of regularly updated web re-sources with link lists, but lacking a criteria-basedcomparison of the tools. The most important web re-sources are:

• KDnuggets: http://www.kdnuggets.com/software/suites.html, including regular pollsto identify the most frequently used tools,



• The Data Mine: http://www.the-data-mine.com/bin/view/software

• The Open Directory Project: http://www.dmoz.org/Computers/Software/Databases/Data Mining

• Sourceforge (very popular platform for open-source solutions, search for ‘data mining’ to

find data mining tools hosted at Sourceforge):http://sourceforge.net/

• Kernel Machines (especially to get a listof software to support vector machines):http://www.kernel-machines.org/software

• Tools for Bayesian Networks: www.cs.helsinki.fi/research/cosco/Bnets.

ACKNOWLEDGEMENTS

The authors thank C. Giraud-Carrier for a copy of an Excel table containing a large set of datamining tools, the anonymous reviewers for many comments and suggestions, and R. A. Kladyfor the critical proofreading of the manuscript.

REFERENCES

1. Fayyad U, Piatetsky-Shapiro G, Smyth P. From datamining to knowledge discovery in databases. AI Mag1996, 17:37–54.

2. Smyth P. Data mining: Data analysis on a grand scale?Stat Methods Med Res 2000, 9:309–327.

3. Lovell MC. Data mining. Rev Econ Stat 1983, 65:1–11.

4. Han J, Kamber M. Data Mining: Concepts and Tech-niques. San Francisco: Morgan Kaufmann; 2006.

5. Hastie T, Tibshirani R, Friedman J. The Elements ofStatistical Learning: Data Mining, Inference, and Pre-diction. New York: Springer; 2008.

6. Engelbrecht AP. Computational Intelligence - An In-troduction. Chichester: John Wiley; 2007.

7. Vesset D, McDonough B. Worldwide business intel-ligence tools 2008 vendor shares, IDC CompetitiveAnalysis Report (2009).

8. Frank E, Hall M, Holmes G, Kirkby R, Pfahringer B,Witten I. Weka: A machine learning workbench fordata mining. Data Mining and Knowledge DiscoveryHandbook: A Complete Guide for Practitioners andResearchers. New York: Springer; 2005, 1305–1314.

9. Goebel M. A survey of data mining and knowledgediscovery software tools, ACM SIGKDD Explorations.Newsletter 1999, 1:20–33.

10. Wang J, Hu X, Hollister K, Zhu D. A comparison andscenario analysis of leading data mining software. IntJ Knowl Manage 2008, 4:17–34.

11. Wang J, Chen Q, Yao J. Data mining software. In:Tomei L, ed., Encyclopedia of Information Technol-ogy Curriculum Integration. Hershey, PA: InformationScience Publishing; 2008, 173–178.

12. Giraud-Carrier C, Povel O. Characterising data miningsoftware. Intell Data Anal 2003, 7:181–192.

13. Chen X, Ye Y, Williams G, Xu X. A survey of opensource data mining systems, Lecture Notes in Com-puter Science 2007, 4819:3–14.

14. Alcala-Fdez J, Sanchez L, Garcıa S, del Jesus M,Ventura S, Garrell J, Otero J, Romero C, Bacardit J,Rivas V, et al. KEEL: A software tool to assess evo-lutionary algorithms for data mining problems. SoftComput 2009, 13:307–318.

15. Haughton D, Deichmann J, Eshghi A, Sayek S, TeebagyN, Topi H. A review of software packages for datamining. Am Stat 2003, 57:290–310.

16. Barrett T, Troup D, Wilhite S, Ledoux P, Rudnev D,Evangelista C, Kim I, Soboleva A, Tomashevsky M,Edgar R. NCBI GEO: Mining tens of millions of ex-pression profiles–database and tools update. Nucleicacids Res 2007, D760.

17. Weiss S. Text mining: predictive methods for analyzingunstructured information. New York: Springer-Verlag;2005.

18. Dillmann R. Teaching and learning of robot tasks viaobservation of human performance. Rob Auton Syst2004, 47:109–116.

19. Leach A, Gillet V. An Introduction to Chemoinformat-ics. Springer; 2007.

20. Shearer C. The CRISP-DM model: The new blueprintfor data mining. J Data Warehousing 2000, 5:13–22.

21. Mikut R, Reischl M, Burmeister O, Loose T. Data min-ing in medical time series. Biomed Tech 2006, 51:288–293.



22. Grossman R, Hornick M, Meyer G. Data mining stan-dards initiatives. Commun ACM 2002, 45:61.

23. Muthukrishnan S. Data Streams: Algorithms and Ap-plications. Hanover, MA: Now Publishers Inc.; 2005.

24. Chakrabarti D, Faloutsos C. Graph mining: laws, gen-erators, and algorithms. ACM Comput Surv (CSUR)2006, 38:1–69.

25. Borgelt C. Graph mining: An overview. Proc., 19.Workshop Computational Intelligence. Karlsruhe,Germany: KIT Scientific Publishing; 2009, 189–203.

26. Datta R, Joshi D, Li J, Wang J. Image retrieval: Ideas,influences, and trends of the new age. ACM ComputSurv (CSUR) 2008, 40:1–60.

27. Zhu X, Wu X, Elmagarmid A, Feng Z, Wu L. Videodata mining: Semantic indexing and event detectionfrom the association perspective. IEEE Trans KnowlData Eng 2005, 17:665–677.

28. Damashek M. Gauging similarity with n-Grams:Language-independent categorization of text. Science1995, 267:843–848.

29. Jain AK, Duin RPW, Mao J. Statistical pattern recog-nition: A review. IEEE Trans Pattern Anal Mach Intell2000, 22:4–36.

30. Breiman L. Random forests. Mach Learn 2001, 45:5–32.

31. Pawlak Z. Rough sets and intelligent data analysis. InfSci 2002, 147:1–12.

32. Pechter R. What’s PMML and what’s new in PMML4.0?, ACM SIGKDD Explorations. Newsletter 2009,11:19–25.

33. Hornick M, Marcade E, Venkayala S. Java DataMining: Strategy, Standard, and Practice: A Practi-cal Guide for Architecture, Design, and Implementa-tion. San Francisco: Morgan Kaufmann Publishers Inc.;2006.

34. Anand S, Grobelnik M, Herrmann F, Hornick M,Lingenfelder C, Rooney N, Wettschereck D. Knowl-edge discovery standards. Artificial Intelligence Review2007, 27:21–56.

35. Cannataro M, Congiusta A, Pugliese A, Talia D,Trunfio P. Distributed data mining on grids: Services,tools, and applications. IEEE Trans Syst Man CybernB Cybern 2004, 34:2451–2465.

36. Sonnenburg S, Braun M, Ong C, Bengio S, Bottou L,Holmes G, LeCun Y, Muller K, Pereira F, RasmussenC, et al. The need for open source software in machinelearning. J Mach Learn Res 2007, 8:2443–2466.

37. Bitterer A. Open-source business intelligence tool pro-duction deployments will grow five-fold through 2010,Gartner RAS Research Note G00171189 (2009).


data mining tool reviews march 2011.pdf

Documents