d2k-tutorial

Loretta Auvil

Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
217. [email protected]

Supercomputing 2003

D2K Tutorial

alg | Automated Learning Group

Outline

• Overview of D2K Functionality

• Hands-On Exercise: Predictive Modeling
  • Classification
    – Using Naïve Bayesian
    – Using Decision Trees

• Hands-On Exercise: Discovery
  • Rule Association
    – Using SQL Htree
  • Clustering

• Deviation Detection
  • Visualization
    – Parallel Coordinates
    – Small Multiples of scatterplots


Goals

• Understanding the Knowledge Discovery in Databases Process

• Gaining Knowledge of Basic Data Mining Operations and Techniques

• Understanding the Role of the Knowledge Discovery Framework

• Key Issues in Utilization of D2K Framework

• Understanding the Role of Information Visualization in Data Mining


What is It?

Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

• The understandable patterns are used to:
  • Make predictions about or classifications of new data
  • Explain existing data
  • Summarize the contents of a large database to support decision making
  • Create graphical data visualization to aid humans in discovering complex patterns

Overview of Knowledge Discovery


Knowledge Discovery Process



Required Effort for each KDD Step

[Figure: bar chart of effort (%), on a scale of 0 to 60, for each KDD step: Objectives Determination, Data Preparation, Data Mining, Interpretation/Evaluation. Arrows indicate the direction we want the effort to go.]


Three Primary Paradigms

• Predictive Modeling – supervised learning approach where classification or prediction of one of the attributes is desired
  • Classification is the prediction of predefined classes
    – e.g. Naïve Bayesian, Decision Trees, and Neural Networks
  • Regression is the prediction of continuous data
    – e.g. Neural Networks, and Decision (Regression) Trees

• Discovery – unsupervised learning approach for exploratory data analysis
  • e.g. Association Rules, Link Analysis, Clustering, and Self-Organizing Maps

• Deviation Detection – identifying outliers in the data
  • e.g. Visualization



Importance of Data Mining Framework

• Provides capability to build custom applications

• Provides access to data management tools
  • Loading data from database, flat file or DataSpaces

• Contains data mining algorithms for prediction and discovery that can be applied

• Provides data transformations for standard operations

• Supports an extensible interface for creating one’s own algorithms

• Provides means for building and applying models

• Provides integrated visualization components

• Provides access to distributed computing capabilities


D2K - Data To Knowledge

D2K is a flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization

D2K Overview


D2K and Its Many Components

• D2K Infrastructure: D2K API, data flow environment, distributed computing framework, and runtime system

• D2K Modules: Computational units written in Java that follow the D2K API

• D2K Itineraries: Modules that are connected to form an application

• D2K Toolkit: User interface for specification of itineraries and execution that provides the rapid application development environment

• D2K-Driven Applications: Applications that use D2K modules, but do not need to run in the D2K Toolkit



D2K Toolkit

Major features that D2K provides to an application developer include:

• Visual programming system employing a data flow paradigm

• Scalable distributed computing capabilities

• Flexible and extensible software development environment

• Multi-layered learning strategies

• Integrated environment for models and visualization

• Capability to access data transparently from multiple sources



D2K Basic 1.0

• New release of D2K 3.0
• New release of the D2K Toolkit
• New release of a set of D2K Modules to perform data mining techniques
  • Prediction
    – Decision Trees: C4.5 Decision Tree, Continuous Decision Tree, SQL Rain Forest Decision Tree
    – Naïve Bayesian Classification and SQL Naïve Bayesian Classification
    – Neural Networks
  • Discovery
    – Rule Association: Apriori, Htree
    – Clustering: Hierarchical Agglomerative, Kmeans, Coverage, etc.

• Better documentation for Toolkit and modules
• Includes visualizations for many of the modeling approaches
• Includes a set of data transformations
  • Attribute selection, binning, filtering, attribute construction

• Includes optimization strategy for searching parameter space
• Plus more…



D2K 3.0 Features

• Current Release downloadable off our website
• Extension of existing API
  • Provides the capability to programmatically connect modules and set properties
  • Allows D2K-driven applications to be developed
  • Provides ability to pause and restart an itinerary

• Enhanced Distributed Computing
  • Allows modules that are re-entrant to be executed remotely
  • Uses Jini services to look up distributed resources
  • Includes interface for specifying the runtime layout of a distributed itinerary

• Processor Status Overlay
  • Shows utilization of distributed computing resources

• Distributed Checkpointing
• Resource Manager
  • Provides a mechanism for treating selected data structures as if they were stored in global memory
  • Provides memory space that is accessible from multiple modules running locally as well as remotely



New D2K 4.0 Highlights

• Ability to use the web for deployment

• Ability for modules to run headless (with no GUI)

• Changed the way itineraries are saved
  • Stored in zip file
  • Itinerary is described in an XML format
  • Annotation is saved in HTML format
  • Additional data is stored in a serialized HashMap

• Table structure was re-implemented to improve performance and simplify the API

• Improvements of module selection, with area selection

• Support of copy and paste of selected modules



D2K Toolkit

1. Workspace
2. Resource Panel
3. Modules
4. Models
5. Itineraries
6. Visualizations
7. Generated Visualizations
8. Generated Models
9. Component Information
10. Toolbar
11. Console



D2K Modules

Input Module: Loads data from the outside world
• Flat files, database, etc.

Data Prep Module: Performs functions to select, clean, or transform the data
• Binning, Normalizing, Feature Selection, etc.

Compute Module: Performs main algorithmic computations
• Naïve Bayesian, Decision Tree, Apriori, etc.

User Input Module: Requires interaction with the user
• Data Selection, Input and Output selection, etc.

Output Module: Saves data to the outside world
• Flat files, databases, etc.

Visualization Module: Provides visual feedback to the user
• Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot



D2K Module Icon Description

Module Progress Bar: Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not.

Input Port: Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.

Properties Symbol: If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution.

Output Port: Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.



Resource Panel

The area to the left of the Workspace that contains the components necessary for data analysis
• Modules
• Models
• Itineraries
• Visualizations



D2K Itineraries

• Itineraries are partial or complete applications composed of connected modules

• D2K Core Itineraries include:
  • Prediction
  • Discovery
  • Anomaly Detection
  • Data Selection
  • Transformation
  • Visualization



Workspace

The Workspace is the area where applications are formed
• Modules are placed, connected, and properties set
• Itineraries are saved and executed



Session Panes

• Component Information
  • Shows detailed information about components of D2K
  • Shows module information, inputs, outputs, and property descriptions
  • Shows itinerary annotation

• Generated Visualization
  • Shows visualizations generated during this session
  • Provides ability to save these visualizations for later use

• Generated Models
  • Shows models generated during this session
  • Provides ability to save these models for later use



D2K Setup

• Preferences
  • Written to a file called “d2k.props”
  • Set up automatically the first time D2K is installed
  • Changed via Edit menu… Preferences…
  • Some changes do require restart of D2K
  • Check the User Manual for more details (available online)



Using the Toolkit

Build an itinerary for loading data and viewing it in a TableViewer

• Drag and Drop Modules from Modules Pane of Resource Panel to the Workspace as shown
  • Expand directory ncsa/io/file/input
    – Drag and Drop Input1Filename to Workspace
    – Drag and Drop CreateDelimitedParser to Workspace
    – Drag and Drop ParseFileToTable to Workspace
  • Expand directory ncsa/vis
    – Drag and Drop TableViewer to Workspace



Using the Toolkit (cont’d)

Connect the modules as shown

• Drag from the output port of one module to the input port of the next module

• Check the properties of modules by double clicking on the module
  • Input File Name
    – Choose data/UCI/iris.csv
  • Create Delimited File Parser
    – Defaults work
  • Parse File To Table
    – Defaults work

• Click Run to execute
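The data flow of this itinerary (file name, delimited parser, table, viewer) can be imitated outside D2K to see what each stage hands to the next. The sketch below is plain Python, not D2K code: the function names only echo the module names, and the sample file contents are hypothetical, since the layout of data/UCI/iris.csv is not shown here.

```python
import csv
import io

# Hypothetical file contents; the real data/UCI/iris.csv may be laid out differently.
SAMPLE = """sepal-length,sepal-width,petal-length,petal-width,flower-type
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

def input_file_name():
    """Stand-in for the Input File Name module: returns an open text source."""
    return io.StringIO(SAMPLE)

def create_delimited_parser(source, delimiter=","):
    """Stand-in for Create Delimited File Parser: wraps the source in a CSV reader."""
    return csv.reader(source, delimiter=delimiter)

def parse_file_to_table(parser):
    """Stand-in for Parse File To Table: first row is the header, the rest are records."""
    rows = [row for row in parser if row]
    return {"columns": rows[0], "rows": rows[1:]}

def table_viewer(table):
    """Stand-in for TableViewer: prints the table as text."""
    print(" | ".join(table["columns"]))
    for row in table["rows"]:
        print(" | ".join(row))

# Each output feeds the next input, like ports connected in the Workspace.
table = parse_file_to_table(create_delimited_parser(input_file_name()))
table_viewer(table)
```

Each function takes exactly what the previous one returns, which mirrors dragging from an output port to the next module's input port.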



Variation Using a Nested Itinerary

• An itinerary can be used as a module – nested itinerary

• Properties can be set by holding Control and double clicking on the nested itinerary

• Then connect the input and output ports of the nested itinerary as one would for any other module



PREDICTIVE MODELING

CLASSIFICATION
NAÏVE BAYESIAN


Naïve Bayesian Classification

• Applied to supervised learning problem

• Expects training examples with input and output attributes

• Single output attribute with small number of possible values for best performance

• Computes the distribution of each input associated with each class. For example, given the variable X with value xi, the probability of the example being in Class A may be greater than the probability of it being in Class B

Predictive Modeling: Naïve Bayesian

Mathematically speaking: if the class-conditional densities P(X | C) and the densities P(xi) and P(cj) (prior probabilities) are known, then the classifier assigns class cj to datum xi if cj has the highest posterior probability given the data
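In symbols, this decision rule follows from Bayes' theorem. The block below is the standard textbook form written in the slide's notation (xi for a datum, cj for a class); it is not taken from the D2K implementation.

```latex
P(c_j \mid x_i) = \frac{P(x_i \mid c_j)\, P(c_j)}{P(x_i)},
\qquad
\hat{c}(x_i) = \arg\max_{c_j} \; P(x_i \mid c_j)\, P(c_j)
```

Because P(xi) is the same for every class, it can be dropped when comparing posteriors, which is why only the class-conditional density and the class prior appear in the arg max.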


Bayesian Classification: Why?

• Probabilistic learning: Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems

• Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct

• Prior knowledge: Can be combined with observed data

• Standard:
  • Provide a standard of optimal decision making against which other methods can be measured
  • In a simpler form, provide a baseline against which other methods can be measured



Naïve Bayesian Classification

• Naïve assumption:
  • Feature independence

• P(xi|C) is estimated as the relative frequency of examples having value xi as feature in class C

• Computationally easy!!!
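The relative-frequency estimate and the independence assumption above can be sketched in a few lines. This is an illustrative sketch in plain Python, not the D2K module: the function names and the toy weather records are hypothetical, and no smoothing is applied, so a value never seen with a class gets probability zero.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def posterior(row, class_counts, value_counts):
    """Return normalized P(class | row) under the feature-independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for i, value in enumerate(row):
            # P(x_i | c) estimated as a relative frequency within class c
            score *= value_counts[(i, c)][value] / n_c
        scores[c] = score
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()} if norm else scores

# Hypothetical toy records: (outlook, windy) -> play
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"),
        ("rain", "yes"), ("sunny", "no"), ("rain", "no")]
labels = ["yes", "no", "yes", "no", "yes", "no"]

class_counts, value_counts = train(rows, labels)
result = posterior(("sunny", "no"), class_counts, value_counts)
print(result)
```

Training is a single counting pass over the data, which is the sense in which the method is computationally easy.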



Classification Applications Using Naïve Bayesian

• Predict a response to a marketing campaign

• Predict the most profitable customers for a product or service

• Classify applicants as high/med/low risk

• Predict which customers will leave for a competitor

• Predict whether email message is SPAM or not



Opening the Itinerary

• Click on the “Itinerary” Pane in the Resource Panel

• Expand the “Prediction” directory with a single click

• Double click on “NaïveBayes” to load the itinerary into your Workspace



Executing the Itinerary

• Check modules with properties
  • Double click to open property editor
  • Respond to User Interfaces that open

• Click Run button

• Respond to GUIs that pop up



PredictionTableReport for iris data

Double click on the PredictionTableReport to launch the report that shows the classification error and confusion matrix for the data



Naïve Bayesian Visualization

• Double click on the NaiveBayesVis to view the results

• The upper right hand pane shows the distribution of the classes

• The left hand pane shows the attributes and each of their values. They are listed by order of significance

• The message box shows details about each pie chart when brushed

• Clicking on a pie chart shows how knowing this information can change the overall class prediction

• Clicking on multiple pie charts calculates conditional probabilities

Notice Iris-versicolor has a 33% likelihood




What if scenarios…

• Click on petal-width of 1.3:1.9

• Now the probability of Iris-versicolor is 66.37%




What if scenarios… continue with conditional probabilities calculations

• Click on petal-length of 3.95:5.32

• Click on sepal-length of 5.28:6.15

• Now the probability of Iris-versicolor is 94.99%



Applying Models

• In Generated Models Session Pane, right click on the model and choose Save

• The saved model shows up in the Model View of the Resource Panel

• Click and drag the model into the workspace

• Connect the input and output of the model as shown



PREDICTIVE MODELING

CLASSIFICATION
Decision Trees


Decision Trees Classification

• Supervised learning problem

• Builds a model to classify one attribute based on other data attributes
  • Builds the tree by deciding how to split the data so that classification error is reduced

• Shown is a decision tree predicting whether one will play tennis based on some weather conditions
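Choosing a split so that classification error is reduced can be sketched as follows. This is a minimal illustration in plain Python, not the D2K decision tree modules: the weather records are hypothetical (they are not the data behind the tree on the slide), and practical learners such as C4.5 typically use entropy-based measures rather than raw error.

```python
from collections import Counter

def classification_error(labels):
    """Fraction of examples a majority-vote leaf would misclassify."""
    if not labels:
        return 0.0
    majority = Counter(labels).most_common(1)[0][1]
    return 1 - majority / len(labels)

def error_after_split(rows, labels, attribute):
    """Weighted classification error after splitting on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    n = len(labels)
    return sum(len(g) / n * classification_error(g) for g in groups.values())

def best_split(rows, labels):
    """Pick the attribute whose split leaves the lowest weighted error."""
    return min(rows[0].keys(), key=lambda a: error_after_split(rows, labels, a))

# Hypothetical weather records with a play-tennis label
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["no", "no", "yes", "yes", "yes", "no"]

print("error before split:", classification_error(labels))
for attribute in rows[0]:
    print(attribute, error_after_split(rows, labels, attribute))
print("best split:", best_split(rows, labels))
```

A full tree builder would apply `best_split` recursively to each resulting group until the groups are pure or too small to split.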

Predictive Modeling: Decision Trees


Applications Using Decision Trees

• Decision trees can solve both classification and regression problems

• Decision Trees work for many of the same problems as Naïve Bayesian analysis
  • Prediction of who should be given a loan
  • Prediction of high/med/low risk



PredictionTableReport for iris data

Double click on the PredictionTableReport to launch the report that shows the classification error and a confusion matrix for the data

Note: This is a very clean data set



Decision Tree Visualization

Two main panes

• Navigator Pane, shown in the top left, illustrates the full decision tree; the viewable portion of the tree is outlined with a black box

• Viewable Tree shows a chart of the percentages of the examples in each of the classes
  • Brushing indicates the percentages in the Brushing Pane

• Clicking on a small chart opens a larger view of the chart, showing the complete path taken to get to this node



Using the Model

• In Generated Models Session Pane, right click on the model and choose Save. The saved model shows up in the Model View of the Resource Panel

• Click and drag the model into the workspace (shown in the green circle; disconnect the items in the red blob)

• Connect the input and output of the model as shown
  • Results can be sent to the PredictionTableReport and to the DecisionTreeVis

• New (test) data can be examined with the model



DISCOVERY
RULE ASSOCIATION

Using fp-growth


Market Basket Example

Is soda typically purchased with bananas? Does the brand of soda make a difference?

Where should detergents be placed in the store to maximize their sales?

Are window cleaning products purchased when detergents and orange juice are bought together?

How are the demographics of the neighborhood affecting what customers are buying?

Discovery: Rule Association


Association Rules

• There has been a considerable amount of research in the area of Market Basket Analysis. Its appeal comes from the clarity and utility of its results, which are expressed in the form of association rules

• Given
  • Database of transactions
  • Each transaction contains a set of items

• Find all rules X->Y that correlate the presence of one set of items X with another set of items Y
  • Example: When a customer buys bread and butter, they buy milk 85% of the time



Overview

• Unsupervised learning problem

• Find all rules that correlate the presence of one set of items X with another item Y
  • Example: When a customer buys bread and butter, they buy milk 85% of the time

• Support is the percentage of the records that contain both X and Y
  • A rule must have some minimum user-specified support to show its impact

• Confidence is the percentage of records that contain X and Y out of the number of records that contain X
  • A rule must have some minimum user-specified confidence to show its value
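The two measures can be written compactly. This is the usual textbook formulation; T, the set of all transactions, and support(X), the fraction of transactions containing X, are symbols introduced here and do not appear on the slide.

```latex
\mathrm{support}(X \rightarrow Y)
  = \frac{\bigl|\{\, t \in T : X \cup Y \subseteq t \,\}\bigr|}{|T|},
\qquad
\mathrm{confidence}(X \rightarrow Y)
  = \frac{\mathrm{support}(X \rightarrow Y)}{\mathrm{support}(X)}
```

Support measures how often the whole rule applies across the database, while confidence measures how often the consequent holds once the antecedent is present.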



Results: Useful, Trivial, or Inexplicable?

• While association rules are easy to understand, they are not always useful

Useful: On Fridays, convenience store customers often purchase diapers and beer together

Trivial: Customers who purchase maintenance agreements are very likely to purchase large appliances

Inexplicable: When a new Super Store opens, one of the most commonly sold items is light bulbs



How Does It Work?

Grocery Point-of-Sale Transactions

Customer  Items
1         Orange Juice, Soda
2         Milk, Orange Juice, Window Cleaner
3         Orange Juice, Detergent
4         Orange Juice, Detergent, Soda
5         Window Cleaner, Soda

Co-Occurrence of Products

                OJ  Window Cleaner  Milk  Soda  Detergent
OJ              4   1               1     2     2
Window Cleaner  1   2               1     1     0
Milk            1   1               1     0     0
Soda            2   1               0     3     1
Detergent       2   0               0     1     2
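Co-occurrence counts like these can be derived directly from the five point-of-sale transactions. The following is a small sketch in plain Python (the function names are illustrative, not part of D2K) that counts how often each pair of products appears in the same basket, with a product paired with itself giving its total count.

```python
from collections import Counter
from itertools import combinations_with_replacement

# The five grocery transactions listed on the slide
transactions = [
    {"Orange Juice", "Soda"},
    {"Milk", "Orange Juice", "Window Cleaner"},
    {"Orange Juice", "Detergent"},
    {"Orange Juice", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},
]

def co_occurrence(transactions):
    """Count, for every unordered product pair, the baskets containing both."""
    counts = Counter()
    for items in transactions:
        # Sorting makes each pair a canonical (a, b) key with a <= b
        for a, b in combinations_with_replacement(sorted(items), 2):
            counts[(a, b)] += 1
    return counts

counts = co_occurrence(transactions)

def together(a, b):
    """Look up a pair regardless of argument order."""
    return counts[tuple(sorted((a, b)))]

products = sorted({item for t in transactions for item in t})
for a in products:
    print(a, [together(a, b) for b in products])
```

Counting pairs this way takes one pass over the transactions, but the number of possible pairs grows quadratically with the number of distinct products.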


• In the data, two of five transactions include both soda and orange juice
  • These two transactions support the rule
  • Support for the rule is two out of five, or 40%

• Two of the three transactions that contain soda also contain orange juice
  • There is a fairly high degree of confidence in the rule

• So the rule IF soda THEN orange juice has a confidence of two out of three, or about 67%
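Support and confidence for the soda and orange juice rule can be checked mechanically against the five transactions. This is a minimal sketch in plain Python with hypothetical function names; it follows the definitions given earlier (support as the fraction of transactions containing all items of the rule, confidence as that support divided by the support of the antecedent).

```python
# The five grocery transactions from the example
transactions = [
    {"Orange Juice", "Soda"},
    {"Milk", "Orange Juice", "Window Cleaner"},
    {"Orange Juice", "Detergent"},
    {"Orange Juice", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},
]

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by support of the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule is never triggered
    return support(transactions, set(antecedent) | set(consequent)) / base

rule_support = support(transactions, {"Soda", "Orange Juice"})
soda_to_oj = confidence(transactions, {"Soda"}, {"Orange Juice"})
oj_to_soda = confidence(transactions, {"Orange Juice"}, {"Soda"})
print(rule_support, soda_to_oj, oj_to_soda)
```

Support is symmetric in the two sides of a rule, but confidence is not, so the two directions of the rule are computed separately.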


Confidence and Support - How Good Are the Rules

• A rule must have some minimum user-specified confidence
  • 1 and 2 -> 3 has a 90% confidence if when a customer bought 1 and 2, in 90% of the cases, the customer also bought 3

• A rule must have some minimum user-specified support
  • 1 and 2 -> 3 should hold in some minimum percentage of transactions to have value



Association Examples

• Find all rules that have “Diet Coke” as a consequent (result)

• These rules may help plan what the store should do to boost the sales of Diet Coke

• Find all rules that have “Yogurt” in the antecedent (condition)

• These rules may help determine what products may be impacted if the store discontinues selling “Yogurt”

• Find all rules that have “Brats” in the antecedent and “mustard” in the consequent

• These rules may help in determining the additional items that have to be sold together to make it highly likely that mustard will also be sold

• Find the best k rules that have “Yogurt” in the result



Basic Process

• Choosing the right set of items
  • Taxonomies
  • Virtual Items
  • Anonymous versus Signed

• Generation of rules
  • If condition Then result
  • Negation/Dissociation
  • Improvement

• Overcoming the practical limits imposed by thousands or tens of thousands of products
  • Minimum Support Pruning
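Minimum support pruning can be illustrated with a level-wise search: itemsets below the threshold are discarded, and larger candidates are built only from the survivors. Note that this sketch uses an Apriori-style search in plain Python, not the fp-growth algorithm used in the itinerary for this exercise; the function name and the 40% threshold are illustrative assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search that prunes itemsets below min_support (a fraction)."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(survivors)
        # Candidates one item larger, built only from surviving itemsets
        size = len(next(iter(survivors))) + 1 if survivors else 0
        candidates = {a | b for a, b in combinations(survivors, 2) if len(a | b) == size}
        # Keep a candidate only if every subset of the previous size survived
        current = [c for c in candidates
                   if all(frozenset(s) in survivors for s in combinations(c, size - 1))]
    return frequent

transactions = [
    {"Orange Juice", "Soda"},
    {"Milk", "Orange Juice", "Window Cleaner"},
    {"Orange Juice", "Detergent"},
    {"Orange Juice", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},
]

frequent = frequent_itemsets(transactions, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Pruning works because an itemset can never be more frequent than any of its subsets, so a rare item such as Milk removes every larger combination that contains it from consideration.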



Strengths and Weaknesses

Strengths

• It produces easy to understand results

• It supports undirected data mining

• It works on variable length data

• Rules are relatively easy to compute

Weaknesses

• It is an exponential growth algorithm

• It is difficult to determine the optimal number of items

• It discounts rare items

• It provides limited support for data attributes

• It produces many rules

• For large numbers of attribute-value combinations, considerable CPU and memory resources are consumed




• Click on the “Itinerary” Pane in the Resource Panel

• Expand the “Discovery” directory with a single click

• Expand the “RuleAssociation” directory with a single click

• Double click on “fp-growth” to load the itinerary into your Workspace

Discovery: Rule Association Using fp-growth


Executing the Itinerary

• Check modules with properties
  • Double click to open property editor
    • fp-growth
    • Compute Confidence

• Respond to User Interfaces that open

• Click Run button



Rule Association Visualization

• Read rules down the column

• Example - the first rule is
  • If petal-width Binned=[…:0.7] then flower-type=Iris-setosa
  • Support = 25%
  • Confidence = 100%

• Brush the bars to find out support and confidence levels

• Different sorting schemes
  • Sort by Confidence
  • Sort by Support
  • Alphabetize button sorts the attribute-value pairs alphabetically

• Rank button sorts the rows based on the current Confidence/Support selection, moving the consequents and antecedents of the highest ranking rules to the top of the attribute-value list



Choosing the Right Set of Items

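[Figure: partial product taxonomy, running from general to specific. Frozen Foods divides into Frozen Desserts, Frozen Vegetables, and Frozen Dinners; the next level lists Frozen Yogurt, Frozen Fruit Bars, Ice Cream, Peas, Carrots, Mixed, and Other; the most specific level lists Rocky Road, Chocolate, Strawberry, Vanilla, Cherry Garcia, and Other.]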



Other Association Rule Applications

• Quantitative Association Rules
  • Age[35..40] and Married[Yes] -> NumCars[2]

• Association Rules with Constraints
  • Find all association rules where the prices of items are > 100 dollars

• Temporal Association Rules
  • Diaper -> Beer (1% support, 80% confidence)
  • Diaper -> Beer (20% support) 7:00-9:00 PM weekdays

• Optimized Association Rules
  • Given a rule (l < A < u) and X -> Y, find values for l and u such that support is greater than a certain threshold and that maximize support, confidence, or gain
  • ChkBal [$30,000 .. $50,000] -> JumboCD = Yes



DISCOVERY

CLUSTERING


Overview

• Unsupervised learning problem

• Group all examples that are similar

• View results with dendrogram or parallel coordinates

• Provide several different clustering algorithms
  • Kmeans
  • Buckshot
  • Fractionation
  • Coverage

Discovery: Clustering


Clustering Algorithms

• KMeans clustering
  • A sample set containing Number of Clusters rows is chosen from an input table of examples and used as initial cluster centers
  • These initial clusters undergo a series of assignment/refinement iterations, resulting in a final cluster model

• Buckshot clustering
  • A sample of size Sqrt(Number of Clusters * Number of Examples) is chosen at random from the table of examples
  • This sampling is sent through the hierarchical agglomerative clustering module to form Number of Clusters clusters. These clusters' centroids are used as the initial "means" for the cluster assignment module. The assignment module, once it has made refinements, outputs the final Cluster Model

• Coverage clustering
  • A sample set is created from the input table such that it is approximately the minimum number of samples needed so that, for every example in the input table, there is at least one example in the sample set within Distance Threshold (% of Maximum) of it
  • This sampling is sent through the hierarchical agglomerative clustering module to form Number of Clusters clusters. These clusters' centroids are used as the initial "means" for the cluster assignment module. The assignment module, once it has made refinements, outputs the final Cluster Model
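The sample-then-refine pattern shared by these algorithms can be sketched for the KMeans case. This is a minimal illustration in plain Python, not the D2K modules: the function names, the fixed iteration count, the seed, and the toy points are assumptions, and the rounding in `buckshot_sample_size` is a guess, since the slide gives only the square-root formula.

```python
import math
import random

def sample_centers(points, k, seed=0):
    """Choose k rows of the input table as initial cluster centers."""
    return random.Random(seed).sample(points, k)

def kmeans(points, centers, iterations=10):
    """Alternate assignment and refinement for a fixed number of iterations."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(iterations):
        # Assignment: each point joins the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Refinement: each center moves to the mean of its members
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else center
            for center, members in zip(centers, clusters)
        ]
    return centers, clusters

def buckshot_sample_size(k, n):
    """Sqrt(Number of Clusters * Number of Examples), truncated to an integer."""
    return int(math.sqrt(k * n))

# Hypothetical two-dimensional examples forming two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, sample_centers(points, 2))
print(centers)
print(buckshot_sample_size(2, len(points)))
```

Buckshot and Coverage differ from this sketch only in how the initial centers are obtained: they cluster a sample agglomeratively first and pass the resulting centroids to the same assignment and refinement loop.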



Clustering Algorithms (2)

• Fractionation
  • Sorts the initial examples (converted to clusters) by a key attribute denoted by Sort Attribute
  • The set of sorted clusters is then segmented into equal partitions of size maxPartitionsize
  • Each of these partitions is then passed through the agglomerative clusterer to produce numberOfClusters clusters
  • All the clusters are gathered together for all partitions and the entire process is repeated until only Number of Clusters clusters remain. The sorting step encourages like clusters to fall into the same partitions




• Click on “Itinerary” Pane in the Resource Panel

• Expand the “Discovery” directory

• Expand the “Clustering” directory

• Double click on “BuckshotClusterer”



Clustering Results

Dendrogram or Parallel Coordinates



DEVIATION DETECTION VISUALIZATIONS

PARALLEL COORDINATES
SCATTERPLOT


Itinerary

• Visualization to detect outliers and patterns

• Expand the vis directory and load the “ParallelCoordinate” itinerary

Deviation Detection: Parallel Coordinates


Parallel Coordinates - Visualization

• Each vertical line represents an attribute with the minimum and maximum values shown at bottom and top

• Each record has a line that connects its values at each attribute

• Lines are colored based on the output attribute

• Clicking and dragging on the label boxes allows the attributes to be rearranged

• Zooming is accomplished by dragging a box over the desired area. Clicking returns to the original view
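The layout described above amounts to scaling each attribute independently so that its minimum sits at the bottom of its axis and its maximum at the top. Below is a small sketch in plain Python of that scaling; the function name and the sample records are hypothetical, and the actual D2K visualization may handle details such as constant columns differently.

```python
def axis_positions(records):
    """Map each record to one position per attribute: 0 = column minimum, 1 = column maximum."""
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column has no range, so its values are drawn mid-axis
        [(v - lo) / (hi - lo) if hi > lo else 0.5 for v, lo, hi in zip(row, lows, highs)]
        for row in records
    ]

# Hypothetical records with two numeric attributes
records = [(1, 10), (3, 20), (2, 40)]
positions = axis_positions(records)
for record, heights in zip(records, positions):
    print(record, heights)
```

Each inner list gives the heights at which one record's line crosses the vertical axes, so a record whose heights differ sharply from the rest shows up as a line that departs from the main bundle.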



Scatterplots – Itinerary

• Visualization to detect outliers and patterns

• Load the “scatterplot” itinerary

Deviation Detection: Scatterplots


Scatterplots – Visualization



Small Multiples of Scatterplots - Itinerary

Deviation Detection: Small Multiples


Small Multiples of Scatterplots Vis



Small Multiples of Linear Regressions Vis



D2K Streamline (D2K SL)

• Reduces the learning curve associated with the KDD process

• Encompasses discovery, prediction and deviation detection techniques

• Saves and applies models to new data sets easily

• Supports return to earlier steps in the KDD process to run with different parameters

• Uses the D2K Infrastructure transparently

D2K SL


New D2K User Interface – D2K SL

• Provides step by step interface to guide user in data analysis

• Uses same D2K modules

• Provides way to capture different experiments (streams)



Another View of the New D2K User Interface – D2K SL

• Help users keep track of data

• Define templates that can be reused in different experiments (streams)



The ALG Team

Staff: Loretta Auvil, Ruth Aydt, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge

Students: Tyler Alumbaugh, Peter Groves, Olubanji Iyun, Sang-Chul Lee, Xiaolei Li, Brian Navarro, Jeff Ng, Scott Ramon, Sunayana Saha, Martin Urban, Bei Yu, Hwanjo Yu


Licensing D2K

• Faculty, staff and students at US academic institutions will be able to license and use D2K for free by downloading from alg.ncsa.uiuc.edu

• Private Sector Partners who have provided funding for projects related to D2K will be able to license and use D2K for free

• Private Sector Partners who have not provided funding will be able to license and use D2K for a discounted fee

Contact John McEntire
Office of Technology Management
308 Ceramics Building, MC-243
105 South Goodwin Avenue
Urbana, Illinois 61801-2901
(217) [email protected]