IBM Software Group
© 2004 IBM Corporation
Knowledge Discovery and Data Mining
Toni Bollinger
IBM Development Lab, Böblingen, Germany
IBM Software Group | DB2 information management software
IBM Development Lab Böblingen
IBM Development Lab Böblingen
(Organization chart; labels as shown on the slide)
Linux for zSeries; Systems Management
z/OS Components; eLiza Initiative
SAP Solutions; VSE
Competence Center; Linux CoC; Executive Briefing Center
Server Group
Data Management
Content Management / Information Integration
CommonStore; Text Mining; Text Search
DB2 UDB; DB2 Extenders
Business Intelligence; Data Mining
DB2 Performance Tools; SAP DB2 Multiplatform
DM Services; Technical Marketing
Business Partner & Sales Support
Software Group
Hardware
Systems Microcode; Embedded Service Controllers; Server System Nest; Simulation of ESG systems; Server Micropro's
Software
Platform Technology; Platform Strategy
Application Integration & Pervasive
Life Sciences; WebSphere Business Integrator; Financial Networks; MERVA CoreBank
Industry Solutions; IGS Development & Services
SW Solutions & Services
WebSphere; Workflow; WebSphere Portal Server; Pervasive Computing
Speech; REXX; Office Solutions; ASIC Design Center
PRIZMA Switch; OEM Micropro's
Technology Group
IBM Research and Development – world wide
(World map) Legend: Research; Hardware Development; Software Development; Hardware and Software Development
Locations: Greenock, Dublin, Hursley, Havant, Böblingen, Zürich, La Gaude, Rome, Rochester, Boulder, Almaden, San Jose, Santa Teresa, Tucson, Austin, Toronto, Burlington, Endicott, East Fishkill, Poughkeepsie, Yorktown Heights, Raleigh, Yasu, Tokyo, Fujisawa, Yamato, Delhi, Haifa, Beijing
2001: 3411 US Patents
Outline of the talk
§ Introduction
§ Data Mining Techniques
  – Association Rules
  – Clustering
  – Tree Classification
§ Data Mining Methodology
  – Data Mining Process
§ KDD & Data Mining - Where are we now?
  – Data Mining Standards
  – Data Mining Solutions
Introduction
§ IJCAI 1989, Workshop on Knowledge Discovery in Databases
  – Introduction of the term "Knowledge Discovery"
  – Brought together researchers from different disciplines
    – Artificial Intelligence
      – Machine learning
      – Neural networks
    – Statistics
    – Database research
    – Visualization
  – Motivation
    – Amount of data is growing exponentially
    – Ability to understand these data is lagging behind
    – Growing need for intelligent analysis techniques
Introduction
§ D. Michie, 1990
  – "the area that is going to explode is the use of machine learning tools as a component of large-scale data analysis"
§ Silberschatz, Stonebraker, Ullman: "Database Systems: Achievements and Opportunities"
  – Data mining ranked among the most promising research topics for the 1990s
Definition of the term "knowledge discovery"
§ W. Frawley, G. Piatetsky-Shapiro, C. Matheus:
  Knowledge discovery is
  – the nontrivial extraction
  – of previously unknown and
  – potentially useful information
  from data.
Challenges for Knowledge Discovery
§ Real data
  – Different data types:
    – Numeric
    – Categorical
  – Structured/unstructured information
  – High/low number of different values
  – Missing/invalid values
  – Inconsistencies, noise in data
    – Different encodings of the same information
§ High volumes of data
  – Learning sets in machine learning: up to 1000 tuples.
  – Number of records in real data: several millions.
Business Trends in the 1990s
§ Customer Relationship Management
  – Knowing more about the customer for enhancing the customer lifetime value
    – Cross and up selling
    – Target marketing
    – Customer acquisition and churn prevention
§ Supply Chain Management
§ E-business, Internet
  – All interactions with the customers are through a computer
  – Enormous amounts of data available
  – Application areas
    – Personalization
    – Click stream analysis
    – Product recommendations
Data Warehousing, Business Intelligence
§ Data is extracted from the operational systems and stored in a data warehouse in a systematic and unified way.
§ Data in the warehouse represents the "truth" for a company.
§ Data is used for decision support
§ Analysis techniques
  – SQL queries, standard reporting
    – Revenue of the month, the 5 most and the 5 least frequently sold products, ...
  – Online Analytic Processing (OLAP)
    – Revenue per month per product per region at different hierarchy levels
    – An OLAP cube represents a set of different queries
  – Data Mining
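The statement that an OLAP cube represents a set of different queries can be illustrated with a minimal sketch: each subset of the dimensions corresponds to one aggregation query. The sales rows, the dimension names and the helper functions `group_by` and `cube` below are hypothetical and only serve the illustration; a real OLAP engine would also handle hierarchy levels within each dimension.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (month, product, region, revenue)
SALES = [
    ("2004-01", "juice", "north", 120.0),
    ("2004-01", "beer", "north", 80.0),
    ("2004-01", "juice", "south", 50.0),
    ("2004-02", "juice", "north", 70.0),
    ("2004-02", "beer", "south", 30.0),
]
DIMENSIONS = ("month", "product", "region")


def group_by(rows, dims):
    """Sum the revenue per combination of the chosen dimensions (one GROUP BY query)."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[3]
    return dict(totals)


def cube(rows):
    """One aggregate per subset of the dimensions, from the grand total to full detail."""
    return {
        dims: group_by(rows, dims)
        for n in range(len(DIMENSIONS) + 1)
        for dims in combinations(DIMENSIONS, n)
    }


c = cube(SALES)
print(len(c), "aggregation queries in the cube")
print("revenue per product:", c[("product",)])
```

With three dimensions the cube bundles 2^3 = 8 aggregations that would otherwise be written as separate SQL queries.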
Data Mining <-> OLAP, Standard Reporting
§ SQL, OLAP: problem -> hypotheses -> verification of the hypotheses -> known correlations
§ Data Mining: generation of hypotheses -> known correlations + unknown correlations
Data Mining Techniques
§ Discovery – unsupervised learning
  – Clustering
  – Association Rules
  – Sequential Patterns
§ Prediction – supervised learning
  – Classification
  – Regression
Association Rules Discovery
§ Discovery of rules in data
  – Which combinations/events occur simultaneously with a high frequency?
  – Which combinations are unusual?
§ Application areas
  – Market basket analysis
    – If someone buys low fat margarine then s/he buys brie cheese.
  – Web log analysis
    – If someone visits page x s/he visits page y as well.
  – Quality management
    – If defect x occurs then defect y occurs as well.
Association rules attributes
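§ 100 transactions
  – 50 transactions with low fat margarine
  – 30 transactions with brie cheese
  – 20 transactions with both
§ Rule attributes
  – support = 20/100 = 20%
  – confidence = support of the rule / support of the rule body
    – low fat margarine → brie cheese: 20/50 = 40%
    – brie cheese → low fat margarine: 20/30 = 67%
  – lift = confidence / support of the rule head = 40%/30% = 67%/50% = 1.3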
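The three rule attributes can be computed directly from the four counts on this slide. The following is a minimal sketch; the function name `rule_metrics` and its parameter names are assumptions made for illustration. It shows that the confidence depends on the direction of the rule, while support and lift do not.

```python
def rule_metrics(n_total, n_body, n_head, n_both):
    """Support, confidence and lift of the rule body -> head from transaction counts."""
    support = n_both / n_total              # share of transactions containing body and head
    confidence = n_both / n_body            # share of body transactions that also contain the head
    lift = confidence / (n_head / n_total)  # confidence relative to the head's overall frequency
    return support, confidence, lift


# Counts from the slide: 100 transactions, 50 with margarine, 30 with brie, 20 with both.
margarine_to_brie = rule_metrics(100, n_body=50, n_head=30, n_both=20)
brie_to_margarine = rule_metrics(100, n_body=30, n_head=50, n_both=20)

for name, (s, c, l) in [("margarine -> brie", margarine_to_brie),
                        ("brie -> margarine", brie_to_margarine)]:
    print(f"{name}: support={s:.0%} confidence={c:.0%} lift={l:.2f}")
```

A lift above 1 means the two items occur together more often than would be expected if they were independent.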
Association rules discovery
§ Discovery problem:
  – Given
    – a set of transactions T,
    – values for minimum support and minimum confidence,
  – find all rules A → B that satisfy these constraints
§ Discovery algorithm
  1. Find all item sets with minimum support. These are called frequent item sets.
  2. For each frequent item set I_F
     1. compute all partitions into two disjoint, non-empty subsets I_F1 and I_F2
     2. compute the confidence of the rule I_F1 → I_F2
     3. and keep those rules with a confidence greater than or equal to the minimum confidence
Apriori algorithm for frequent item sets (R. Agrawal, R. Srikant)
§ Property of frequent item sets:
  – If an item set is frequent, every subset is frequent as well.
1. N=1; determine all frequent item sets with 1 element: IS(1)
2. From the frequent item sets with N elements, IS(N), build the candidate set for frequent item sets with N+1 elements.
3. Determine the support for the candidate item sets and retain those with at least minimum support: IS(N+1)
4. If IS(N+1) == {}, return IS(1) ∪ IS(2) ∪ ... ∪ IS(N)
5. N=N+1; go to 2
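The five steps can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in Intelligent Miner; the function name `apriori` and the use of absolute support counts instead of percentages are assumptions. The sample data are the five transactions of the example that follows in this deck.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(items): support count} for all frequent item sets."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Step 1: frequent item sets with one element.
    items = {item for t in transactions for item in t}
    level = {c: n for c, n in count({frozenset([i]) for i in items}).items()
             if n >= min_count}
    frequent = dict(level)
    size = 1
    while level:
        # Step 2: candidates with size+1 elements, built from the frequent sets of this size.
        candidates = {a | b for a in level for b in level if len(a | b) == size + 1}
        # Use the subset property: every subset with `size` elements must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, size))}
        # Step 3: count the support and keep the candidates with minimum support.
        level = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(level)
        # Steps 4 and 5: stop when no candidate survives, otherwise continue.
        size += 1
    return frequent


TRANSACTIONS = [
    {"juice", "softdrink", "beer"},   # T1
    {"juice", "softdrink", "wine"},   # T2
    {"juice", "water", "beer"},       # T3
    {"juice", "beer", "softdrink"},   # T4
    {"water"},                        # T5
]

frequent = apriori(TRANSACTIONS, min_count=2)  # 40 % of 5 transactions
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The pruning in step 2 is what keeps the candidate sets small: an item set is only counted if all of its subsets have already been found to be frequent.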
Example:
§ minimum support 40 % (2 transactions)
§ minimum confidence 70 %
§ transaction table

  transaction  item
  T1           juice
  T1           softdrink
  T1           beer
  T2           juice
  T2           softdrink
  T2           wine
  T3           juice
  T3           water
  T3           beer
  T4           juice
  T4           beer
  T4           softdrink
  T5           water
Frequent 1-element item sets

  item       support
  juice      4
  softdrink  3
  beer       3
  wine       1
  water      2
Frequent 2-element item sets

  2-element candidates  support
  juice, softdrink      3
  juice, beer           3
  juice, water          1
  softdrink, beer       2
  softdrink, water      0
  beer, water           1
Frequent 3-element item sets

  3-element candidates    support
  juice, softdrink, beer  2
Determining the confidence
  rule                        support  rule body support  confidence
  juice → softdrink           3        4                  75 %
  softdrink → juice           3        3                  100 %
  juice → beer                3        4                  75 %
  beer → juice                3        3                  100 %
  softdrink → beer            2        3                  67 %
  juice & softdrink → beer    2        3                  67 %
  softdrink & beer → juice    2        2                  100 %
  beer & juice → softdrink    2        3                  67 %
  beer → softdrink            2        3                  67 %
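The confidence values can be recomputed from the support counts of the frequent item sets in this example. The sketch below is illustrative; the dictionary `SUPPORT` and the function `rules` are names chosen here, not part of any product. Unlike the table, it enumerates every split of a frequent item set into body and head, including rules with more than one item in the head, and keeps only the rules that reach the minimum confidence of 70 %.

```python
from itertools import combinations

# Support counts of the frequent item sets of the example (5 transactions, minimum support 2).
SUPPORT = {
    frozenset(["juice"]): 4,
    frozenset(["softdrink"]): 3,
    frozenset(["beer"]): 3,
    frozenset(["water"]): 2,
    frozenset(["juice", "softdrink"]): 3,
    frozenset(["juice", "beer"]): 3,
    frozenset(["softdrink", "beer"]): 2,
    frozenset(["juice", "softdrink", "beer"]): 2,
}


def rules(support, min_confidence):
    """Return (body, head, confidence) for every rule reaching min_confidence."""
    found = []
    for itemset, n_rule in support.items():
        for k in range(1, len(itemset)):            # size of the rule body
            for body in combinations(sorted(itemset), k):
                body = frozenset(body)
                # Every subset of a frequent item set is frequent, so its count is known.
                confidence = n_rule / support[body]
                if confidence >= min_confidence:
                    head = itemset - body
                    found.append((tuple(sorted(body)), tuple(sorted(head)), confidence))
    return found


for body, head, confidence in rules(SUPPORT, min_confidence=0.7):
    print(" & ".join(body), "->", " & ".join(head), f"{confidence:.0%}")
```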
Association sample with Intelligent Miner Visualization
Association sample with Intelligent Miner Visualization – graphical view
Clustering - Task
§ Given a set of data records (a relational table)
§ Find a partitioning of this set into disjoint subsets (clusters, segments) such that the elements within a subset have a high similarity and the elements of different subsets have a high dissimilarity.
Clustering Methods
§ Neural networks
  – Kohonen Feature Maps
    – for numeric data
    – categorical data has to be transformed to numeric data
    – number of clusters given by the size of the network
§ Statistical methods
  – K-means Clustering
    – for numeric data
    – categorical data has to be transformed to numeric data
    – number of clusters has to be specified by the user
§ Demographic clustering (IBM DB2 Intelligent Miner)
  – Initially for categorical data only
  – Extended to deal with numeric data as well
  – Number of clusters detected by the clustering algorithm
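To make the k-means entry concrete, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) on numeric tuples. The sample points and the choice of the first k points as starting centroids are assumptions made for illustration; the number of clusters k has to be passed in by the caller, as noted on the slide.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on numeric tuples; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as start
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move every centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical numeric records, e.g. two scaled customer attributes.
POINTS = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centroids, clusters = kmeans(POINTS, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Categorical attributes would first have to be encoded as numbers before they can enter the distance computation.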
Application areas
§ Customer segmentation based on shopping behaviour and demographic data
  – Enables targeted marketing actions
§ Store segmentation/profiling
  – Product offering can be adapted to the characteristics of the segment the store belongs to
§ Fraud detection
  – Outliers and unusual behaviour can be contained in small clusters, niches
Clustering example – online banking
Input records:

     revenue    gender  domicile  age
  A  high       M       urban     < 30
  B  low        M       rural     30-40
  C  middle     F       urban     < 30
  D  very high  M       urban     > 40
  E  high       F       urban     < 30
  F  low        M       rural     > 40

Segmentation algorithm: pairwise similarity (number of matching attribute values)

     A  B  C  D  E  F
  A  4  1  2  2  3  1
  B  1  4  0  1  0  3
  C  2  0  4  1  3  0
  D  2  1  1  4  1  2
  E  3  0  3  1  4  0
  F  1  3  0  2  0  4

The same matrix, reordered so that similar records are adjacent:

     A  C  E  B  F  D
  A  4  2  3  1  1  2
  C  2  4  3  0  0  1
  E  3  3  4  0  0  1
  B  1  0  0  4  3  1
  F  1  0  0  3  4  2
  D  2  1  1  1  2  4

Result: 3 segments

  Segment 1:
     revenue    gender  domicile  age
  A  high       M       urban     < 30
  C  middle     F       urban     < 30
  E  high       F       urban     < 30

  Segment 2:
     revenue    gender  domicile  age
  B  low        M       rural     30-40
  F  low        M       rural     > 40

  Segment 3:
     revenue    gender  domicile  age
  D  very high  M       urban     > 40
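The similarity matrix of this example can be reproduced with a short sketch. The slide does not name the similarity measure; the assumption here is that the similarity of two records is the number of attributes on which they agree, which yields exactly the values shown. The names `RECORDS` and `similarity` are chosen for illustration.

```python
# The six records of the online banking example: (revenue, gender, domicile, age).
RECORDS = {
    "A": ("high", "M", "urban", "< 30"),
    "B": ("low", "M", "rural", "30-40"),
    "C": ("middle", "F", "urban", "< 30"),
    "D": ("very high", "M", "urban", "> 40"),
    "E": ("high", "F", "urban", "< 30"),
    "F": ("low", "M", "rural", "> 40"),
}


def similarity(r1, r2):
    """Number of attributes with identical values (assumed measure, see lead-in)."""
    return sum(1 for a, b in zip(r1, r2) if a == b)


matrix = {a: {b: similarity(RECORDS[a], RECORDS[b]) for b in RECORDS} for a in RECORDS}

print("   " + "  ".join(RECORDS))
for a in RECORDS:
    print(a, " " + "  ".join(str(matrix[a][b]) for b in RECORDS))
```

Within the segments {A, C, E} and {B, F} every pair agrees on three attributes, while D agrees with every other record on at most two, which is consistent with D forming a segment of its own.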
Clustering Example with IM Visualization
Prediction
§ Training mode: "historical data"
  – independent variables: paintype, angina, num vessels, thal
  – dependent variable: diseased

  paintype  angina  num vessels  thal  diseased
  5         0       3            3     Y
  2         0       0            7     N

§ Test mode: "historical data"
  – Comparison: actual - predicted value

  paintype  angina  num vessels  thal  diseased
  3         1       2            2     N
  1         0       2            4     N

  – predicted values for diseased: Y, N
Prediction
§ Application mode: "new data"

  paintype  angina  num vessels  thal
  4         1       1            5
  3         0       0            7

  – predicted values for diseased: Y, N
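The three modes (training, test, application) can be sketched with any supervised technique. The sketch below uses a 1-nearest-neighbour classifier purely as a stand-in because it fits in a few lines; the slides do not say which technique produced the predictions shown. The rows are the ones on the slides, and the function names `train`, `predict` and `evaluate` are illustrative.

```python
# Rows from the slides: (paintype, angina, num vessels, thal) and the label "diseased".
TRAIN = [((5, 0, 3, 3), "Y"), ((2, 0, 0, 7), "N")]
TEST = [((3, 1, 2, 2), "N"), ((1, 0, 2, 4), "N")]
NEW = [(4, 1, 1, 5), (3, 0, 0, 7)]


def train(rows):
    """Training mode: a 1-nearest-neighbour 'model' simply memorises the labelled rows."""
    return list(rows)


def predict(model, x):
    """Return the label of the training row closest to x (squared Euclidean distance)."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], x))
    return min(model, key=sq_dist)[1]


def evaluate(model, rows):
    """Test mode: share of rows whose predicted value equals the actual value."""
    hits = sum(1 for x, actual in rows if predict(model, x) == actual)
    return hits / len(rows)


model = train(TRAIN)                             # training mode on historical data
print("test accuracy:", evaluate(model, TEST))   # test mode on held-out historical data
for x in NEW:                                    # application mode on new data
    print(x, "->", predict(model, x))
```

With only two training rows the model is of course not meaningful; the point is the separation between building the model, checking it against known outcomes, and applying it to records whose outcome is unknown.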
Some prediction techniques
§ Prediction of categorical values – classification
  – neural networks: back propagation
  – decision trees
  – rule induction
§ Prediction of numeric values – regression
  – neural networks: back propagation
  – linear, polynomial, logistic regression
  – radial basis functions
  – decision trees
  – support vector machines
Application areas
§ Churn prevention, in particular of profitable customers
§ Prediction of credit worthiness
§ Prediction of interest in marketing campaigns
§ Analysis of quality problems in manufacturing
§ ...
Decision trees
Confusion Matrix
Gains chart
Gains chart – comparison between two models
Knowledge Discovery Methodology
§ Challenges in Knowledge Discovery
  – Real data
  – Huge volumes of data
  – Completely automatic discovery is not realistic
§ Discovered knowledge should be useful
  – You have to know what kind of information you are interested in.
  – The purpose is important.
KDD Process
§ According to CRISP-DM: "CRoss Industry Standard Process for Data Mining"
Distribution of the effort
Modeling 5%
Data Acquisition 40%
Data Pre-Processing 30%
5%
Model Deployment
10%
Data Cleansing/Transformation
10%
Data Discovery/Modeling
Data Mining Products - Workbenches
§ SPSS Clementine
§ IBM Intelligent Miner for Data
§ SAS Enterprise Miner
§ ...
Data Mining – Myth and Reality (1)
§ Arno Penzias, Nobel laureate and former chief of Bell Labs (January 1999, in an interview with ComputerWorld):
  "Data Mining will become much more important. Your bank will know everything you've bought. Companies will throw away nothing they know about their customers, because it will be so valuable. If you're not doing this, you're out of business."
Data Mining – Myth and Reality (2)
§ Gartner Group Hype Cycle for BI (December 2002)
Data Mining – Myth and Reality (3)
§ My assessment:
  – Data mining is exciting.
  – Data mining is difficult: "You need a PhD in statistics to do data mining"
    – Understand the business problem and the data.
    – Map the business problem to a data mining problem.
    – Prepare the data and run the mining techniques.
    – Evaluate the results.
  – The results are in most cases not spectacular, but they are valuable.
    – Where are the nuggets?
    – Most of the results are known already.
  – Deployment of the mining models in operational processes is difficult.
  – Privacy is an issue.
What can be done?
§ Standardization
§ Closing the loop – making model deployment easier
§ Hiding the complexity of data mining through data mining solutions
§ Integrated BI Platforms instead of single data mining workbenches
Standardization
§ Is a sign of the maturity of a field.
§ Makes the field more interesting for those beyond the "early adopters" in the technology adoption cycle (G.A. Moore, Crossing the Chasm)
  – Innovators, Early Adopters, Early Majority, Late Majority, Laggards
§ Is the basis for further progress of the field
  – Enables reuse of third party components.
  – Facilitates the development of mining solutions.
PMML standard for mining models
§ PMML – "Predictive Model Markup Language"
  – Driven by the Data Mining Group (www.dmg.org)
  – Supported by almost all major data mining vendors
  – Allows the interchange of models
    – for deployment
    – for visualization
  – Based on XML
Other standards
§ SQL/MM Data Mining
  – SQL extension for data mining
  – Oracle, IBM
§ JSR 73
  – Java standard for data mining
  – Oracle, SAS, SPSS, SAP, IBM, KXEN, ...
Closing the loop – making model deployment easier
§ Model deployment
  – Use of mining models in operational applications
    – For instance, campaign management: selection of the target group according to the scores of a prediction model
§ Specific components for model deployment
  – IBM DB2 Intelligent Miner Scoring
    – allows models to be applied inside the database

    INSERT INTO IDMMX.ClusterModels VALUES
      ('DemoBanking', IDMMX.DM_impClusFile('/tmp/demoBanking.pmml'));

    SELECT d.name, d.age,
           IDMMX.DM_getClusterId(
             IDMMX.DM_applyClusModel(cm.model,
               IDMMX.DM_applData(
                 IDMMX.DM_applData('age', d.age),
                 'salary', d.salary)))
    FROM ClusterModels cm, MyData d
    WHERE cm.modelname = 'DemoBanking';
Building Mining Solutions
§ Automated mining solutions
  – Build applications that hide the complexity of data mining
  – The users do not have data mining skills
  – They only have to be able to understand the results
Example: Detection of Insider Trading in Stock Transactions
§ German “Bundesanstalt für Finanzdienstleistungsaufsicht” (BaFin)
§ Mining Solution:
  – The BaFin analysts have to enter two parameters only:
    – the stock id
    – date and time of the Ad hoc announcement
  – No data mining skills are required for the BaFin analysts.
  – The pre-processing, mining and post-processing steps are executed automatically.
  – Based on the mining results, scores for each transaction are computed. They can be interpreted as a measure of how untypical the transactions are.
  – The BaFin analysts can inspect the stock transactions with these scores with a front-end tool of their choice (like Business Objects).
BaFin Mining Solution
§ BaFin – Bundesanstalt für Finanzdienstleistungsaufsicht (Federal Financial Supervisory Authority)
§ One of its tasks: Detection of insider trading
  – Every stock transaction is reported to the BaFin
  – The triggers for insider investigations are "Ad hoc Publications" or other important events
  – "Ad hoc Publications" are statements of companies listed at the stock exchange that may have an influence on the stock value, e.g. quarterly reports, earnings warnings, mergers with other companies, change of CEO
§ Some figures:
  – 400 000 stocks and derivatives
  – 5400 companies listed at German stock exchanges
  – 5600 Ad hoc publications in 2000
  – 525 million stock transactions in 2000
§ Challenge:
  – How can we efficiently and effectively detect information relevant to insider trading in this huge amount of data?
Adhoc Mining Scenario
§ Mining scenario that has been developed by IBM in a pilot project:
  – It consists of a sequence of
    – preprocessing steps (discretization, removal of outliers, pivotization, ...)
    – mining steps (associations, clustering)
    – post-processing steps
  – The goal is to find transactions that are untypical
§ The pilot project was successful:
  – Insider transactions hidden in the data have been found
§ However, the scenario was too complex to be applied regularly by the BaFin analysts.
Adhoc Mining Web Application
§ The BaFin analysts have to enter only a few parameters in a browser interface:
  – the stock id
  – date and time of the Ad hoc publication
  – ...
§ No data mining skills are required for the BaFin analysts.
§ The pre-processing, mining and post-processing steps are executed automatically.
§ Based on the mining results, scores for each transaction are computed. They can be interpreted as a measure of how untypical the transactions are.
§ The BaFin analysts can inspect the stock transactions with these scores in the browser interface as well.
§ This scenario has been extended to find untypical trading behavior of brokers, banks and stock owners.
The Adhoc Browser Interface
The result page
Generalization of such a solution
§ For every customer there are only a few relevant business problems to be solved with data mining
  – Each of these business problems is handled in slight variations
§ Examples:
  – Detection of Insider Trading
    – Variation: stock
  – Market Basket Analysis
    – Variations: stores, time period
What we need is one solution for one business problem that covers all variations
Intelligent Miner Solution Framework
§ Framework for generating automated mining web applications based on Intelligent Miner
§ Characteristics of such applications
  – No mining skills are required for the user.
  – The user enters only a few parameters
  – The mining requests are executed automatically
  – Results are stored in database tables and can be inspected with the browser interface.
  – The "mining knowledge" is contained in the mining scenarios
  – A mining expert is needed only for the development and maintenance of these mining scenarios
Business Intelligence Platforms
§ BI Platforms – integrated products that contain the major components of a BI solution
  – Database
  – ETL operations (Extract/Transform/Load)
  – OLAP
  – Data Mining
§ Offered by major database vendors
  – Microsoft
  – Oracle
  – IBM
§ SAP BW
Summary
§ Knowledge Discovery and Data Mining has achieved a certain maturity.
§ It is still very popular in the academic world.
§ The software market for data mining is consolidating.
§ The major challenges still persist.
Thank you for your attention!