Data Mining:
Knowledge Discovery in Databases
Peter van der Putten
ALP Group, LIACS
Pre-University College
LAPP-Top Computer Science
February 2005
Data mining case study: Credit Scoring for Loan Acceptance
© Chordiant Software
[Figure: accepted volume out of all applications. Expert knowledge: 29.8% accepted, 12.7% infection. Prediction model plus rules: 34.5% accepted, 9.1% infection.]
Data mining case study: Classifying Leukemia
• Problem:
  – Leukemia (different types of leukemia cells look very similar)
  – Given data for a number of samples (patients), can we
    • Accurately diagnose the disease?
    • Predict outcome for given treatment?
    • Recommend best treatment?
• Solution
  – Data mining on micro-array data
Data mining case study: Classifying Leukemia
• 38 training patients, 34 test patients, ~7,000 patient attributes (microarray gene data)
• 2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
• Use train data to build diagnostic model
• Results on test data: 33/34 correct, 1 error may be mislabeled
5 million terabytes created in 2002
• UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002.
• Twice as much information was created in 2002 as in 1999 (~30% growth rate)
• Other growth rate estimates even higher
• Very little data will ever be looked at by a human
• Knowledge Discovery is NEEDED to make sense and use of data.
Sources of (artificial) intelligence
• Reasoning versus learning
• Learning from data
  – Patient data
  – Customer records
  – Stock prices
  – Piano music
  – Criminal mugshots
  – Websites
  – Robot perceptions
  – Etc.
Some working definitions….
• ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably
• Data mining =
  – The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data
• Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ….
Some working definitions….
• Concepts: kinds of things that can be learned
  – Aim: intelligible and operational concept description
  – Example: the relation between patient characteristics and the probability of being diabetic
• Instances: the individual, independent examples of a concept
  – Example: a patient, candidate drug etc.
• Attributes: measuring aspects of an instance
  – Example: age, weight, lab tests, microarray data etc.
• Pattern or attribute space
Data mining tasks
• Predictive data mining
  – Classification: classify an instance into a category
  – Regression: estimate some continuous value
• Descriptive data mining
  – Matching & search: finding instances similar to x
  – Clustering: discovering groups of similar instances
  – Association rule extraction: if a & b then c
  – Summarization: summarizing group descriptions
  – Link detection: finding relationships
  – …
Data Mining Tasks: Search
[Figure: instances as points in pattern space; axes: e.g. age, e.g. weight]
Finding best matching instances
Every instance is a point in pattern space. Attributes are the dimensions of an instance, e.g. age, weight, gender etc.
Pattern spaces may be high dimensional (10 to thousands of dimensions)
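Finding the best matching instances can be sketched as a distance computation in pattern space. The example below is a minimal illustration, assuming Euclidean distance as the similarity measure (the slides do not name one); the function names and the patient data are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two instances given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matches(instances, query, k=1):
    """Return the k instances closest to the query, nearest first."""
    return sorted(instances, key=lambda inst: distance(inst, query))[:k]

# Hypothetical instances: (age in years, weight in kg)
patients = [(25, 60), (40, 82), (67, 90), (70, 64)]
matches = best_matches(patients, (42, 80), k=2)
print(matches)
```

With attributes on very different scales, one attribute can dominate the distance, so in practice the attributes are often rescaled first.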
Data Mining Tasks: Classification
[Figure: instances of two classes in pattern space; axes: age, weight]
The goal of a classifier is to separate classes on the basis of known attributes
The classifier can then be applied to an instance with unknown class
For instance, classes are healthy (circle) and sick (square); attributes are age and weight
Data Mining Tasks: Clustering
[Figure: groups of instances in pattern space; axes: e.g. age, e.g. weight]
Clustering is the discovery of groups in a set of instances
Groups are different, instances in a group are similar
In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user
In >3 dimensions this is not possible
Examples of Classification Techniques
• Majority class vote
• Machine learning & AI
• Decision trees
• Nearest neighbor
• Neural networks
• Genetic algorithms / evolutionary computing
• Artificial Immune Systems
• Good old statistics
• …..
Example Classification Algorithm 1: Decision Trees
[Figure: decision tree]
20000 patients: age > 67?
  – no: 18800 patients: gender = male? (etc.)
  – yes: 1200 patients: weight > 85kg?
    – no: 800 patients, diabetic (10%)
    – yes: 400 patients, diabetic (50%)
Decision Trees in Pattern Space
[Figure: decision tree splits in pattern space; axes: age, weight]
The goal of a classifier is to separate classes (circle, square) on the basis of the attributes age and weight
Each line corresponds to a split in the tree
Decision areas are ‘tiles’ in pattern space
Special Cases of Decision Trees
• Depth = 0
  – Majority class classifier (ZeroR)
• Depth = 1
  – One question only
  – Also known as decision stump
• Depth = n
  – Any number of branches
• Various algorithms exist to learn the tree from data
  – Major difference is the criterion used to determine which attribute value to split on
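The two smallest special cases can be sketched in a few lines. This is a minimal illustration, not a specific published algorithm: the split criterion used here (fewest misclassified training instances) is an assumption chosen for simplicity, whereas common tree learners use criteria such as information gain or the Gini index. The function names and the data are hypothetical.

```python
from collections import Counter

def zero_r(labels):
    """Depth 0: always predict the majority class."""
    return Counter(labels).most_common(1)[0][0]

def errors(labels):
    """Number of instances that are not in the majority class."""
    return len(labels) - Counter(labels).most_common(1)[0][1]

def learn_stump(rows, labels):
    """Depth 1: try every (attribute, threshold) split and keep the one
    with the fewest misclassified instances (assumed criterion).
    Returns (attribute index, threshold, class if <=, class if >)."""
    best = None
    for attr in range(len(rows[0])):
        for threshold in sorted({row[attr] for row in rows}):
            left = [l for r, l in zip(rows, labels) if r[attr] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[attr] > threshold]
            if not left or not right:
                continue  # not a real split
            err = errors(left) + errors(right)
            if best is None or err < best[0]:
                best = (err, attr, threshold, zero_r(left), zero_r(right))
    return None if best is None else best[1:]

# Hypothetical instances: (age in years, weight in kg)
rows = [(30, 70), (45, 95), (70, 90), (72, 60), (68, 100), (50, 65), (25, 80)]
labels = ["healthy", "sick", "sick", "healthy", "sick", "healthy", "healthy"]

majority = zero_r(labels)
stump = learn_stump(rows, labels)
print(majority, stump)
```

A deeper tree would apply the same search again to the instances on each side of the chosen split.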
Example classification algorithm 2: Nearest Neighbour
• Data itself is the classification model, so no abstraction like a tree etc.
• For a given instance x, search for the k instances that are most similar to x
• Classify x as the most frequent class among the k most similar instances
[Figure: the new instance is marked in pattern space]
Any decision area possible
Condition: enough data available
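The procedure above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance as the similarity measure and a plain majority vote; the function name and the training data are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(training, x, k=3):
    """Classify x by majority vote among its k nearest training instances.
    training is a list of (attributes, label) pairs; x is a tuple of attributes."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: ((age in years, weight in kg), class)
training = [
    ((30, 70), "healthy"),
    ((25, 80), "healthy"),
    ((50, 65), "healthy"),
    ((45, 95), "sick"),
    ((70, 90), "sick"),
    ((68, 100), "sick"),
]

print(knn_classify(training, (60, 92)))
print(knn_classify(training, (28, 72)))
```

Note that there is no training step: the stored data is the model, which is why enough data must be available.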
Nearest Neighbor in Pattern Space
[Figure: classification (prediction) in pattern space; axes: e.g. age, e.g. weight]
Example classification algorithm 3: Neural Networks
• Inspired by neuronal computation in the brain (McCulloch & Pitts 1943 (!))
• Input (attributes) is coded as activation on the input layer neurons; this activation feeds forward through a network of weighted links between neurons and causes activations on the output neurons (for instance diabetic yes/no)
• Algorithm learns to find optimal weights using the training instances and a general learning rule.
[Figure: network with input (e.g. customer characteristics) and output (e.g. response)]
• Example simple network (2 layers)
• Probability of being diabetic = f(age * weight_age + body_mass_index * weight_body_mass_index)
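The simple network can be sketched as below. This is a minimal illustration under stated assumptions: the slides do not say what f is, so a logistic (sigmoid) function is assumed, and the learning rule shown is a simple error-driven weight update standing in for the "general learning rule". The function names, learning rate and input values are hypothetical.

```python
import math

def f(z):
    """Logistic (sigmoid) function; an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(age, bmi, weight_age, weight_bmi):
    """Probability of being diabetic for the two-layer network."""
    return f(age * weight_age + bmi * weight_bmi)

def update(age, bmi, target, weight_age, weight_bmi, rate=0.001):
    """One learning step: move each weight in proportion to the prediction
    error and to the input on that link (assumed learning rule)."""
    error = target - predict(age, bmi, weight_age, weight_bmi)
    return weight_age + rate * error * age, weight_bmi + rate * error * bmi

# Start with zero weights and show one diabetic patient (target = 1).
weight_age, weight_bmi = 0.0, 0.0
before = predict(50, 30, weight_age, weight_bmi)
weight_age, weight_bmi = update(50, 30, 1, weight_age, weight_bmi)
after = predict(50, 30, weight_age, weight_bmi)
print(before, after)
```

After the update, the predicted probability for that patient moves toward the target; training repeats such steps over all training instances.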
Neural Networks
[Figure: simple network. The inputs age and body_mass_index are connected through weight_age and weight_body_mass_index to the output, the probability of being diabetic.]
Neural Networks in Pattern Space
Classification
[Figure: classification boundaries in pattern space; axes: e.g. age, e.g. weight]
Simple network: only a line available (why?) to separate classes
Multilayer network:
Any classification boundary possible
Descriptive data mining: association rules
• Discovery of interesting patterns
• Rule format: if A (and B and C etc.) then Z
• Example:
  – If customer buys potatoes (A) and sauerkraut (B) then customer buys sausage (Z)
• Quality measures for a rule
  – Support of the condition: how often do potatoes and sauerkraut occur together (A, B)
  – Confidence of the rule: how often sausages then occur as well (A, B, Z), divided by the support of the condition (is "if A and B then Z" always true?)
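The two quality measures can be computed directly from a list of transactions. The sketch below is a minimal illustration, assuming support is expressed as a fraction of all transactions (it could equally be a raw count); the function names and the shopping baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain all of the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, condition, consequence):
    """Support of condition and consequence together, divided by the
    support of the condition. Assumes the condition occurs at least once."""
    both = set(condition) | set(consequence)
    return support(transactions, both) / support(transactions, condition)

# Hypothetical shopping baskets
baskets = [
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut"},
    {"potatoes", "bread"},
    {"sausage", "bread"},
]

s = support(baskets, {"potatoes", "sauerkraut"})
c = confidence(baskets, {"potatoes", "sauerkraut"}, {"sausage"})
print(s, c)
```

Here potatoes and sauerkraut occur together in 3 of 5 baskets, and sausage is also present in 2 of those 3, so the rule holds in about two thirds of the cases where its condition applies.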
Some examples of my research areas (Jointly with students)
• Mix between applications and new algorithms
  – Video mining: recognize settings, porn filtering
  – Artificial Immune Systems: copying learning ability of immune systems
  – Predicting Survival Rate for Throat Cancer Patients
  – Crime Data Mining
  – Fusing Data from Multiple Sources
  – Decisioning: offering the right product to the right customer using predictions
  – Bias variance evaluation: distinguish between different sources of error for a classifier