Christof Monz
Informatics Institute
University of Amsterdam
Data Mining
Week 1: Introduction
Today’s Class
Christof Monz, Data Mining - Week 1: Introduction
- Overview of Data Mining
- Overview of Machine Learning
- Course administrivia
What’s Data Mining?
- Data: Records, web pages, documents, etc.
- Mining: The process or business of extracting ore or minerals from the ground (The American Heritage)
- Data Mining: The nontrivial extraction of implicit, previously unknown, and potentially useful information from large amounts of data
Why Data Mining?
- There is an abundance of data resources: commercial databases, intranets, the Internet, ...
- These resources contain a large amount of valuable data
- The best way to structure the data depends on how one wants to exploit it
- Manual data organization is very laborious and expensive
- There is a need to automate this process
Some Application Areas
- Customer analysis (what impacts customer behavior?)
- Medical research (what is the impact of lifestyle/drug effects?)
- Insurance (risk assessment)
- Stock investment (which factors impact stock performance?)
- Fraud detection (when is a transaction likely to be fraudulent?)
The Need for Automated Analysis
- Much of the available data is never analyzed!
What is and isn’t Data Mining
- Looking up John Doe's phone number and address in an electronically available phone book (isn't Data Mining but database management)
- Inferring John Doe's phone number from an analysis of a number of web pages, although this information is not expressed explicitly (is Data Mining)
Situating Data Mining
- Data Mining lies at the intersection of a number of research areas
Data Mining Tasks
- Prediction
  • Use some variables to predict unknown or future values of other variables
- Description
  • Find human-interpretable patterns that describe the data
Some Data Mining Tasks
- Classification (Predictive)
- Clustering (Descriptive)
- Association Rule Discovery (Descriptive)
- Sequential Pattern Discovery (Descriptive)
- Regression (Predictive)
- Deviation Detection (Predictive)
Classification
- Given a collection of records (training set)
  • Each record contains a set of attributes; one of the attributes is the class
- Find a model for the class attribute as a function of the values of the other attributes
- Goal: previously unseen records should be assigned a class as accurately as possible
  • A test set is used to determine the accuracy of the model
Example: Direct Marketing
- Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product
- Approach:
  • Use the data for a similar product introduced before
  • We know which customers decided to buy and which decided otherwise. This buy/don't buy decision forms the class attribute
  • Collect various demographic, lifestyle, and company-interaction related information about all such customers (where they stay, how much they earn, ...)
  • Use this information as input attributes to learn a classifier model
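The direct-marketing setup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute names (`region`, `income`), and the helper functions are hypothetical, and the "model" is a deliberately simple one-attribute rule that predicts, for each value of one chosen attribute, the most frequent class seen in the training set. It is not the classifier used in the lecture.

```python
from collections import Counter, defaultdict

# Hypothetical customer records from an earlier product; "buys" is the class attribute.
train = [
    {"region": "city", "income": "high", "buys": "yes"},
    {"region": "rural", "income": "high", "buys": "yes"},
    {"region": "city", "income": "low", "buys": "no"},
    {"region": "rural", "income": "low", "buys": "no"},
    {"region": "city", "income": "high", "buys": "yes"},
]
# Hypothetical held-out records with known outcome, used only to measure accuracy.
test = [
    {"region": "rural", "income": "high", "buys": "yes"},
    {"region": "city", "income": "low", "buys": "yes"},
]

def fit_one_attribute_rule(records, attribute, class_attribute):
    """Map each value of `attribute` to the most frequent class among matching records."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[attribute]][record[class_attribute]] += 1
    # Fallback class for attribute values never seen in training.
    default = Counter(r[class_attribute] for r in records).most_common(1)[0][0]
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return rule, default

def predict(rule, default, record, attribute):
    return rule.get(record[attribute], default)

def accuracy(rule, default, records, attribute, class_attribute):
    correct = sum(predict(rule, default, r, attribute) == r[class_attribute] for r in records)
    return correct / len(records)

rule, default = fit_one_attribute_rule(train, "income", "buys")
test_accuracy = accuracy(rule, default, test, "income", "buys")
print(rule, test_accuracy)
```

The rule learned here never sees the test records, which is what makes the test accuracy an estimate of performance on previously unseen customers.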
Classify This!
Some Observations
- Training data (examples for which the class is known)
- Feature extraction (what are the ‘things’ that are relevant to predict a class?)
- Feature weight (how important is a feature?)
- Feature combination (sometimes features act together)
- Over-fitting (some features don't generalize well)
- Evaluation (how accurate is the prediction?)
Machine Learning
- The research area of machine learning investigates and formalizes the challenge of prediction and description by computer
- Machine learning plays a central role in data mining
- It is used for:
  • Building new models
  • Adapting existing models to new situations
  • Comparing the performance of competing models
Machine Learning is ...
- ... the principles, methods, and algorithms for learning and prediction on the basis of past experience
- ... already everywhere: speech recognition, hand-written character recognition, computer vision, information retrieval, operating systems, compilers, fraud detection, security, defense applications, ...
Learning
- Steps
  • entertain a (biased) set of possibilities
  • adjust predictions based on feedback
  • rethink the set of possibilities
- Principles of learning are ‘universal’
  • society (e.g., scientific community)
  • animal (e.g., human)
  • machine
Learning and Prediction
- We make predictions all the time but rarely investigate the processes underlying our predictions
- In carrying out scientific research we are also governed by how theories are evaluated
- To automate the process of making predictions we also need to understand how we search for and refine ‘theories’
Learning: Key Steps
- Data and assumptions
  • What data is available for the learning task?
  • What can we assume about the problem?
- Representation
  • How should we represent the examples to be classified?
- Evaluation and Estimation
  • How well are we doing?
  • How do we adjust our predictions based on the feedback?
  • Can we rethink the approach to do even better?
Example
- A classification problem: predict the grades for students taking this course
- Key Steps:
  1. data
  2. assumptions
  3. representation
  4. estimation
  5. evaluation
  6. model selection
Example
- Key Steps:
  1. data: what ‘past experience’ can we rely on?
  2. assumptions: what can we assume about the students or the course?
  3. representation: how do we ‘summarize’ a student?
  4. estimation: how do we construct a map from students to grades?
  5. evaluation: how well are we predicting?
  6. model selection: perhaps we can do even better?
Example: Data
- The data we have available (in principle):
  • Names and grades of students in past years' ML courses
  • Academic records of past and current students
- Training data:

  Student  ML  course 1  course 2  ...
  Peter    A   B         A         ...
  David    B   A         A         ...

- Test data:

  Student  ML  course 1  course 2  ...
  Jack     ?   C         A         ...
  Kate     ?   A         A         ...
Assumptions
- There are many assumptions we can make to facilitate predictions:
  • The course has remained roughly the same over the years
  • Each student performs independently of the others
Example: Representation
- Academic records are rather diverse so we might limit the summaries to a select few courses
- For example, we can summarize the i-th student (say David) with a vector: x_i = [B A A]
- The available data in this representation:

  Training               Testing
  Student  ML grade      Student  ML grade
  x_1      A             x'_1     ?
  x_2      B             x'_2     ?
  ...      ...           ...      ...
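As a rough sketch of this representation step, the snippet below turns letter grades into numeric vectors over a selected set of courses. The numeric encoding (A=4, B=3, ...), the third course, and the names `GRADE_POINTS` and `to_vector` are assumptions made for illustration; the lecture does not specify how grades are encoded.

```python
# Assumed numeric encoding of letter grades (not specified in the lecture).
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

# Hypothetical academic records: course name -> letter grade.
# "course 3" and its grades are made up for the example.
records = {
    "Peter": {"course 1": "B", "course 2": "A", "course 3": "C"},
    "David": {"course 1": "A", "course 2": "A", "course 3": "B"},
}

def to_vector(record, courses):
    """Summarize one student by the grades in the selected courses, in a fixed order."""
    return [GRADE_POINTS[record[course]] for course in courses]

# Limit the summary to a select few courses, as suggested above.
selected_courses = ["course 1", "course 2"]
vectors = {name: to_vector(rec, selected_courses) for name, rec in records.items()}
print(vectors)
```

Keeping the course order fixed matters: every student's vector must list the same courses in the same positions for the vectors to be comparable.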
Example: Estimation
- Given the training data

  Student  ML grade
  x_1      A
  x_2      B
  ...      ...

  find a mapping from input vectors x to ‘labels’ y encoding the grades for the ML course.
- Possible solution (nearest neighbor classifier):
  1. For any student x in the test set find the ‘closest’ student x_i in the training set
  2. Predict y_i as the grade of the closest student
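The two steps of the nearest neighbor classifier can be sketched as follows. The lecture leaves ‘closest’ undefined; this sketch assumes grades are encoded numerically (A=4, B=3, C=2) and uses Euclidean distance, both of which are illustrative choices. The function name `nearest_neighbor_predict` is hypothetical.

```python
import math

# Training set as (grade vector, ML grade) pairs, using the assumed encoding
# A=4, B=3, C=2 for two earlier courses.
training = [
    ([3.0, 4.0], "A"),  # Peter: course 1 = B, course 2 = A
    ([4.0, 4.0], "B"),  # David: course 1 = A, course 2 = A
]

def nearest_neighbor_predict(x, training_set):
    """Return the label y_i of the training vector x_i closest to x (Euclidean distance)."""
    _, closest_label = min(training_set, key=lambda pair: math.dist(x, pair[0]))
    return closest_label

jack = [2.0, 4.0]  # course 1 = C, course 2 = A
kate = [4.0, 4.0]  # course 1 = A, course 2 = A
predictions = {
    "Jack": nearest_neighbor_predict(jack, training),
    "Kate": nearest_neighbor_predict(kate, training),
}
print(predictions)
```

With only two training students the predictions are of course not meaningful; the point is the mechanics of finding the closest x_i and copying its label y_i.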
Example: Evaluation
- How can we tell how good our predictions are?
  • We can wait till the end of this course
  • We can try to assess the accuracy based on the data we already have (part of the training data)
- Possible solution:
  • Divide the training set further into training and test sets
  • Evaluate the classifier constructed on the basis of only the smaller training set on the new test set
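A minimal sketch of this hold-out evaluation, assuming a nearest neighbor classifier with Euclidean distance and a numeric grade encoding. The data, the 25% held-out fraction, the fixed random seed, and the function names are all hypothetical choices made for the example.

```python
import math
import random

def nearest_neighbor_predict(x, training_set):
    """Return the label of the training vector closest to x (Euclidean distance)."""
    _, label = min(training_set, key=lambda pair: math.dist(x, pair[0]))
    return label

def holdout_accuracy(labeled, held_out_fraction=0.25, seed=0):
    """Split the labeled data, classify the held-out part using only the rest, return accuracy."""
    data = list(labeled)
    random.Random(seed).shuffle(data)  # fixed seed so the split is reproducible
    n_held_out = max(1, int(len(data) * held_out_fraction))
    held_out, smaller_training = data[:n_held_out], data[n_held_out:]
    correct = sum(nearest_neighbor_predict(x, smaller_training) == y for x, y in held_out)
    return correct / len(held_out)

# Hypothetical labeled data: (grades in two earlier courses, ML grade).
labeled = [
    ([4.0, 4.0], "A"), ([4.0, 3.0], "A"), ([3.0, 4.0], "A"), ([4.0, 3.5], "A"),
    ([2.0, 2.0], "C"), ([2.0, 1.0], "C"), ([1.0, 2.0], "C"), ([2.0, 1.5], "C"),
]
acc = holdout_accuracy(labeled)
print(acc)
```

The held-out examples play the role of the new test set: they are never used as neighbors, so the accuracy reflects predictions on students the classifier has not seen.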
Example: Model Selection
- We can refine
  • the estimation algorithm (e.g., using a classifier other than the nearest neighbor classifier)
  • the representation (e.g., base the summaries on a different set of courses)
  • the assumptions (e.g., perhaps students work in groups)
  etc.
- We have to rely on the method of evaluating the accuracy of our predictions to select among the possible refinements
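One way to picture model selection is to compare candidate representations by their accuracy on held-out data and keep the best one. The sketch below does this for three hypothetical choices of courses, using a nearest neighbor classifier with Euclidean distance. The data is made up and constructed so that the first course is informative and the third is not; the candidate names and helper functions are assumptions for illustration.

```python
import math

def nearest_neighbor_predict(x, training_set):
    """Return the label of the training vector closest to x (Euclidean distance)."""
    _, label = min(training_set, key=lambda pair: math.dist(x, pair[0]))
    return label

def project(vector, course_indices):
    """Keep only the grades of the selected courses."""
    return [vector[i] for i in course_indices]

def validation_accuracy(course_indices, training_set, validation_set):
    """Accuracy on the validation set when students are summarized by the given courses."""
    projected = [(project(x, course_indices), y) for x, y in training_set]
    correct = sum(
        nearest_neighbor_predict(project(x, course_indices), projected) == y
        for x, y in validation_set
    )
    return correct / len(validation_set)

# Hypothetical data: (grades in three earlier courses, ML grade).
training_set = [
    ([4.0, 3.0, 1.0], "A"),
    ([4.0, 4.0, 4.0], "A"),
    ([2.0, 1.0, 4.0], "C"),
    ([2.0, 4.0, 1.0], "C"),
]
validation_set = [
    ([4.0, 4.0, 1.0], "A"),
    ([2.0, 3.0, 1.0], "C"),
    ([4.0, 3.0, 4.0], "A"),
    ([2.0, 4.0, 4.0], "C"),
]

# Candidate representations: which courses to base the summaries on.
candidates = {"course 0 only": [0], "course 2 only": [2], "all courses": [0, 1, 2]}
scores = {
    name: validation_accuracy(indices, training_set, validation_set)
    for name, indices in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

The same loop could compare different classifiers or assumptions instead of representations; in each case the selection relies entirely on the evaluation method.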
Types of Learning Approaches
- Supervised learning: where we get a set of training inputs and outputs
  • E.g., classification, regression
- Unsupervised learning: where we are interested in capturing inherent organization in the data
  • E.g., clustering, density estimation
- Reinforcement learning: where we only get feedback in the form of how well we are doing (not what we should be doing)
  • E.g., planning
Challenges of Data Mining
- Scalability
- Dimensionality/Complexity
- Data quality
- Data ownership
- Privacy considerations
- Continually updated data
Recap
- Difference between data mining and other research areas
- Applications of data mining
- Need for automation and the use of machine learning
- Key steps in machine learning
About This Course
- This course does not:
  • give a comprehensive introduction to data mining
  • cover how to adapt data mining to specific applications
  • cover feature extraction
  • cover evaluation issues in detail
- This course does:
  • focus on the predominant approach in data mining: machine learning
  • sketch some of the example applications
  • introduce a representative selection of machine learning techniques used in data mining
  • focus on the algorithmic fundamentals of machine learning
Approaches Covered
- Linear regression (regression)
- Decision Trees (classification)
- Neural Networks (classification)
- k-Nearest-Neighbors (classification)
- Naive Bayes (classification)
- K-Means (clustering)
- Hierarchical Clustering (clustering)
What to get out of this Course
- At the end of this course you will have learned:
  • what type of problems can be addressed by data mining techniques
  • what the most common machine learning approaches in data mining are
  • which machine learning approaches are appropriate for a given type of data mining application
  • the algorithmic fundamentals of a number of relevant machine learning approaches
Course Administrivia
- Exam counts for 40%, homework for 20%, and practical assignments for 40%
- Lectures are on Tuesday 9-11am (D1.116)
  Tutorials (werkcolleges) are on Thursday 9-11am (G0.05) and Friday 9-11am (G5.29)
  Labs are on Thursday 11am-1pm (G0.18) or Friday 11am-1pm (G0.18)
Course Administrivia
- Teaching assistants:
  Yijin He (email: [email protected]) (English only!)
  Spyros Martzoukos (email: [email protected]) (English only!)
- Course web page: on Blackboard
- Check course web page regularly for announcements, slides, ...