DAMI:
Introduction to Data Mining
Panagiotis Papapetrou, PhD
Associate Professor, Stockholm University
Adjunct Professor, Aalto University
Short Bio
§ BSc: University of Ioannina, Greece, 2003
§ PhD: Boston University, USA, 2009
§ 2009 - 2012: Postdoc, Data Mining Group, Aalto University, Finland
§ 2012 - 2013: Lecturer and Director of the ITApps Programme, Birkbeck, University of London, UK
§ September 2013: Senior Lecturer at DSV
Course logistics
• Course webpage:
– https://ilearn2.dsv.su.se/course/view.php?id=225
• Schedule:
– Lectures: Nov 4 - Dec 17
– Exercise sessions: Nov 17, Dec 1, Dec 15
– Written Exam: Jan 14
– Re-exam: Feb 23
• Instructors:
– Panagiotis Papapetrou: [email protected]
– Lars Asker: [email protected]
– Henrik Boström: [email protected]
• Course Assistant:
– Jing Zhao: [email protected]
• Office hours: by appointment only
ILearn2
Topics to be covered
• Association Rules
• Clustering
• Data Representation
• Classification
• Similarity Matching
• Model Evaluation
• Time Series Analysis
• Ranking
Syllabus
Nov 4: Introduction to data mining
Nov 5: Association Rules
Nov 10, 14: Clustering and Data Representation
Nov 17: Exercise session 1 (Homework 1 due)
Nov 19: Classification
Nov 24, 26: Similarity Matching and Model Evaluation
Dec 1: Exercise session 2 (Homework 2 due)
Dec 3: Combining Models
Dec 8, 10: Time Series Analysis
Dec 15: Exercise session 3 (Homework 3 due)
Dec 17: Ranking
Jan 13: Review
Jan 14: EXAM
Feb 23: Re-EXAM
Course workload
• Homeworks 3hp
• Written Exam 4.5hp
• Online quizzes
Homework Assignments
• Three assignments (30pts each, total 90pts)
• 3-5 online quizzes (total of 10 + 20pts)
• To be done individually
• Will involve some programming in R
• Three in-class exercise sessions
• Submissions:
– Before each exercise session
– No submissions allowed after that!
• Grade scheme: A-F
Quizzes
• 3-5 short online quizzes
• Material to be examined:
– The latest lecture
• Available at the end of each lecture, to be completed before the next lecture
• Only one attempt per quiz
• Will offer:
– 10pts towards the Homework Assignments
– 20pts BONUS towards the Homework Assignments
• No make-up quizzes are possible
To Pass the Course
• Pass the Homework Assignments
– at least 50/100 pts (including the BONUS pts from the quizzes)
• Pass the Written Exam
– at least 50/100 pts
• Ask questions
• Enjoy it :)
Learning Objectives
• Become familiar with fundamental data mining algorithms
• Be able to identify a correct algorithmic solution to a given data mining problem
• Be able to apply these algorithmic solutions to solve practical problems
• Be able to perform basic data mining tasks on real data using the R tool
Textbooks
Main:
• Data Mining: Practical Machine Learning Tools and Techniques, Third Edition. Morgan Kaufmann, 2011. ISBN: 978-0123748560
Additional:
• An Introduction to Statistical Learning with Applications in R. Springer, 2013. ISBN: 978-1-4614-7138-7. URL: http://www-bcf.usc.edu/~gareth/ISL/
• Research papers (pointers will be provided)
Recommended prerequisites
• Basic algorithms: sorting, set manipulation, hashing
• Analysis of algorithms: O-notation and its variants, NP-hardness
• Programming: some programming knowledge, ability to do small experiments reasonably quickly
• Probability: concepts of probability and conditional probability, expectations, random walks
• Some linear algebra: e.g., eigenvector and eigenvalue computations
Above all
• The goal of the course is to learn and enjoy
• The basic principle is to ask questions when you don't understand
• Say when things are unclear; not everything can be clear from the beginning
• Participate in the class as much as possible
Introduction to data mining
• Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science and mathematics
• Some (basic) data mining tasks
Why do we need data analysis?
• Really, really lots of raw data!
– Moore's law: more efficient processors, larger memories
– Communications have improved too
– Measurement technologies have improved dramatically
– It is possible to store and collect lots of raw data
– The data analysis methods are lagging behind
• Need to analyze the raw data to extract knowledge
The data is also very complex
• Multiple types of data: tables, time series, images, graphs, etc.
• Spatial and temporal aspects
• Large number of different variables
• Lots of observations → large datasets
Example: transaction data
• Billions of real-life customers:
– COOP, ICA
– Tele2
• Billions of online customers:
– amazon
– expedia
Example: document data
• Web as a document repository: 50 billion web pages
• Wikipedia: 4 million articles (and counting)
• Online collections of scientific articles
Example: network data
• Web: 50 billion pages linked via hyperlinks
• Facebook: 200 million users
• MySpace: 300 million users
• Instant messenger: 1 billion users
• Blogs: 250 million blogs worldwide
Example: genomic sequences
• http://www.1000genomes.org/page.php
• Full sequence of 1000 individuals
• 3 billion nucleotides per person
• Lots more data in fact: medical history of the persons, gene expression data
Example: environmental data
• Climate data (just an example): http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
• "a database of temperature, precipitation and pressure records managed by the National Climatic Data Center, Arizona State University and the Carbon Dioxide Information Analysis Center"
• "6000 temperature stations, 7500 precipitation stations, 2000 pressure stations"
We have large datasets… so what?
• Goal: obtain useful knowledge from large masses of data
• "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst"
• Tell me something interesting about the data; describe the data
What can data-mining methods do?
• Extract frequent patterns
– There are lots of documents that contain the phrases "Stockholm", "Housing" and "^#@$&^#$@"
• Extract association rules
– 80% of the ICA customers who buy beer and sausage also buy mustard
• Extract rules
– If occupation = PhD student, then Salary < 30,000 SEK
What can data-mining methods do?
• Rank web-query results
– What are the most relevant web pages for the query "Student housing Stockholm University"?
• Find good recommendations for users
– Recommend new books to amazon customers
– Recommend new friends/groups to facebook users
• Find groups of entities that are similar (clustering)
– Find groups of facebook users that have similar friends/interests
– Find groups of amazon users that buy similar products
– Find groups of ICA customers that buy similar products
Goal of this course
• Describe some problems that can be solved using data-mining methods
• Discuss the intuition behind data-mining methods that solve these problems
• Illustrate the theoretical underpinnings of these methods (this is very important!!)
• Show how these methods can be applied in real application scenarios (this is also very important!!)
Data mining and related areas
• How does data mining relate to machine learning?
• How does data mining relate to statistics?
• Other related areas?
Data mining vs. machine learning
• Machine learning methods are used for data mining
– Classification, clustering
• The amount of data makes the difference
– Data mining deals with much larger datasets, and scalability becomes an issue
• Data mining has more modest goals
– Automating tedious discovery tasks
– Helping users, not replacing them
Data mining vs. statistics
• "Tell me something interesting about this data" – what else is this than statistics?
– The goal is similar
– Different types of methods
– In data mining one investigates lots of possible hypotheses
– Data mining is more exploratory data analysis
– In data mining there are much larger datasets → algorithmics/scalability is an issue
Data mining and databases
• Ordinary database usage: deductive
• Knowledge discovery: inductive
• New requirements for database management systems
• Novel data structures, algorithms and architectures are needed
Machine learning
The machine learning area deals with artificial systems that are able to improve their performance with experience
Supervised learning (→ predictive data mining)
– Experience: objects that have been assigned class labels
– Performance: typically concerns the ability to classify new (previously unseen) objects
Unsupervised learning (→ descriptive data mining)
– Experience: objects for which no class labels have been given
– Performance: typically concerns the ability to output useful characterizations (or groupings) of objects
Examples of supervised learning
• Email classification (spam or not)
• Customer classification (will leave or not)
• Credit card transactions (fraud or not)
• Molecular properties (toxic or not)
Examples of unsupervised learning
• find useful email categories
• find interesting purchase patterns
• describe normal credit card transactions
• find groups of molecules with similar properties
Data mining: input
• Standard requirement: each case is represented by one row in one table
• Possible additional requirements:
– only numerical variables
– all variables have to be normalized
– only categorical variables
– no missing values
• Possible generalizations:
– multiple tables
– recursive data types (sequences, trees, etc.)
An example: email classification
Rows are examples (observations); columns are features (attributes).

| Ex. | All caps | No. excl. marks | Missing date | No. digits in From: | Image fraction | Spam |
|-----|----------|-----------------|--------------|---------------------|----------------|------|
| e1  | yes      | 0               | no           | 3                   | 0              | yes  |
| e2  | yes      | 3               | no           | 0                   | 0.2            | yes  |
| e3  | no       | 0               | no           | 0                   | 1              | no   |
| e4  | no       | 4               | yes          | 4                   | 0.5            | yes  |
| e5  | yes      | 0               | yes          | 2                   | 0              | no   |
| e6  | no       | 0               | no           | 0                   | 0              | no   |
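The table above maps directly onto code. Below is a minimal sketch (the hand-written rule is invented for illustration; it is not a rule from the slides) showing how one candidate classifier could be scored against these six labeled examples:

```python
# The six labeled emails from the table above.
# Columns: all_caps, excl_marks, missing_date, digits_in_from, image_fraction, spam
emails = {
    "e1": (True,  0, False, 3, 0.0, True),
    "e2": (True,  3, False, 0, 0.2, True),
    "e3": (False, 0, False, 0, 1.0, False),
    "e4": (False, 4, True,  4, 0.5, True),
    "e5": (True,  0, True,  2, 0.0, False),
    "e6": (False, 0, False, 0, 0.0, False),
}

def rule(all_caps, excl_marks, missing_date, digits_in_from, image_fraction):
    """A hypothetical hand-written rule: predict spam if the email has
    exclamation marks, or uses all caps with digits in the From: field."""
    return excl_marks > 0 or (all_caps and digits_in_from > 0)

# Score the rule: count how many of the six examples it labels correctly.
correct = sum(rule(*features) == label
              for *features, label in emails.values())
print(f"accuracy: {correct}/{len(emails)}")
```

Classifiers like the decision tree on the next slide are learned automatically; here the rule is fixed by hand just to show how accuracy on labeled examples is computed.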
Data mining: output
[Figure: an example decision tree over the email features, with leaves Spam = yes / Spam = no]
Data mining: output
• Interpretable representation of findings
– equations, rules, decision trees, clusters
• Examples:
– y = 0.5 + 24.5x1 − 2.2x2 + 3.1x3
– if x1 > 3.0 & x2 ≤ 1.8 then y = 1.0
– BuysMilk & BuysCereal → BuysJuices [Support: 0.05, Confidence: 0.85]
The Knowledge Discovery Process
Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Databases", AI Magazine 17(3): 37-54 (1996)
CRISP-DM: CRoss Industry Standard Process for Data Mining
Shearer C., "The CRISP-DM model: the new blueprint for data mining", Journal of Data Warehousing 5 (2000) 13-22 (see also www.crisp-dm.org)
CRISP-DM
• Business Understanding
– understand the project objectives and requirements from a business perspective
– convert this knowledge into a data mining problem definition
– create a preliminary plan to achieve the objectives
CRISP-DM
• Data Understanding
– initial data collection
– get familiar with the data
– identify data quality problems
– discover first insights
– detect interesting subsets
– form hypotheses for hidden information
CRISP-DM
• Data Preparation
– construct the final dataset to be fed into the machine learning algorithm
– tasks include table, record, and attribute selection, data transformation and cleaning
CRISP-DM
• Modeling
– various data mining techniques are selected and applied
– parameters are learned
– some methods may have specific requirements on the form of the input data
– going back to the data preparation phase may be needed
CRISP-DM
• Evaluation
– the current model should have high quality from a data mining perspective
– before final deployment, it is important to test whether the model achieves all business objectives
CRISP-DM
• Deployment
– just creating the model is not enough
– the new knowledge should be organized and presented in a usable way
– generate a report
– implement a repeatable data mining process for the user or the analyst
Tools
• Many data mining tools are freely available
• Some options are:

| Tool | URL |
|------|-----|
| WEKA | www.cs.waikato.ac.nz/ml/weka/ |
| Rule Discovery System | www.compumine.com |
| R | www.r-project.org/ |
| RapidMiner | rapid-i.com/ |

More options can be found at www.kdnuggets.com
Some simple data-analysis tasks
• Given a stream or set of numbers (identifiers, etc.)
• How many numbers are there?
• How many distinct numbers are there?
• What are the most frequent numbers?
• How many numbers appear at least K times?
• How many numbers appear only once?
• etc.
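With enough memory, all of these questions reduce to one frequency table; a short Python sketch over a made-up stream:

```python
from collections import Counter

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]   # a made-up stream of numbers
freq = Counter(stream)

total = len(stream)                        # how many numbers are there?
distinct = len(freq)                       # how many distinct numbers?
most_frequent = freq.most_common(1)        # the most frequent number (value, count)
at_least_k = sum(1 for c in freq.values() if c >= 2)   # appear at least K=2 times
singletons = sum(1 for c in freq.values() if c == 1)   # appear only once

print(total, distinct, most_frequent, at_least_k, singletons)
```

Note that this uses memory proportional to the number of distinct values; the next slides ask what can be done with only a few memory locations.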
Finding the majority element
• Given a stream of labeled elements, e.g.,
{C, B, C, C, A, C, C, A, B, C}
• Identify the majority element: the element that occurs more than 50% of the time
• How can you find it?
• … using no more than a few memory locations?
Counting sort
• Given a stream of labeled elements, e.g.,
{C, B, C, C, A, C, C, A, B, C}
• Count the number of objects that have each distinct key value
• Complexity: O(N + k)
– N: number of items
– k: range of items (largest - smallest)
• Wasteful for small N << k: the count array needs k entries
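A counting-sort sketch for the integer-key case (assuming keys in [0, k)); the count array of size k is exactly where the O(N + k) complexity and the N << k waste come from:

```python
def counting_sort(items, k):
    """Sort non-negative integers with keys in [0, k) in O(N + k) time
    by counting how many times each key value occurs."""
    counts = [0] * k          # one counter per possible key value
    for x in items:
        counts[x] += 1
    # Replay the counts in key order to produce the sorted output.
    out = []
    for value, c in enumerate(counts):
        out.extend([value] * c)
    return out

print(counting_sort([4, 1, 3, 1, 0, 4, 4], 5))
```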
Finding the majority element (Moore's Voting Algorithm)
• Complexity: O(N)
– N: number of items
• Can we do better?
– No! Unless we skip reading some items
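The slides do not include code for Moore's Voting Algorithm, so here is a standard sketch (the Boyer-Moore majority vote): one candidate and one counter, plus a second pass to verify that a majority actually exists:

```python
def majority_element(stream):
    """Boyer-Moore majority vote: one candidate and one counter,
    i.e. O(1) extra memory and O(N) time over the stream."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # A second pass confirms the candidate really exceeds 50%
    # (the first pass alone only finds the majority IF one exists).
    if sum(1 for x in stream if x == candidate) * 2 > len(stream):
        return candidate
    return None

print(majority_element(["C", "B", "C", "C", "A", "C", "C", "A", "B", "C"]))
```

On the slide's example stream, C occurs 6 times out of 10, so it is the majority element.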
The Set Cover Problem
• A trickier data mining task…
• A common algorithmic problem…
• One of the MOST USEFUL problems in CS!
The Set Cover Problem
• The mayor of a city wants to place fire stations so as to cover each neighborhood
• Each fire station covers:
– its own neighborhood
– all adjacent ones
• Each fire station costs X SEK per month
Challenge:
• Where shall we place the fire stations so as to minimize the city's expenses?
The Set Cover Problem
• A set of objects
• Some sets T that cover the objects
• Find a collection of the sets T that covers all objects!
• Find the smallest such collection!
Formal Definition
• Setting:
– a universe of N elements U = {U1, …, UN}
– a set of n sets T = {T1, …, Tn}
– find a collection C of sets in T (C ⊆ T) such that the union of the sets in C contains all elements of U
Formal Definition
• Set-cover problem: find the smallest collection C of sets from T such that all elements in the universe U are covered
• Solution?
Trivial algorithm
• Try all sub-collections of T
• Select the smallest one that covers all the elements in U
• The running time of the trivial algorithm is O(2^|T| · |U|)
• This is way too slow
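The trivial algorithm can be written directly; the instance below is made up for illustration. Trying sub-collections smallest-first means the first cover found is optimal, but the 2^|T| enumeration is what makes this hopeless at scale:

```python
from itertools import combinations

def brute_force_set_cover(universe, sets):
    """Try every sub-collection of `sets`, smallest first, and return
    the first one whose union covers the universe: O(2^|T| * |U|)."""
    for r in range(1, len(sets) + 1):
        for combo in combinations(sets, r):
            if set().union(*combo) >= universe:   # does the union cover U?
                return list(combo)
    return None  # no cover exists

# A made-up instance (not the one from the slides' figure).
U = {1, 2, 3, 4, 5}
T = [{1, 2}, {2, 3, 4}, {4, 5}, {1, 3, 5}]
print(brute_force_set_cover(U, T))
```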
Formal Definition
• Set-cover problem: find the smallest collection C of sets from T such that all elements in the universe U are covered
• The set cover problem is NP-hard
• Simple approximation algorithms with provable properties are available and very useful in practice
Greedy algorithm for set cover
• Select first the largest-cardinality set t from T
• Remove the elements of t from U
• Recompute the sizes of the remaining sets in T
• Go back to the first step
The Greedy algorithm
• X = U
• C = {}
• while X is not empty do
– for all t ∈ T, let a_t = |t ∩ X|
– let t be such that a_t is maximal
– C = C ∪ {t}
– X = X \ t
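The pseudocode above translates almost line for line into Python; the instance at the bottom is made up, since the slides' figure is not reproduced here:

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover: repeatedly pick the set covering the most
    still-uncovered elements (a_t = |t ∩ X| in the pseudocode)."""
    X = set(universe)   # uncovered elements
    C = []              # chosen sets
    while X:
        t = max(sets, key=lambda s: len(s & X))   # maximal a_t
        if not (t & X):
            raise ValueError("remaining elements cannot be covered")
        C.append(t)
        X -= t          # X = X \ t
    return C

# A made-up instance for illustration.
U = set(range(1, 8))
T = [{1, 2, 3, 4}, {4, 5, 6}, {6, 7}, {1, 5}, {2, 7}]
cover = greedy_set_cover(U, T)
print(cover)
```

Using `max` with `len(s & X)` implements the a_t computation; ties are broken by the order of the sets in the list.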
Recall… • We want to find a set of Ts such that we cover all the objects
• What would the greedy algorithm find?
Example
• Select the biggest set: T1
• Remove all elements covered by T1
Current solution: C = {T1}
Example
• Select the next biggest set: T4
• Remove all elements covered by T4
Current solution: C = {T1, T4}
Example
• Select the next biggest set: T5
• Remove all elements covered by T5
Current solution: C = {T1, T4, T5}
Example
• Select the next biggest set: T6
• Done!
Current solution: C = {T1, T4, T5, T6}
Example
• What is the optimal solution?
• Recall: we want the smallest possible collection!
Greedy solution: C = {T1, T4, T5, T6}
An optimal solution: C* = {T3, T4, T5}
How can this go wrong?
• No global consideration of how good or bad a selected set is going to be…
• How good is the proposed greedy algorithm?
NP-hardness: do your best, then.
Approximation Algorithms
Find an algorithm that will return solutions that are guaranteed to be close to an optimal solution.
Constant factor approximation algorithms:
SOL ≤ f · OPT, for some constant f
• OPT: value of an optimal solution
• SOL: value of the solution that our algorithm returns
Approximation Algorithms
• For an NP-hard problem, we cannot compute an optimal solution in polynomial time
• The key to designing a poly-time approximation algorithm is to obtain a good (lower or upper) bound on the optimal solution
• The general strategy (for an optimization problem) is:
– minimization: OPT ≤ SOL ≤ f · OPT, for some f > 1
– maximization: f · OPT ≤ SOL ≤ OPT, for some f < 1
How good is the greedy algorithm for the Set Cover Problem?
• Consider an instance I:
– let a(I) be the cost of the approximate solution
– let a*(I) be the cost of the optimal solution
– e.g., for set cover, a*(I) is the minimum number of sets in T that cover all elements in U
• An algorithm for a minimization problem has approximation factor f if for all instances I we have that a(I) ≤ f · a*(I)
How about the set cover greedy algorithm?
• The greedy algorithm for set cover has approximation factor f = H(|s_max|) ≈ ln |s_max|, where s_max is the largest set in T
• Proof: see CLRS, "Introduction to Algorithms"
• Set cover cannot be approximated with a factor better than O(log |s_max|)
• What does that mean?
Today…
• Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science and mathematics
• Some (basic) data mining prototype problems
Next time…
• Association rules

Market-Basket transactions:

| TID | Items |
|-----|-------|
| 1 | Bread, Milk |
| 2 | Bread, Diaper, Beer, Eggs |
| 3 | Milk, Diaper, Beer, Coke |
| 4 | Bread, Milk, Diaper, Beer |
| 5 | Bread, Milk, Diaper, Coke |

Examples of association rules:
{Diaper} → {Beer}, {Milk, Bread} → {Diaper, Coke}, {Beer, Bread} → {Milk}
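The [Support, Confidence] annotations seen earlier can be checked by hand on these five transactions; a short sketch using the standard definitions:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

# The rule {Diaper} -> {Beer} from the slide:
print(support({"Diaper", "Beer"}))       # support of the rule
print(confidence({"Diaper"}, {"Beer"}))  # confidence of the rule
```

For {Diaper} → {Beer}: diapers appear in 4 of 5 transactions and together with beer in 3 of 5, giving support 3/5 = 0.6 and confidence 3/4 = 0.75.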
TODOs
• Online R-tutorial: http://dist.stat.tamu.edu/pub/rvideos/
– Install R
– Learn how to load files
– Learn how to use the help command
– Learn how to install packages
– Learn how to print basic data statistics