knowledge discovery and data mining 1 (vo)...

65
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Knowledge Discovery and Data Mining 1 (VO) (706.701) Denis Helic ISDS, TU Graz October 4, 2018 Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 1 / 62

Upload: others

Post on 25-May-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Knowledge Discovery and Data Mining 1 (VO)(706.701)

Denis Helic

ISDS, TU Graz

October 4, 2018

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 1 / 62

Page 2: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Lecturer

Name: Denis HelicOffice: ISDS, Petersgasse 116, Room 026

Office hours: Tuesday 12:00-13:00Phone: +43-316/873-30610email: [email protected]

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 2 / 62

Page 3: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Lecturer

Name: Roman KernOffice: Know-Center, Inffeldgasse 13, 6th Floor, Room 072

Office hours: By appointmentPhone: +43-316/873-30860email: [email protected]

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 3 / 62

Page 4: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Language

Lectures in EnglishCommunication in German/EnglishExamination: German/English

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 4 / 62

Page 5: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Outline

1 Welcome and Introduction

2 Course Organization

3 Motivation

4 Course Overview

5 Course Highlights

6 Practical Part: KDDM1 KU

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 5 / 62

Page 6: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Welcome and Introduction

Teaching Data Science @ ISDS

Introduction to Knowledge Technologies (new title: Introduction toAI & Data Science)Data Analysis Courses:

Knowledge Discovery and Data Mining 1 (Basics and theory)Knowledge Discovery and Data Mining 2 (Applications)Visual Analytics

Analysis of Web Systems & Data:Web Science (Basics)Network Science (Theory and applications)Complexity Science

Infrastructure:Databases (new Prof. and title: Data Management)New course: Architecture of Database SystemsNew course: Architecture of Large-Scale Machine Learning Systems

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 6 / 62

Page 7: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Welcome and Introduction

Course context

Knowledge Discovery and Data Mining 1 (VO) (706.701)Obligatory course Master Software Development and Business (1stSemester)Obligatory elective course in subject catalog “KnowledgeTechnologies” (Computer Science)

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 7 / 62

Page 8: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Welcome and Introduction

Course context

Knowledge Discovery and Data Mining 1 (KU) (706.702)Free course (but submitted for inclusion in the catalogs)You will get 1 ECTS now, 1.5 ECTS when it is included in the catalogAn add-on for the theoretical partHighly suggested

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 8 / 62

Page 9: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Welcome and Introduction

Goals of the course

The overall goal of KDDM and related courses is to learn how todiscover patterns and models in data. We aim to discover patternsthat are:

i Valid: hold for new data with high probabilityii Useful: we can base further actions on themiii Unexpected: non-obviousiv Understandable: humans can interpret them

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 9 / 62

Page 10: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Welcome and Introduction

Goals of the course: patterns example

1854 Broad Street cholera outbreakExtracting clusters of cholera outbreak in the city of London in 1854. Thecases clusterd around some intersections of roads in London. These hadcontaminated water wells.

http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 10 / 62

Page 11: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Welcome and Introduction

Goals of the course: patterns example

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 11 / 62

Page 12: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Welcome and Introduction

Goals of the course

Specific goals of this course are to learn about two out of three basicelements of that discovery:

i Tools: advanced mathematical tools from probability theory,linear algebra, information theory, and statistical inference

ii Infrastructure: models of computation for large data (handled in othercourses)

iii Process: steps that are needed to discover patternsI assume here that you already know

i How to program and develop softwareii Mathematical basics from probability theory and linear algebra

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 12 / 62

Page 13: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Welcome and Introduction

Goals of the course

Student goals: to pass the examinationBonus goal for all: to have fun!

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 13 / 62

Page 14: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Course Calendar

04.10.2018: Course organization, Introduction and Motivation (Denis)11.10.2018: Preprocessing (Roman)18.10.2018: Feature Extraction (Roman)25.10.2018: Feature Engineering (Roman)

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 14 / 62

Page 15: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Course Calendar

30.10.2018: Data Matrices (Denis)08.11.2018: Review of Linear Algebra (Denis)15.11.2018: Partial Exam 1 / Project presentations (KU)22.11.2018: Principal Component Analysis (Denis)29.11.2018: Singular Value Decomposition (Denis)

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 15 / 62

Page 16: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Course Calendar

06.12.2018: Recommender Systems: Matrix Factorization (Denis)13.12.2018: Topic Modeling and Non-negative Matrix Factorization(Denis)10.01.2019: Classification (Denis)17.01.2019: Clustering (Roman)24.01.2019: Evaluation (Denis)31.01.2019: Partial Exam 2 / Project presentations (KU) / FinalExamination

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 16 / 62

Page 17: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Course Logistics

Course website:http://kti.tugraz.at/staff/denis/courses/kddm1Slides are (will be) available on the course websiteAdditional readings, references, links, etc. also on the websiteWe expect that you have basic knowledge in probability theory andlinear algebraTo freshen the knowledge you should solve these problems!This problem is not graded!As a side note: we also expect that you know how to program(relevant for the practical part)

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 17 / 62

Page 18: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Grading

Two partial examinationsFinal examination in JanuaryTwo additional examination dates in winter semesterThree examination dates in summer semesterExamination material: lectures/slides/further readingsIn class we will discuss sample examination questions

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 18 / 62

Page 19: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Partial Examinations

2 written examinationsIn the beginning of a lecture: max 45 minutesEach partial examination 2 questionsDifficulty adjusted to solve both problems in approx. 30 minutesMax 20 points for each questionTotal points: 80

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 19 / 62

Page 20: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Examination

Written examination 90 minutes4 questionsDifficulty adjusted to solve all four in approx. 60 minutesMax 20 points for each questionTotal points: 80

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 20 / 62

Page 21: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Important!

If you take partial examinations it counts as one examination attemptIn other words: if you are negative at partial examinations you will getthe negative grade in the TUGOnline

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 21 / 62

Page 22: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Grading

0-40 points: 541-50 points: 451-60 points: 361-70 points: 271-80 points: 1

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 22 / 62

Page 23: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Important!

If you take partial examinations it counts as one examinationattemptIn other words: if you are negative at partial examinations you will getthe negative grade in the TUGOnlineBad case scenario I: if you get 0 points at the first partialexamination you are negative!In that case you can take final examination in JanuaryAll together this will count as two attempts

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 23 / 62

Page 24: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Important!

Bad case scenario II: if you are negative after the second partialexamination you need to take the exam in summer termAll together this will count as two attemptsTherefore: take partial examination only if you follow the lecture andlearn in parallelThat is also the advantage: you will learn as you go and noteverything at once

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 24 / 62

Page 25: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

KU Organization

KU organization today after this lectureTwo presentations: after partial examinations

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 25 / 62

Page 26: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Organization

Questions?

Raise them now (+1 +1)Ask after the lecture (+1)Visit me in the office hours (+1)Send me an e-mail (±1)As a side note: you should(!) interrupt me immediately (+1 +1 +1)and ask any question you might have during the lecture

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 26 / 62

Page 27: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Motivation

How much information is being produced?

Study at Berkley: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/The World Wide Web contains about 170 terabytes of information onits surfaceThis is seventeen times the size of the Library of Congress printcollectionsInstant messaging generates five billion messages a dayThis is 274 terabytes a year

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 27 / 62

Page 28: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Motivation

How much information is being produced?

Email generates about 400,000 terabytes of new information eachyear worldwideP2P file exchange on the Internet is growing rapidlyThat was 2003, what do we have today?In 1993: 100.000G transferred over the Internet per year

In 2008: 100.000G transferred over the Internet in a second

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 28 / 62

Page 29: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Motivation

How much information is being produced?

Email generates about 400,000 terabytes of new information eachyear worldwideP2P file exchange on the Internet is growing rapidlyThat was 2003, what do we have today?In 1993: 100.000G transferred over the Internet per yearIn 2008: 100.000G transferred over the Internet in a second

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 28 / 62

Page 30: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Motivation

How much information is being produced?

Data, data everywhere:http://www.economist.com/node/15557443

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 29 / 62

Page 31: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Motivation

How much information is being produced?

We are producing more data than we are able to storeWe need to extract and describe useful dataUseful data ≪ all dataWe can store useful dataWe can also try to predict future data: store only prediction model

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 30 / 62

Page 32: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Examples

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 31 / 62

Page 33: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Examples

Amazon product recommendationsi Valid: holds for new data with high probabilityii Useful: users can find and explore new productsiii Unexpected: non-obvious and non-trivialiv Understandable: related articles, etc.

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 32 / 62

Page 34: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Examples

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 33 / 62

Page 35: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Examples

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 34 / 62

Page 36: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Examples

Twitter earthquake and typhoon predictioni Valid: holds for new data with high probabilityii Useful: can save livesiii Unexpected: non-obvious and non-trivialiv Understandable: trajectories of typhoons, positions of earthquakes

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 35 / 62

Page 37: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Knowledge discovery vs. data mining

Knowledge discovery refers to the entire process, of which knowledgeis the end-productIt is iterative and interactiveData mining refers to a specific step in this processIt is the step consisting of applying data analysis and discoveryalgorithms that produce a particular enumeration of patterns overdataAdditional steps are necessary to ensure that the process producesuseful knowledge

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 36 / 62

Page 38: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Steps in the knowledge discovery process

1 Developing an understanding of the application domain and therelevant prior knowledge and identifying the goal of the KDD processfrom the customers viewpoint

2 Creating a target data set: selecting a data set or focusing on asubset of variables or data samples on which discovery is to beperformed

3 Data cleaning and preprocessing: basic operations such as theremoval of noise. If appropriate collecting the necessary informationto model or account for noise, deciding on strategies for handlingmissing data fields, accounting for time sequence information andknown changes

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 37 / 62

Page 39: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Steps in the knowledge discovery process

4 Data reduction and projection: finding useful features to representthe data depending on the goal of the task. Using dimensionalityreduction or transformation methods to reduce the effective numberof variables under consideration or to find invariant representations forthe data

5 Matching the goals of the KDD process step to a particular datamining method e.g. summarization, classification, regression,clustering, etc

6 Choosing the data mining algorithms: selecting methods to be usedfor searching for patterns in the data. This includes deciding whichmodels and parameters may be appropriate e.g. models forcategorical data are different than models on vectors over the reals.Matching a particular data mining method with the overall criteria ofthe KDD process e.g. the enduser may be more interested inunderstanding the model than its predictive capabilities

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 38 / 62

Page 40: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Steps in the knowledge discovery process

7 Data mining searching for patterns of interest in a particularrepresentational form or a set of such representations, classificationrules or trees, regression, clustering and so forth. The user cansignificantly aid the data mining method by correctly performing thepreceding steps

8 Interpreting mined patterns: possibly return to any of the steps forfurther iteration. This step can also involve visualization of theextracted patterns, models or visualization of the data given theextracted models

9 Consolidating discovered knowledge: incorporating this knowledgeinto another system for further action or simply documenting it andreporting it to interested parties. This also includes checking for andresolving potential conflicts with previously believed or extractedknowledge

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 39 / 62

Page 41: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Steps in the knowledge discovery process

Reading!Knowledge Discovery and Data Mining: Towards a Unifying Framework(1996) Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 40 / 62

Page 42: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Steps in the knowledge discovery process

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 41 / 62

Page 43: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

1 Collect training data (e.g. crawl, clean, preprocess)2 Represent examples (e.g. decide which features, how to weight them,

etc.)3 Distance measure (e.g. what is close vs. what is not close)4 Measure the goodness (e.g. objective function)5 Select an approach (e.g. optimization method)

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 42 / 62

Page 44: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

Day Sunny Normal Humidity Strong Wind Temp. (°C) Play TennisDay1 Yes No Yes 12 NoDay2 No No No 18 NoDay3 Yes Yes No 21 YesDay4 Yes Yes No 28 YesDay5 Yes Yes No 19 ?

Table: Should I play tennis on Day5?

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 43 / 62

Page 45: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

DefinitionMinkowski distance of order 𝑝 between two vectors 𝐱 and 𝐲 from ℝ𝑛 isgiven by:

𝑑𝑝(𝑥, 𝑦) = (𝑛

∑𝑖=1

|𝑥𝑖 − 𝑦𝑖|𝑝)1/𝑝

(1)

What is 𝑑2(𝑥, 𝑦)?

Euclidean distanceWhat is 𝑑1(𝑥, 𝑦)? Manhattan distance

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 44 / 62

Page 46: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

DefinitionMinkowski distance of order 𝑝 between two vectors 𝐱 and 𝐲 from ℝ𝑛 isgiven by:

𝑑𝑝(𝑥, 𝑦) = (𝑛

∑𝑖=1

|𝑥𝑖 − 𝑦𝑖|𝑝)1/𝑝

(1)

What is 𝑑2(𝑥, 𝑦)? Euclidean distanceWhat is 𝑑1(𝑥, 𝑦)?

Manhattan distance

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 44 / 62

Page 47: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

DefinitionMinkowski distance of order 𝑝 between two vectors 𝐱 and 𝐲 from ℝ𝑛 isgiven by:

𝑑𝑝(𝑥, 𝑦) = (𝑛

∑𝑖=1

|𝑥𝑖 − 𝑦𝑖|𝑝)1/𝑝

(1)

What is 𝑑2(𝑥, 𝑦)? Euclidean distanceWhat is 𝑑1(𝑥, 𝑦)? Manhattan distance

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 44 / 62

Page 48: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

Day1 Day2 Day3 Day4 Day5Day1 0. 6.164414 9.11043358 16.0623784 7.14142843Day2 6.164414 0. 3.31662479 10.09950494 1.73205081Day3 9.11043358 3.31662479 0. 7. 2.Day4 16.0623784 10.09950494 7. 0. 9.Day5 7.14142843 1.73205081 2. 9. 0.

Table: Euclidean distances between days

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 45 / 62

Page 49: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

Day1 Day2 Day3 Day4 Day5Day1 0. 8. 11. 18. 9.Day2 8. 0. 5. 12. 3.Day3 11. 5. 0. 7. 2.Day4 18. 12. 7. 0. 9.Day5 9. 3. 2. 9. 0.

Table: Manhattan distances between days

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 46 / 62

Page 50: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

Day1 Day2 Day3 Day4 Day5Day1 0. 3.72174037 3.66840663 4.47507277 3.5008579Day2 3.72174037 0. 3.27940612 3.76436189 3.23329618Day3 3.66840663 3.27940612 0. 1.35622245 0.38749213Day4 4.47507277 3.76436189 1.35622245 0. 1.74371458Day5 3.5008579 3.23329618 0.38749213 1.74371458 0.

Table: Euclidean distances between days (standardized features)

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 47 / 62

Page 51: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

Day Sunny Normal Humidity Strong Wind Temp. (°C) Temp (°F) Play TennisDay1 Yes No Yes 12 53.6 NoDay2 No No No 18 64.4 NoDay3 Yes Yes No 21 69.8 YesDay4 Yes Yes No 28 82.4 YesDay5 Yes Yes No 19 66.2 ?

Table: Should I play tennis on Day5?

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 48 / 62

Page 52: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

Day1 Day2 Day3 Day4 Day5Day1 0. 18.8 27.2 46.8 21.6Day2 18.8 0. 10.4 30. 4.8Day3 27.2 10.4 0. 19.6 5.6Day4 46.8 30. 19.6 0. 25.2Day5 21.6 4.8 5.6 25.2 0.

Table: Manhattan distances between days using Fahrenheit and Celsius

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 49 / 62

Page 53: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Quick example of what it all means

Day Sunny Normal Humidity Strong Wind Temp. (°C) Temp (°F) Play TennisDay1 Yes No Yes 12 53.6 NoDay2 No No No 18 64.4 NoDay3 Yes Yes No 21 69.8 YesDay4 Yes Yes No 28 82.4 YesDay5 Yes Yes No 19 66.2 ?

Table: Should I play tennis on Day5?

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 50 / 62

Page 54: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Big picture: KDDM

Probability Theory Linear Algebra Map-Reduce

Mathematical Tools Infrastructure

Knowledge Discovery Process

Information Theory Statistical Inference

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 51 / 62

Page 55: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Teaching Data Science @ ISDS and KDDM1

KDDM1: Basics, theory and KDD process until step 8 (novisualization, no interpretation)KDDM2: Implementation and practice of the theory from KDDM1Visual Analytics: KDD process steps 8 and 9 (interpretation andvisualization)Network Science: graph and networks mining

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 52 / 62

Page 56: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Data mining and other fields

Data mining overlaps withi Databases: Large-scale data, simple queriesii Machine learning: Small data, complex models, model parametersiii Statistics: Theory, predictive models, no algorithms

Data mining: Algorithms, simple and predictive models, large-scaledata

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 53 / 62

Page 57: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Overview

Data mining and other fields

Statistics Machine Learning

Databases

Data Mining

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 54 / 62

Page 58: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Highlights

KDDM1: some thoughts

Big data, data science, …Many buzzwordsBut in the end:

i You need to know how to program and how to develop software systemsii You need to understand the math behind data analysis: linear algebra,

probability, statistics

If you have solid knowledge in (i) and (ii) you are in the top 5% ofdevelopers in the field ;)For the years to come you will be earning a lot of money

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 55 / 62

Page 59: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Course Highlights

KDDM1: some highlights

Recommender systems: decompose the matrix and find out what yourusers like :)PCA, SVD: reduce the dimensionality of the dataNNMF: analyze relationships between set of documents with linearalgebraTopic models: analyze relationships between set of documents withprobability theoryBayesian inference

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 56 / 62

Page 60: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Practical Part: KDDM1 KU

Practical part: KDDM1 KU

Why should I be interested?To consolidate and reinforce your (theoretical) knowledge with practicalhands-on experienceHelps a lot with the partial and final examinationsGood preparation for KDDM2If interested: possibility to develop a topic for project or MSc Thesis

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 57 / 62

Page 61: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Practical Part: KDDM1 KU

Practical part: KDDM1 KU

You have to:Form small groups of up to four studentsDecide on an interesting practical or research questionDecide which data you need for that questionCrawl/download the data and work on your projectGive two presentations (in English) on the progress and your finalresultsEngage with the class and discuss the results of other groups

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 58 / 62

Page 62: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Practical Part: KDDM1 KU

Practical part: KDDM1 KU

Possible topics:Reasons of success of crowd funding campaigns?Movie/Game/Music RecommenderSentiment analysis of user posts on Reddit/StackOverflow, etc.Controversial topics in social media (clustering, prediction, etc.)Hate speech in social media (clustering, prediction, etc.)

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 59 / 62

Page 63: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Practical Part: KDDM1 KU

Practical part: KDDM1 KU

Presentations take place on 15.11.2018 and 31.01.2019 after thepartial/final examinationsFor the first presentation prepare three slides (5 min strict):

First slide: Research/Practical QuestionSecond slide: DatasetThird slide: Experimental Design

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 60 / 62

Page 64: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Practical Part: KDDM1 KU

Practical part: KDDM1 KU

For the second presentation prepare five slides (10 min strict):First slide: MotivationSecond slide: MethodologyThird slide: Experimental SetupFourth slide: ResultsFifth slide: Discussion

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 61 / 62

Page 65: Knowledge Discovery and Data Mining 1 (VO) (706.701)kti.tugraz.at/staff/denis/courses/kddm1/intro.pdf · Knowledge discovery vs. data mining Knowledge discovery refers to the entire

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

...

.

Practical Part: KDDM1 KU

Practical part: KDDM1 KU

GradingPresentation (time restrictions will be taken into account)Results

Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 62 / 62