CS513-Data Mining Lecture 2: Understanding the Data Waheed Noor Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan Waheed Noor (CS&IT, UoB, Quetta) CS513-Data Mining March 2016 1 / 32

CS513-Data Mining
Lecture 2: Understanding the Data

Waheed Noor

Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan

Waheed Noor (CS&IT, UoB, Quetta) CS513-Data Mining March 2016 1 / 32


Outline

1 Patterns
   Class Activity

2 Types of Learning

3 Model

4 Data Mining Algorithms

5 Understanding your Data: Input

6 Issues with Real World Data


Pattern Example

Example
Consider contact lens prescription data from an optician. The task is to prescribe soft, hard, or no contact lenses to a patient based on his/her information. We will analyze past data in order to find some patterns, if possible.


Contact Lens Data


Finding Patterns: Illustration

if tear production rate = reduced then
    recommendation = none
else if age = young and astigmatic = no then
    recommendation = soft
else
    recommendation = hard
end if
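The rule above can be written as a small function. This is a minimal Python sketch and not part of the original slides; the function name `recommend_lens` and the string values used for each attribute are assumptions chosen for illustration.

```python
def recommend_lens(age: str, astigmatic: str, tear_production_rate: str) -> str:
    """Apply the slide's if / else-if / else rule to one patient record."""
    if tear_production_rate == "reduced":
        return "none"
    elif age == "young" and astigmatic == "no":
        return "soft"
    else:
        return "hard"


# Example: a young, non-astigmatic patient with normal tear production.
print(recommend_lens(age="young", astigmatic="no", tear_production_rate="normal"))
```

Because the conditions are checked in order, a reduced tear production rate always wins, regardless of the other attributes.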


What we get

- These patterns may not be enough to generalize into a rule, since the example is a simple one and we do not have enough data (i.e., the data may be incomplete).
- We can say this pattern just summarizes the data.
- How many possible combinations of input values are there for extracting useful patterns? (3 × 2 × 2 × 2 = 24)
- In fact, a data mining task needs to generalize to new examples as well.
- Real-life data often contains examples in which the values of some features are noisy or missing, which can affect the performance of a data mining technique.
- Misclassification can even occur on the datasets that were used to train the method.
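The count 3 × 2 × 2 × 2 can be checked by enumerating every combination of attribute values. The sketch below is illustrative only: the attribute names and values are assumptions taken from the commonly used contact lens dataset, since the data table itself is not reproduced in this transcript.

```python
from itertools import product

# Assumed attribute values (not listed in the transcript).
ATTRIBUTE_VALUES = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "spectacle_prescription": ["myope", "hypermetrope"],
    "astigmatic": ["no", "yes"],
    "tear_production_rate": ["reduced", "normal"],
}

# Every possible input example is one element of the Cartesian product.
combinations = list(product(*ATTRIBUTE_VALUES.values()))
print(len(combinations))  # 3 * 2 * 2 * 2 = 24
```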


Weather Problem Example


Some Complexity: Numeric Attributes


Classification

- What we have seen so far are classification rules, i.e., rules that classify examples.
- We can also look for rules that associate the values of different attributes: association rules.

Example
if temperature = cool then humidity = normal
if humidity = normal and windy = false then play = yes
if windy = false and play = no then outlook = sunny and humidity = high

Can you identify one?
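One way to look for such rules is to count, for a candidate rule, how many examples satisfy its "if" part and what fraction of those also satisfy its "then" part. The sketch below is a hedged illustration: the data is assumed to be the standard 14-instance nominal weather dataset (the table is not reproduced in this transcript), and the helper names `matches` and `rule_stats` are hypothetical.

```python
# Assumed data: the standard 14-instance nominal weather dataset.
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in WEATHER]


def matches(row, conditions):
    """True if the row has the required value for every listed attribute."""
    return all(row[attr] == value for attr, value in conditions.items())


def rule_stats(rows, antecedent, consequent):
    """Return (number of rows matching the 'if' part,
    fraction of those rows that also match the 'then' part)."""
    covered = [row for row in rows if matches(row, antecedent)]
    if not covered:
        return 0, 0.0
    correct = sum(1 for row in covered if matches(row, consequent))
    return len(covered), correct / len(covered)


# if temperature = cool then humidity = normal
print(rule_stats(ROWS, {"temperature": "cool"}, {"humidity": "normal"}))
```

Unlike a classification rule, the "then" part here may involve any attribute, or several attributes at once.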


Rules

Definition (Rules)
A set of conditions/decisions that can be specifically and implicitly interpreted in some order. Rules are helpful tools for classifying examples and for finding associations among them. Examples include a decision list, which is interpreted in sequence, and a decision tree, which is interpreted hierarchically.

- Sometimes we may get a rule set that gives a unique prescription for every conceivable example, as in the examples above.
- However, this is generally not possible: there may be situations where no rule applies, or where more than one rule applies (i.e., a conflict arises, which is then resolved using probabilities or weights).
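The two problem cases, no applicable rule and more than one applicable rule, can be made concrete with a small decision-list interpreter. This is a sketch under stated assumptions: the rules, attribute names, and function names below are hypothetical and are not taken from the slides.

```python
# Each rule is (conditions, outcome). The rules are illustrative only.
RULES = [
    ({"tear_production_rate": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no"}, "soft"),
    ({"astigmatic": "yes"}, "hard"),
]


def applicable_outcomes(example, rules):
    """Outcomes of all rules whose conditions the example satisfies."""
    return [
        outcome
        for conditions, outcome in rules
        if all(example.get(attr) == value for attr, value in conditions.items())
    ]


def classify_with_decision_list(example, rules, default=None):
    """Interpret the rules in sequence: the first applicable rule wins."""
    outcomes = applicable_outcomes(example, rules)
    return outcomes[0] if outcomes else default


conflicting = {"age": "young", "astigmatic": "no", "tear_production_rate": "reduced"}
uncovered = {"age": "presbyopic", "astigmatic": "no", "tear_production_rate": "normal"}

print(applicable_outcomes(conflicting, RULES))          # two rules apply
print(classify_with_decision_list(conflicting, RULES))  # order resolves the conflict
print(classify_with_decision_list(uncovered, RULES))    # no rule applies
```

A decision list resolves conflicts by rule order alone; the probabilities or weights mentioned on the slide would be an alternative way to choose among the applicable outcomes.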


Types of Learning in Data Mining

Classification Learning: Learning is achieved by presenting classified examples (historical/training data) in order to classify unseen examples (future/test data).

Association Learning: Associations among features are learned from historical data. Learning here is not limited to one particular attribute or feature.

Clustering: Examples are grouped together based on some similarity or homogeneity.

Numeric Prediction: The outcome to be predicted is not a discrete class but a numerical value.

Definition (Concept)
The thing being learned is called the concept, and the output of the learning method is known as the concept description.


Model Vs. Pattern

Definition (Model)
A model describes a global summary of the dataset, i.e., it makes a statement about any point in the full measurement space, even if some points in this space are missing from the data. Examples include predicting a value or assigning an example to a cluster.

Model Representation
In its simplest form, a model can be represented by:

Y = aX + c

where Y and X are variables (Y is the outcome), and a and c are model parameters. This is a linear model, since Y is a linear function of X.*

* Unlike in statistics, linearity here is in terms of the variables rather than the model parameters.


Pattern

Definition
A pattern describes a structure relating to a small (local) part of the data or measurement space. For example, mail-order purchase data may reveal a pattern that customers buying a particular product also buy another product.

Example (Fraud Detection)
Bank transaction data can be mined for fraud detection once the usual behaviors are described by patterns.

- Once these structures are defined, their parameters can be estimated from the data.
- Models or patterns with parameter values are called fitted models or fitted patterns, respectively.
- Fitted models or patterns are then used on future data.
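As an illustration of estimating parameters from data, the linear model Y = aX + c can be fitted by ordinary least squares and then applied to a new value of X. This is a minimal sketch: least squares is one common choice and is not named in the slides, and the function names and sample data are assumptions.

```python
def fit_line(xs, ys):
    """Estimate a and c in Y = aX + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    c = mean_y - a * mean_x
    return a, c


def predict(x, a, c):
    """Use the fitted model on a new (future) value of X."""
    return a * x + c


# Hypothetical training data generated from Y = 2X + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, c = fit_line(xs, ys)       # the fitted model is the structure plus (a, c)
print(a, c, predict(5.0, a, c))
```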


Why Algorithms

- We have seen that data mining tasks arise in a variety of real-world applications.
- Examples include exploratory data analysis, descriptive modeling, predictive modeling, pattern and rule discovery, content retrieval, and so on.
- To accomplish these tasks we need algorithms, termed data mining algorithms.

Readings
You should read about real-world applications of data mining from different resources to build an understanding of the different types of problems and data mining tasks.


Data Mining Algorithms

Although the division is not strict, a data mining algorithm generally has four basic components.

Components
1 Model or Pattern Structure: Describes the underlying structure or functional forms that we seek from the data.
2 Score Function: Also known as the cost function, objective function, or performance measure. It is used to evaluate the learning capability and quality of the fitted structure (pattern or model).
3 Optimization or Searching: Optimizing the score function and searching through the possible model and pattern structures to find the best one.
4 Data Management Strategy: Effective management of large data during optimization and searching.
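The four components can be seen side by side in a toy example. The sketch below is only one possible instantiation and rests on assumptions: a line Y = aX + c as the structure, the sum of squared errors as the score function, an exhaustive search over a small hypothetical grid of candidate parameters as the search step, and fixed-size batches as a stand-in for a data management strategy.

```python
def model(x, a, c):
    """1. Model structure: the functional form we seek, Y = aX + c."""
    return a * x + c


def score(batch, a, c):
    """2. Score function: sum of squared errors on a batch (lower is better)."""
    return sum((y - model(x, a, c)) ** 2 for x, y in batch)


def batches(data, size):
    """4. Data management: hand the data over in fixed-size chunks."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def grid_search(data, candidates, batch_size=2):
    """3. Search: try every candidate (a, c) and keep the best-scoring one."""
    best_params, best_score = None, float("inf")
    for a, c in candidates:
        total = sum(score(batch, a, c) for batch in batches(data, batch_size))
        if total < best_score:
            best_params, best_score = (a, c), total
    return best_params, best_score


# Hypothetical data generated from Y = 2X + 1.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
candidates = [(a, c) for a in (0.0, 1.0, 2.0, 3.0) for c in (0.0, 1.0, 2.0)]

best_params, best_score = grid_search(data, candidates)
print(best_params, best_score)
```

Real algorithms usually replace the exhaustive grid with a smarter optimizer, but the division of labor among the four parts stays the same.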


Understanding Input

Definition (Example or Instance)
A record or row in the data file is called an example, instance, or observation. Instances may be related to one another or independent of each other.

Definition (Attribute)
The fixed, predefined columns or fields of the data file are known as features or attributes. An instance is characterized by its values for the set of attributes. Attributes that are selected or used for a mining task are referred to as variables for the data mining algorithm.


Types of Data

Quantitative Data: Numerical data, either continuous (e.g., amount of sales, temperature) or integer (e.g., number of students in a class).

Qualitative Data: Data that approximates or characterizes but does not measure, e.g., present or absent, level of agreement.

Categorical Data: Data that represents one of several (limited) categories, e.g., the color of an object or the gender of a customer. Such data is also sometimes called discrete, as it represents well-separated categories.


Measurement Levels

Nominal: A variable can be treated as nominal when its values represent categories with no intrinsic ranking (for example, the department of the company in which an employee works). Examples of nominal variables include region, zip code, and religious affiliation.

Ordinal: A variable can be treated as ordinal when its values represent categories with some intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly satisfied). Examples of ordinal variables include attitude scores representing degree of satisfaction or confidence and preference rating scores.

Scale: A variable can be treated as scale when its values represent ordered categories with a meaningful metric, so that distance comparisons between values are appropriate. Examples of scale variables include age in years and income in thousands of dollars.
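The practical difference between the three levels is which comparisons between two values are meaningful. The sketch below encodes that idea; the names `Level` and `MEANINGFUL_COMPARISONS`, and the intermediate satisfaction labels, are assumptions made for illustration.

```python
from enum import Enum


class Level(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    SCALE = "scale"


# Which comparisons make sense at each measurement level.
MEANINGFUL_COMPARISONS = {
    Level.NOMINAL: {"equality"},
    Level.ORDINAL: {"equality", "order"},
    Level.SCALE: {"equality", "order", "distance"},
}

# An ordinal variable: only the ranking of the labels carries meaning.
# The three middle labels are assumed; the slides name only the two extremes.
SATISFACTION_ORDER = [
    "highly dissatisfied",
    "dissatisfied",
    "neutral",
    "satisfied",
    "highly satisfied",
]


def more_satisfied(first, second):
    """Order comparison, valid for ordinal (and scale) variables."""
    return SATISFACTION_ORDER.index(first) > SATISFACTION_ORDER.index(second)


print(more_satisfied("satisfied", "neutral"))
print(sorted(MEANINGFUL_COMPARISONS[Level.ORDINAL]))
```

Subtracting two ages in years is meaningful (scale), whereas subtracting two satisfaction labels or two zip codes is not.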


Class Activity 1

Identify the different types of data and assign the appropriate measurement level to each:


Class Activity 2

Identify the different types of data and assign the appropriate measurement level to each:


Issues with Input Data

For many reasons, real-world data is sometimes inaccurate, inexact, or incomplete, as opposed to what data mining algorithms assume.

Sparse Data
Most attributes of the data may contain zero values. For example, in market basket data recording customers' purchases, the quantity will be zero for the many products that a given customer has not purchased.
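A common way to handle such data is to store only the non-zero entries. The following sketch shows the idea for a single basket; the product list, the quantities, and the function names are hypothetical.

```python
# Hypothetical product catalogue and one customer's basket (dense form).
PRODUCTS = ["bread", "milk", "eggs", "tea", "rice", "sugar"]
dense_basket = [2, 0, 0, 1, 0, 0]


def to_sparse(quantities, products):
    """Keep only the products that were actually purchased."""
    return {p: q for p, q in zip(products, quantities) if q != 0}


def sparsity(quantities):
    """Fraction of attributes whose value is zero."""
    return sum(1 for q in quantities if q == 0) / len(quantities)


sparse_basket = to_sparse(dense_basket, PRODUCTS)
print(sparse_basket)
print(sparsity(dense_basket))
```

With thousands of products and only a handful purchased per customer, the sparse form is far smaller than the dense row.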


Missing Values

- A respondent in a survey may refuse to answer a few questions, a malfunctioning instrument may fail to record data for some attributes, or the values of some attributes may not be measurable in some circumstances. Such datasets will then contain missing values for specific attributes.
- Missing values may be represented in the dataset by an out-of-range value (e.g., a negative value where the attribute cannot be negative), or by a dash, question mark, etc.
- When collecting or recording data, one may not find an attribute useful for one's own operations, yet that attribute may be important for the mining task; we are then faced with missing attributes.
- For example, a university may not be interested in parents' education or income, but these attributes may be significant when mining student data for possible financial aid offers.
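Before mining, the different markers for a missing value are usually mapped to a single representation. The sketch below converts them to `None`; the set of markers, the field names, and the sample record are assumptions chosen to mirror the conventions listed above.

```python
# Assumed textual markers for a missing value.
MISSING_MARKERS = {"?", "-", ""}


def clean_value(raw, non_negative=False):
    """Return None for a missing value, otherwise the cleaned value.

    For fields that cannot be negative, a negative number is treated
    as an out-of-range marker for "missing".
    """
    text = raw.strip()
    if text in MISSING_MARKERS:
        return None
    if non_negative:
        number = float(text)
        return None if number < 0 else number
    return text


def clean_record(record, non_negative_fields):
    """Apply clean_value to every field of one record."""
    return {
        field: clean_value(value, field in non_negative_fields)
        for field, value in record.items()
    }


# Hypothetical student record with two missing values.
record = {"student": "student_1", "age": "-1", "income": "?", "city": "Quetta"}
cleaned = clean_record(record, non_negative_fields={"age", "income"})
print(cleaned)
```

Note that this only handles missing values; a missing attribute (one that was never recorded at all) cannot be recovered by cleaning.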


Example of Missing Values


Inaccurate Values

Since the data used for a data mining task is not collected or recorded explicitly for this purpose, one should carefully analyze the data for rogue attributes or attribute values.

Inaccuracy may arise from:
- Typographical errors
- Measurement errors
- Merging data from different sources
- Deliberate entry of incorrect values
