
Ludwig-Maximilians-Universität München, Lehrstuhl für Datenbanksysteme und Data Mining

Prof. Dr. Thomas Seidl

Knowledge Discovery and Data Mining I

Winter Semester 2018/19


Agenda

1. Introduction

2. Basics
   2.1 Data Representation
   2.2 Data Reduction
   2.3 Visualization
   2.4 Privacy

3. Unsupervised Methods

4. Supervised Methods

5. Advanced Topics


Data Reduction

Why data reduction?

- Better perception of patterns
  - Raw (tabular) data is hard to understand
  - Visualization is limited to (hundreds of) thousands of objects
  - Reduction of data may help to identify patterns
- Computational complexity
  - Big data sets cause prohibitively long runtimes for data mining algorithms
  - A reduced data set is the more useful, the more closely the algorithms reproduce (almost) the same analytical results on it

How to approach data reduction?

- Data aggregation (basic statistics)
- Data generalization (abstraction to higher levels)


Data Reduction Strategies

ID | A1 | A2 | A3
 1 | 54 | 56 | 75
 2 | 87 | 12 | 65
 3 | 34 | 63 | 76
 4 | 86 | 23 |  4

- Numerosity reduction: reduce the number of objects
- Dimensionality reduction: reduce the number of attributes
- Quantization, discretization: reduce the number of values per domain

Resulting reduced table:

ID | A1 | A3
 1 | L  | 75
 3 | XS | 76
 4 | XL |  4

Numerosity reduction

Reduce number of objects

- Sampling (loss of data)
- Aggregation (model parameters, e.g., center / spread)
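A minimal Python sketch (illustrative values only, mirroring the toy table above) of both numerosity-reduction options:

```python
import random
import statistics

table = [
    {"ID": 1, "A1": 54, "A2": 56, "A3": 75},
    {"ID": 2, "A1": 87, "A2": 12, "A3": 65},
    {"ID": 3, "A1": 34, "A2": 63, "A3": 76},
    {"ID": 4, "A1": 86, "A2": 23, "A3": 4},
]

# Sampling: keep a smaller random subset of the objects (lossy).
sample = random.sample(table, k=2)

# Aggregation: replace the data by model parameters (center / spread).
summary = {
    col: (statistics.mean(r[col] for r in table),
          statistics.stdev(r[col] for r in table))
    for col in ("A1", "A2", "A3")
}
print(sample)
print(summary)
```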


Data Reduction Strategies

Dimensionality reduction

Reduce number of attributes

- Linear methods: feature sub-selection, Principal Components Analysis, random projections, Fourier transform, wavelet transform
- Non-linear methods: multidimensional scaling (force model)

Quantization, discretization

Reduce number of values per domain

- Binning (various types of histograms)
- Generalization along hierarchies (OLAP, attribute-oriented induction)
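A rough NumPy sketch of both reduction kinds, under assumptions not fixed by the slides: a PCA-style projection via SVD as one concrete linear method, and simple equal-width quantization of one attribute:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 objects, 5 attributes

# Dimensionality reduction: project onto the top-2 principal components.
Xc = X - X.mean(axis=0)                # center the data first
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_2d = Xc @ Vt[:2].T                   # now 100 objects, 2 attributes

# Quantization: reduce the number of values per domain via binning.
edges = np.linspace(X[:, 0].min(), X[:, 0].max(), num=5)  # 4 equal-width bins
bin_ids = np.digitize(X[:, 0], edges[1:-1])               # bin index per value
```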


Data Generalization

- Quantization is a special case of generalization
  - E.g., group age (7 bits) to age range (4 bits)
- Dimensionality reduction is degenerate quantization
  - Dropping age reduces 7 bits to zero bits
  - Corresponds to generalization of age to "all" = "any age" = no information

[Figure: concept hierarchy for age with root "all" and children Teen (13 ... 19), Twen (20 ... 29), Mid-age (30 ...)]
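A small sketch of such a generalization operator; the range boundaries (13-19, 20-29, 30+) are read off the reconstructed hierarchy above:

```python
def generalize_age(age: int) -> str:
    """Map a concrete age to a concept of the age hierarchy."""
    if 13 <= age <= 19:
        return "Teen"
    if 20 <= age <= 29:
        return "Twen"
    if age >= 30:
        return "Mid-age"
    return "all"   # no more specific concept: fall back to the root

print([generalize_age(a) for a in (15, 24, 42)])   # ['Teen', 'Twen', 'Mid-age']
```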


Data Aggregation

- Aggregation is numerosity reduction (= fewer tuples)
- Generalization yields duplicates: merge duplicate tuples and introduce an (additional) counter attribute

Original relation:

Name | Age | Major
Ann  | 27  | CS
Bob  | 26  | CS
Eve  | 19  | CS

After generalization:

Name  | Age  | Major
(any) | Twen | CS
(any) | Twen | CS
(any) | Teen | CS

After aggregation:

Age  | Major | Count
Twen | CS    | 2
Teen | CS    | 1
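A small pandas sketch reproducing the three tables above (generalize Name and Age, then merge duplicates with a counter attribute):

```python
import pandas as pd

students = pd.DataFrame({
    "Name":  ["Ann", "Bob", "Eve"],
    "Age":   [27, 26, 19],
    "Major": ["CS", "CS", "CS"],
})

# Generalization: Name -> (any), Age -> age range.
students["Name"] = "(any)"
students["Age"] = students["Age"].apply(lambda a: "Twen" if a >= 20 else "Teen")

# Aggregation: merge duplicate tuples and count them.
aggregated = students.groupby(["Age", "Major"]).size().reset_index(name="Count")
print(aggregated)   # Teen/CS -> 1, Twen/CS -> 2
```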


Basic Aggregates

- Central tendency: Where is the data located? Where is it centered?
  - Examples: mean, median, mode, etc. (see below)
- Variation, spread: How much do the data deviate from the center?
  - Examples: variance / standard deviation, min-max range, ...

Examples

- Age of students is around 20
- Shoe size is centered around 40
- Recent dates are around 2020
- Average income is in the thousands


Distributive Aggregate Measures

Distributive Measures

The result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.

Examples

- count(D1 ∪ D2) = count(D1) + count(D2)
- sum(D1 ∪ D2) = sum(D1) + sum(D2)
- min(D1 ∪ D2) = min(min(D1), min(D2))
- max(D1 ∪ D2) = max(max(D1), max(D2))
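The distributive property can be checked directly; a small Python sketch over two arbitrary partitions (illustrative values):

```python
D1, D2 = [3, 9, 4], [7, 1]

# Distributive: aggregate each partition, then combine the sub-aggregates.
assert len(D1 + D2) == len(D1) + len(D2)          # count
assert sum(D1 + D2) == sum(D1) + sum(D2)          # sum
assert min(D1 + D2) == min(min(D1), min(D2))      # min
assert max(D1 + D2) == max(max(D1), max(D2))      # max
```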


Algebraic Aggregate Measures

Algebraic Measures

Can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.

Examples

- $$\operatorname{avg}(D_1 \cup D_2) = \frac{\operatorname{sum}(D_1 \cup D_2)}{\operatorname{count}(D_1 \cup D_2)} = \frac{\operatorname{sum}(D_1) + \operatorname{sum}(D_2)}{\operatorname{count}(D_1) + \operatorname{count}(D_2)} \neq \operatorname{avg}(\operatorname{avg}(D_1), \operatorname{avg}(D_2))$$
- standard deviation(D1 ∪ D2): likewise computable from the distributive aggregates count, sum, and sum of squares
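A small sketch showing why avg is algebraic but not distributive (illustrative values):

```python
D1, D2 = [1.0, 2.0, 3.0], [10.0]

# avg is algebraic: computable from the distributive aggregates sum and count.
merged_avg = (sum(D1) + sum(D2)) / (len(D1) + len(D2))        # 4.0

# Averaging the partition averages is wrong in general:
avg_of_avgs = (sum(D1) / len(D1) + sum(D2) / len(D2)) / 2     # 6.0
assert merged_avg != avg_of_avgs
```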


Holistic Aggregate Measures

Holistic Measures

There is no constant bound on the storage size needed to determine/describe a sub-aggregate.

Examples

- median: value in the middle of a sorted series of values (= 50% quantile)

  median(D1 ∪ D2) ≠ simple function(median(D1), median(D2))

- mode: value that appears most often in a set of values
- rank: k-smallest / k-largest value (cf. quantiles, percentiles)
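A small illustration that the median of a union is not recoverable from the partition medians alone (illustrative values):

```python
from statistics import median

D1, D2 = [1, 2, 100], [3, 4]

# The true median of the union ...
true_median = median(D1 + D2)                # 3

# ... is not a simple function of the partition medians (2 and 3.5):
print(median(D1), median(D2), true_median)   # 2 3.5 3
```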


Measuring the Central Tendency

Mean – (weighted) arithmetic mean

Well-known measure for central tendency ("average"):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad\qquad \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Mid-range

Average of the largest and the smallest values in a data set:

$$(\max + \min)/2$$

- Algebraic measures
- Applicable to numerical data only (require sum and scalar multiplication)
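A minimal sketch of the three measures, with illustrative values and weights:

```python
xs = [2.0, 4.0, 9.0]
ws = [1.0, 1.0, 2.0]          # illustrative weights

mean = sum(xs) / len(xs)                                   # 5.0
weighted = sum(w * x for w, x in zip(ws, xs)) / sum(ws)    # 6.0
mid_range = (max(xs) + min(xs)) / 2                        # 5.5
```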

What about categorical data?


Measuring the Central Tendency

Median

- Middle value if odd number of values
- For even number of values: average of the two middle values (numeric case), or one of the two middle values (non-numeric case)
- Applicable to ordinal data only (an ordering is required)
- Holistic measure

Examples

- never, never, never, rarely, rarely, often, usually, usually, always
- tiny, small, big, big, big, big, big, big, huge, huge
- tiny, tiny, small, medium, big, big, large, huge

What if there is no ordering?
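A sketch of the case distinction above; the ordinal encoding in the second call is an assumption for illustration:

```python
def median(values, numeric=True):
    """Median of an ordered series; holistic: needs all values at once."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                        # odd: the unique middle value
    if numeric:
        return (s[n // 2 - 1] + s[n // 2]) / 2  # even, numeric: average the two middle values
    return s[n // 2 - 1]                        # even, non-numeric: pick one of the two

print(median([5, 1, 3, 2]))   # 2.5
# Ordinal data: encode the domain order first (never = 0 < rarely = 1 < often = 2, ...)
print(median([0, 0, 0, 1, 1, 2, 3, 3, 4], numeric=False))   # 1, i.e., 'rarely'
```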

Measuring the Central Tendency

[Figure: two frequency distributions over the values red, green, blue: a unimodal one (single peak) and a bimodal one (two peaks)]

Mode

- Value that occurs most frequently in the data
- Example: blue, red, blue, yellow, green, blue, red → mode is blue
- Unimodal, bimodal, trimodal, ...: there are 1, 2, 3, ... modes in the data (multi-modal in general), cf. mixture models
- There is no mode if each data value occurs only once
- Well suited for categorical (i.e., non-numerical) data
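A minimal sketch using collections.Counter; it also detects multi-modality:

```python
from collections import Counter

data = ["blue", "red", "blue", "yellow", "green", "blue", "red"]
counts = Counter(data)
top = max(counts.values())
modes = [v for v, c in counts.items() if c == top]   # ['blue'] -> unimodal
# len(modes) == 1: unimodal, 2: bimodal, ...; no meaningful mode if top == 1
```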


Measuring the Dispersion of Data

Variance

- Applicable to numerical data, scalable computation:

$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2\right)$$

- Calculation by two passes: numerically much more stable
- Single pass: calculate the sum of squares and the square of the sum in parallel
- Measures the spread around the mean
- It is zero if and only if all the values are equal
- Standard deviation: square root of the variance
- Both the standard deviation and the variance are algebraic
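A sketch contrasting the two computation schemes named above; the single-pass variant maintains sum and sum of squares in parallel:

```python
import math

def variance_two_pass(xs):
    """Numerically stable: first pass for the mean, second for squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_single_pass(xs):
    """One pass over the data: keep count, sum, and sum of squares in parallel."""
    n = s = sq = 0
    for x in xs:
        n, s, sq = n + 1, s + x, sq + x * x
    return (sq - s * s / n) / (n - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
assert math.isclose(variance_two_pass(xs), variance_single_pass(xs))
```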


Boxplot Analysis

Five-number summary of a distribution

- Minimum, Q1, Median, Q3, Maximum
- Represent the 0%, 25%, 50%, 75%, 100% quantiles of the data
- Quantiles are also called percentiles, e.g., Q1 is the "25th percentile"

Boxplot

- Box boundaries: first and third quartiles
- Box height: inter-quartile range (IQR = Q3 - Q1)
- The median is marked by a line within the box
- Whiskers: minimum and maximum
- Outliers: usually values more than 1.5 · IQR below Q1 or above Q3
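A small NumPy sketch computing the five-number summary and the 1.5 · IQR outlier rule on illustrative data:

```python
import numpy as np

xs = np.array([5, 7, 8, 8, 9, 11, 13, 13, 14, 15, 17, 23, 26, 32, 39, 97])

q0, q1, q2, q3, q4 = np.quantile(xs, [0.0, 0.25, 0.5, 0.75, 1.0])
print("five-number summary:", q0, q1, q2, q3, q4)

# Conventional outlier rule: more than 1.5 * IQR below Q1 or above Q3.
iqr = q3 - q1
outliers = xs[(xs < q1 - 1.5 * iqr) | (xs > q3 + 1.5 * iqr)]
print("outliers:", outliers)   # 97 lies far above the upper fence
```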


Boxplot Example

[Figure: boxplots of sepal length (cm), roughly 4.5 to 8.0, per class setosa, versicolor, virginica of the Iris dataset]


Data Generalization

- Which partitions of the data to aggregate?
  - All data: overall mean and overall variance are too coarse (overgeneralized)
- Different techniques to form groups for aggregation:
  - Binning: histograms, based on value ranges
  - Generalization: abstraction based on generalization hierarchies
  - Clustering (see later): based on object similarity


Binning Techniques: Histograms

[Figure: equi-width histogram of sepal length (mm) for the Iris dataset, bins 40-44, 45-49, ..., 75-79; counts up to about 30]

- Histograms use binning to approximate data distributions
- Divide the data into bins and store a representative (sum, average, median) for each bin
- Popular method for data reduction and analysis
- Related to quantization problems
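A minimal NumPy sketch of this reduction: bin the data and keep one representative per bin (here the bin mean); values and bin count are illustrative:

```python
import numpy as np

values = np.array([5, 7, 8, 8, 9, 11, 13, 14, 17, 19, 23, 26, 32, 39])

counts, edges = np.histogram(values, bins=4)
# Store one representative per bin instead of the raw data.
ids = np.clip(np.digitize(values, edges) - 1, 0, len(counts) - 1)
reps = [values[ids == b].mean() for b in range(len(counts))]
print(counts, reps)
```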


Equi-width Histograms

- Divide the range into N intervals of equal size: uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals is (B - A)/N

Positive

- Most straightforward

Negative

- Outliers may dominate the presentation
- Skewed data is not handled well


Equi-width Histograms

Example

- Sorted data, 10 bins: 5, 7, 8, 8, 9, 11, 13, 13, 14, 14, 14, 15, 17, 17, 17, 18, 19, 23, 24, 25, 26, 26, 26, 27, 28, 32, 34, 36, 37, 38, 39, 97
- Insert 999 (see the sketch after the figure)

[Figure: left, equi-width histogram with bins 0-9, 10-19, ..., 90-99 (counts up to 12); right, after inserting the value 999, bins 0-99, 100-199, ..., 900-999: nearly all values fall into the first bin]
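A sketch reproducing this effect with NumPy on the data above:

```python
import numpy as np

data = [5, 7, 8, 8, 9, 11, 13, 13, 14, 14, 14, 15, 17, 17, 17, 18,
        19, 23, 24, 25, 26, 26, 26, 27, 28, 32, 34, 36, 37, 38, 39, 97]

# Equi-width: N bins of width (B - A) / N over the value range.
counts, edges = np.histogram(data, bins=10, range=(0, 100))
print(counts)            # spread over the bins 0-9, 10-19, ..., 90-99

# A single extreme outlier stretches the range and dominates the picture:
counts2, edges2 = np.histogram(data + [999], bins=10, range=(0, 1000))
print(counts2)           # almost everything collapses into the first bin 0-99
```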


Equi-height Histograms

Divide the range into N intervals, each containing approximately the same number of samples (quantile-based approach).

Positive

- Good data scaling

Negative

- If any single value occurs often, the equal-frequency criterion might not be met (intervals have to be disjoint!)


Equi-height Histograms

Example

- Same data, 4 bins: 5, 7, 8, 8, 9, 11, 13, 13, 14, 14, 14, 15, 17, 17, 17, 18, 19, 23, 24, 25, 26, 26, 26, 27, 28, 32, 34, 36, 37, 38, 39, 97

[Figure: equi-height histogram with bins 5-13, 14-18, 19-27, 28-99, each holding 8 values]

- Median = 50% quantile
- More robust against outliers (cf. value 999 from above)
- The four-bin example is strongly related to the boxplot
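A sketch of the quantile-based construction with NumPy on the same data:

```python
import numpy as np

data = np.array([5, 7, 8, 8, 9, 11, 13, 13, 14, 14, 14, 15, 17, 17, 17, 18,
                 19, 23, 24, 25, 26, 26, 26, 27, 28, 32, 34, 36, 37, 38, 39, 97])

# Equi-height: bin edges are quantiles, so each bin holds ~ the same count.
edges = np.quantile(data, [0.0, 0.25, 0.5, 0.75, 1.0])
counts, _ = np.histogram(data, bins=edges)
print(edges)    # compare: 5-13, 14-18, 19-27, 28-99 on the slide
print(counts)   # roughly 8 values per bin; the outlier 97 no longer dominates
```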


Concept Hierarchies: Examples

No (real) hierarchies

Set grouping hierarchies


Concept Hierarchies: Examples

Schema hierarchies


Concept Hierarchy for Categorical Data

- Concept hierarchies can be specified by experts or just by users
- Heuristically generate a hierarchy for a set of (related) attributes, based on the number of distinct values per attribute in the attribute set
- The attribute with the most distinct values is placed at the lowest level of the hierarchy:

  country: 15 distinct values
  province/state: 65 distinct values
  city: 3,567 distinct values
  street: 674,339 distinct values

- Fails for counterexamples: 20 distinct years, 12 months, and 7 days of the week would yield "year < month < day of week" with the latter on top, which is wrong
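A minimal sketch of this heuristic (the function name is hypothetical):

```python
# Heuristic from above: order (related) attributes by their number of
# distinct values; the most distinct attribute becomes the lowest level.
def build_hierarchy(distinct_counts: dict[str, int]) -> list[str]:
    """Return attribute names from highest (most general) to lowest level."""
    return sorted(distinct_counts, key=distinct_counts.get)

counts = {"country": 15, "province/state": 65, "city": 3567, "street": 674339}
print(build_hierarchy(counts))
# ['country', 'province/state', 'city', 'street'] - but the same heuristic
# would wrongly put 'day of week' (7 values) on top of 'year' (20 values)
```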


Summarization-based Aggregation

Data Generalization

A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.

[Figure: conceptual levels example with five levels: 1 all, 2 federal states, 3 states, 4 countries, 5 cities]

- Approaches:
  - Data-cube approach (OLAP / roll-up): manual
  - Attribute-oriented induction (AOI): automated


Basic OLAP Operations

Roll up

Summarize data by climbing up the hierarchy or by dimension reduction.

Drill down

Reverse of roll-up: from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions.

Slice and dice

Selection on one (slice) or more (dice) dimensions.

Pivot (rotate)

Reorient the cube: a visualization operation, e.g., turning a 3D cube into a series of 2D planes.


Example: Roll up / Drill down

Query

SELECT *
FROM business
GROUP BY country, quarter

Roll-Up

SELECT *
FROM business
GROUP BY continent, quarter

SELECT *
FROM business
GROUP BY country

Drill-Down

SELECT *
FROM business
GROUP BY city, quarter

SELECT *
FROM business
GROUP BY country, quarter, product


Example: Roll up in a Data Cube

[Figure: roll-up of a data cube, aggregating the detailed cube into a coarser one]


Example: Slice Operation

SELECT income
FROM time t, product p, country c
WHERE p.name = 'VCR'

The 'VCR' slice of the product dimension is chosen.


Example: Dice Operation

SELECT income
FROM time t, product p, country c
WHERE (p.name = 'VCR' OR p.name = 'PC') AND t.quarter BETWEEN 2 AND 3

The sub-data cube over PC, VCR and quarters 2 and 3 is extracted. (Note the parentheses: AND binds more strongly than OR, so without them the condition would also select all quarters for 'VCR'.)

[Figure: dice operation extracting the sub-cube]


Example: Pivot (rotate)

year    | 17        | 18        | 19
product | TV PC VCR | TV PC VCR | TV PC VCR
...     | ...       | ...       | ...

↓ Pivot (rotate) ↓

product | TV       | PC       | VCR
year    | 17 18 19 | 17 18 19 | 17 18 19
...     | ...      | ...      | ...
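In pandas terms, pivoting swaps which dimension forms the rows and which the columns; a sketch with a hypothetical fact table:

```python
import pandas as pd

sales = pd.DataFrame({                      # hypothetical fact table
    "year":    [17, 17, 18, 18, 19, 19],
    "product": ["TV", "PC", "TV", "PC", "TV", "VCR"],
    "income":  [10, 12, 11, 15, 13, 2],
})

# One orientation: products as rows, years as columns ...
print(pd.pivot_table(sales, values="income", index="product", columns="year"))

# ... pivot (rotate): swap the roles of the two dimensions.
print(pd.pivot_table(sales, values="income", index="year", columns="product"))
```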


Basic OLAP Operations

Other operations

- Drill across: involving (across) more than one fact table
- Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)


Specifying Generalization by a Star-Net

- Each circle is called a footprint
- Footprints represent the granularities available for OLAP operations

[Figure: star-net with the center "Customer Orders" and radiating dimension lines Time, Location, Organization, Product; example footprints include Segment (Product), Daily (Time), and District (Location)]


Discussion of OLAP-based Generalization

- Strength
  - Efficient implementation of data generalization
  - Computation of various kinds of measures, e.g., count, sum, average, max
  - Generalization (and specialization) can be performed on a data cube by roll-up (and drill-down)
- Limitations
  - Handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values
  - Lack of intelligent analysis: cannot tell which dimensions should be used or what levels the generalization should reach


Attribute-Oriented Induction (AOI)

- Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
- Data focusing: select the task-relevant data, including dimensions; the result is the initial relation
- Generalization plan: perform generalization by either attribute removal or attribute generalization


Attribute-Oriented Induction (AOI)

Attribute Removal

Remove attribute A if:

- there is a large set of distinct values for A but there is no generalization operator (concept hierarchy) on A, or
- A's higher-level concepts are expressed in terms of other attributes (e.g., street is covered by city, state, country).

Attribute Generalization

If there is a large set of distinct values for A, and there exists a set of generalization operators (i.e., a concept hierarchy) on A, then select an operator and generalize A.
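A hedged pandas sketch of one AOI pass combining both rules with the count-based aggregation from above; the threshold value and the hierarchy encoding (a dict mapping each value to its higher-level concept) are assumptions:

```python
import pandas as pd

THRESHOLD = 5   # assumed attribute threshold (see generalization control below)

def aoi_step(df: pd.DataFrame, hierarchies: dict) -> pd.DataFrame:
    """One AOI pass: remove or generalize attributes, then merge with counts."""
    out = df.copy()
    for col in df.columns:
        if out[col].nunique() <= THRESHOLD:
            continue                                    # few values: keep as is
        if col in hierarchies:
            out[col] = out[col].map(hierarchies[col])   # attribute generalization
        else:
            out = out.drop(columns=col)                 # attribute removal
    # Merge identical generalized tuples and accumulate their counts.
    return out.groupby(list(out.columns)).size().reset_index(name="Count")
```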


Attribute Oriented Induction: Example

Name           | Gender | Major   | Birth place           | Birth date | Residence                | Phone    | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-81    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-80    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-75    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
...            | ...    | ...     | ...                   | ...        | ...                      | ...      | ...

- Name: large number of distinct values, no hierarchy → removed
- Gender: only two distinct values → retained
- Major: many values, hierarchy exists → generalized to Sci., Eng., Biz.
- Birth place: many values, hierarchy exists → generalized, e.g., to country
- Birth date: many values → generalized to age (or age range)
- Residence: many streets and numbers → generalized to city
- Phone number: many values, no hierarchy → removed
- Grade point avg (GPA): hierarchy exists → generalized to good, ...
- Count: additional attribute to aggregate base tuples


Attribute Oriented Induction: Example

- Initial relation: as on the previous slide


- Prime generalized relation:

Gender | Major   | Birth region | Age range | Residence | GPA       | Count
M      | Science | Canada       | 20-25     | Richmond  | Very good | 16
F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22
...    | ...     | ...          | ...       | ...       | ...       | ...

- Crosstab for the generalized relation:

      | Canada | Foreign | Total
M     | 16     | 14      | 30
F     | 10     | 22      | 32
Total | 26     | 36      | 62
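Such a crosstab can be computed, e.g., with pandas; a sketch over the two excerpt rows shown above (the slide's full table aggregates more tuples, so some cells remain empty here):

```python
import pandas as pd

prime = pd.DataFrame({                       # prime generalized relation (excerpt)
    "Gender": ["M", "F"],
    "Birth_region": ["Canada", "Foreign"],
    "Count": [16, 22],
})

ct = pd.crosstab(index=prime["Gender"], columns=prime["Birth_region"],
                 values=prime["Count"], aggfunc="sum", margins=True)
print(ct)   # 'All' row/column gives the totals
```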


Attribute Generalization Control

- Problem: How many distinct values should an attribute have?
  - Overgeneralization: values are too high-level
  - Undergeneralization: level not sufficiently high
  - Both yield tuples of poor usefulness
- Two common approaches
  - Attribute-threshold control: default or user-specified, typically 2-8 values
  - Generalized-relation threshold control: control the size of the final relation/rule, e.g., 10-30


Strategies for Selecting the Next Attribute for Generalization

- Aiming at a minimal degree of generalization
  - Choose the attribute that reduces the number of tuples the most
  - Useful heuristic: choose the attribute with the highest number of distinct values
- Aiming at a similar degree of generalization for all attributes
  - Choose the attribute currently having the least degree of generalization
- User-controlled
  - Domain experts may specify appropriate priorities for the selection of attributes
