
Ludwig-Maximilians-Universität München, Lehrstuhl für Datenbanksysteme und Data Mining

Prof. Dr. Thomas Seidl

Knowledge Discovery and Data Mining I

Winter Semester 2018/19


Agenda

1. Introduction

2. Basics
   2.1 Data Representation
   2.2 Data Reduction
   2.3 Visualization
   2.4 Privacy

3. Unsupervised Methods

4. Supervised Methods

5. Advanced Topics


Data Reduction

Why data reduction?

- Better perception of patterns
  - Raw (tabular) data is hard to understand
  - Visualization is limited to (hundreds of) thousands of objects
  - Reduction of data may help to identify patterns
- Computational complexity
  - Big data sets cause prohibitively long runtimes for data mining algorithms
  - A reduced data set is the more useful, the more closely the algorithms reproduce (almost) the same analytical results on it

How to approach data reduction?

- Data aggregation (basic statistics)
- Data generalization (abstraction to higher levels)


Data Reduction Strategies

ID | A1 | A2 | A3
 1 | 54 | 56 | 75
 2 | 87 | 12 | 65
 3 | 34 | 63 | 76
 4 | 86 | 23 |  4

- Numerosity reduction: reduce the number of objects
- Dimensionality reduction: reduce the number of attributes
- Quantization, discretization: reduce the number of values per domain

Resulting reduced table:

ID | A1 | A3
 1 | L  | 75
 3 | XS | 76
 4 | XL |  4

Numerosity reduction

Reduce number of objects

- Sampling (loss of data)
- Aggregation (model parameters, e.g., center / spread)
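A minimal Python sketch (illustrative values only, mirroring the toy table above) of both numerosity-reduction options:

```python
import random
import statistics

table = [
    {"ID": 1, "A1": 54, "A2": 56, "A3": 75},
    {"ID": 2, "A1": 87, "A2": 12, "A3": 65},
    {"ID": 3, "A1": 34, "A2": 63, "A3": 76},
    {"ID": 4, "A1": 86, "A2": 23, "A3": 4},
]

# Sampling: keep a smaller random subset of the objects (lossy).
sample = random.sample(table, k=2)

# Aggregation: replace the data by model parameters (center / spread).
summary = {
    col: (statistics.mean(r[col] for r in table),
          statistics.stdev(r[col] for r in table))
    for col in ("A1", "A2", "A3")
}
print(sample)
print(summary)
```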


Data Reduction Strategies

Dimensionality reduction

Reduce number of attributes

- Linear methods: feature sub-selection, Principal Components Analysis, random projections, Fourier transform, wavelet transform
- Non-linear methods: multidimensional scaling (force model)

Quantization, discretization

Reduce number of values per domain

- Binning (various types of histograms)
- Generalization along hierarchies (OLAP, attribute-oriented induction)
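A rough NumPy sketch of both reduction kinds, under assumptions not fixed by the slides: a PCA-style projection via SVD as one concrete linear method, and simple equal-width quantization of one attribute:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 objects, 5 attributes

# Dimensionality reduction: project onto the top-2 principal components.
Xc = X - X.mean(axis=0)                # center the data first
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_2d = Xc @ Vt[:2].T                   # now 100 objects, 2 attributes

# Quantization: reduce the number of values per domain via binning.
edges = np.linspace(X[:, 0].min(), X[:, 0].max(), num=5)  # 4 equal-width bins
bin_ids = np.digitize(X[:, 0], edges[1:-1])               # bin index per value
```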


Data Generalization

- Quantization is a special case of generalization
  - E.g., group age (7 bits) to age range (4 bits)
- Dimensionality reduction is degenerate quantization
  - Dropping age reduces 7 bits to zero bits
  - Corresponds to generalization of age to "all" = "any age" = no information

[Figure: concept hierarchy for age with root "all" and children Teen (13 ... 19), Twen (20 ... 29), Mid-age (30 ...)]
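A small sketch of such a generalization operator; the range boundaries (13-19, 20-29, 30+) are read off the reconstructed hierarchy above:

```python
def generalize_age(age: int) -> str:
    """Map a concrete age to a concept of the age hierarchy."""
    if 13 <= age <= 19:
        return "Teen"
    if 20 <= age <= 29:
        return "Twen"
    if age >= 30:
        return "Mid-age"
    return "all"   # no more specific concept: fall back to the root

print([generalize_age(a) for a in (15, 24, 42)])   # ['Teen', 'Twen', 'Mid-age']
```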


Data Aggregation

- Aggregation is numerosity reduction (= fewer tuples)
- Generalization yields duplicates: merge duplicate tuples and introduce an (additional) counter attribute

Original relation:

Name | Age | Major
Ann  | 27  | CS
Bob  | 26  | CS
Eve  | 19  | CS

After generalization:

Name  | Age  | Major
(any) | Twen | CS
(any) | Twen | CS
(any) | Teen | CS

After aggregation:

Age  | Major | Count
Twen | CS    | 2
Teen | CS    | 1
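A small pandas sketch reproducing the three tables above (generalize Name and Age, then merge duplicates with a counter attribute):

```python
import pandas as pd

students = pd.DataFrame({
    "Name":  ["Ann", "Bob", "Eve"],
    "Age":   [27, 26, 19],
    "Major": ["CS", "CS", "CS"],
})

# Generalization: Name -> (any), Age -> age range.
students["Name"] = "(any)"
students["Age"] = students["Age"].apply(lambda a: "Twen" if a >= 20 else "Teen")

# Aggregation: merge duplicate tuples and count them.
aggregated = students.groupby(["Age", "Major"]).size().reset_index(name="Count")
print(aggregated)   # Teen/CS -> 1, Twen/CS -> 2
```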


Basic Aggregates

- Central tendency: Where is the data located? Where is it centered?
  - Examples: mean, median, mode, etc. (see below)
- Variation, spread: How much do the data deviate from the center?
  - Examples: variance / standard deviation, min-max range, ...

Examples

- Age of students is around 20
- Shoe size is centered around 40
- Recent dates are around 2020
- Average income is in the thousands


Distributive Aggregate Measures

Distributive Measures

The result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.

Examples

- count(D1 ∪ D2) = count(D1) + count(D2)
- sum(D1 ∪ D2) = sum(D1) + sum(D2)
- min(D1 ∪ D2) = min(min(D1), min(D2))
- max(D1 ∪ D2) = max(max(D1), max(D2))
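The distributive property can be checked directly; a small Python sketch over two arbitrary partitions (illustrative values):

```python
D1, D2 = [3, 9, 4], [7, 1]

# Distributive: aggregate each partition, then combine the sub-aggregates.
assert len(D1 + D2) == len(D1) + len(D2)          # count
assert sum(D1 + D2) == sum(D1) + sum(D2)          # sum
assert min(D1 + D2) == min(min(D1), min(D2))      # min
assert max(D1 + D2) == max(max(D1), max(D2))      # max
```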


Algebraic Aggregate Measures

Algebraic Measures

Can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.

Examples

- $$\operatorname{avg}(D_1 \cup D_2) = \frac{\operatorname{sum}(D_1 \cup D_2)}{\operatorname{count}(D_1 \cup D_2)} = \frac{\operatorname{sum}(D_1) + \operatorname{sum}(D_2)}{\operatorname{count}(D_1) + \operatorname{count}(D_2)} \neq \operatorname{avg}(\operatorname{avg}(D_1), \operatorname{avg}(D_2))$$
- standard deviation(D1 ∪ D2): likewise computable from the distributive aggregates count, sum, and sum of squares
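A small sketch showing why avg is algebraic but not distributive (illustrative values):

```python
D1, D2 = [1.0, 2.0, 3.0], [10.0]

# avg is algebraic: computable from the distributive aggregates sum and count.
merged_avg = (sum(D1) + sum(D2)) / (len(D1) + len(D2))        # 4.0

# Averaging the partition averages is wrong in general:
avg_of_avgs = (sum(D1) / len(D1) + sum(D2) / len(D2)) / 2     # 6.0
assert merged_avg != avg_of_avgs
```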


Holistic Aggregate Measures

Holistic Measures

There is no constant bound on the storage size needed to determine/describe a sub-aggregate.

Examples

- median: value in the middle of a sorted series of values (= 50% quantile)

  median(D1 ∪ D2) ≠ simple function(median(D1), median(D2))

- mode: value that appears most often in a set of values
- rank: k-smallest / k-largest value (cf. quantiles, percentiles)
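A small illustration that the median of a union is not recoverable from the partition medians alone (illustrative values):

```python
from statistics import median

D1, D2 = [1, 2, 100], [3, 4]

# The true median of the union ...
true_median = median(D1 + D2)                # 3

# ... is not a simple function of the partition medians (2 and 3.5):
print(median(D1), median(D2), true_median)   # 2 3.5 3
```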


Measuring the Central Tendency

Mean – (weighted) arithmetic mean

Well-known measure for central tendency ("average"):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad\qquad \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Mid-range

Average of the largest and the smallest values in a data set:

$$(\max + \min)/2$$

- Algebraic measures
- Applicable to numerical data only (require sum and scalar multiplication)
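A minimal sketch of the three measures, with illustrative values and weights:

```python
xs = [2.0, 4.0, 9.0]
ws = [1.0, 1.0, 2.0]          # illustrative weights

mean = sum(xs) / len(xs)                                   # 5.0
weighted = sum(w * x for w, x in zip(ws, xs)) / sum(ws)    # 6.0
mid_range = (max(xs) + min(xs)) / 2                        # 5.5
```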

What about categorical data?


Measuring the Central Tendency

Median

- Middle value if odd number of values
- For even number of values: average of the two middle values (numeric case), or one of the two middle values (non-numeric case)
- Applicable to ordinal data only (an ordering is required)
- Holistic measure

Examples

- never, never, never, rarely, rarely, often, usually, usually, always
- tiny, small, big, big, big, big, big, big, huge, huge
- tiny, tiny, small, medium, big, big, large, huge

What if there is no ordering?
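A sketch of the case distinction above; the ordinal encoding in the second call is an assumption for illustration:

```python
def median(values, numeric=True):
    """Median of an ordered series; holistic: needs all values at once."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                        # odd: the unique middle value
    if numeric:
        return (s[n // 2 - 1] + s[n // 2]) / 2  # even, numeric: average the two middle values
    return s[n // 2 - 1]                        # even, non-numeric: pick one of the two

print(median([5, 1, 3, 2]))   # 2.5
# Ordinal data: encode the domain order first (never = 0 < rarely = 1 < often = 2, ...)
print(median([0, 0, 0, 1, 1, 2, 3, 3, 4], numeric=False))   # 1, i.e., 'rarely'
```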

Measuring the Central Tendency

[Figure: two frequency distributions over the values red, green, blue: a unimodal one (single peak) and a bimodal one (two peaks)]

Mode

- Value that occurs most frequently in the data
- Example: blue, red, blue, yellow, green, blue, red → mode is blue
- Unimodal, bimodal, trimodal, ...: there are 1, 2, 3, ... modes in the data (multi-modal in general), cf. mixture models
- There is no mode if each data value occurs only once
- Well suited for categorical (i.e., non-numerical) data
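A minimal sketch using collections.Counter; it also detects multi-modality:

```python
from collections import Counter

data = ["blue", "red", "blue", "yellow", "green", "blue", "red"]
counts = Counter(data)
top = max(counts.values())
modes = [v for v, c in counts.items() if c == top]   # ['blue'] -> unimodal
# len(modes) == 1: unimodal, 2: bimodal, ...; no meaningful mode if top == 1
```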


Measuring the Dispersion of Data

Variance

- Applicable to numerical data, scalable computation:

$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2\right)$$

- Calculation by two passes: numerically much more stable
- Single pass: calculate the sum of squares and the square of the sum in parallel
- Measures the spread around the mean
- It is zero if and only if all the values are equal
- Standard deviation: square root of the variance
- Both the standard deviation and the variance are algebraic
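A sketch contrasting the two computation schemes named above; the single-pass variant maintains sum and sum of squares in parallel:

```python
import math

def variance_two_pass(xs):
    """Numerically stable: first pass for the mean, second for squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_single_pass(xs):
    """One pass over the data: keep count, sum, and sum of squares in parallel."""
    n = s = sq = 0
    for x in xs:
        n, s, sq = n + 1, s + x, sq + x * x
    return (sq - s * s / n) / (n - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
assert math.isclose(variance_two_pass(xs), variance_single_pass(xs))
```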


Boxplot Analysis

Five-number summary of a distribution

- Minimum, Q1, Median, Q3, Maximum
- Represent the 0%, 25%, 50%, 75%, 100% quantiles of the data
- Quantiles are also called percentiles, e.g., Q1 is the "25th percentile"

Boxplot

- Box boundaries: first and third quartiles
- Box height: inter-quartile range (IQR = Q3 - Q1)
- The median is marked by a line within the box
- Whiskers: minimum and maximum
- Outliers: usually values more than 1.5 · IQR below Q1 or above Q3
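A small NumPy sketch computing the five-number summary and the 1.5 · IQR outlier rule on illustrative data:

```python
import numpy as np

xs = np.array([5, 7, 8, 8, 9, 11, 13, 13, 14, 15, 17, 23, 26, 32, 39, 97])

q0, q1, q2, q3, q4 = np.quantile(xs, [0.0, 0.25, 0.5, 0.75, 1.0])
print("five-number summary:", q0, q1, q2, q3, q4)

# Conventional outlier rule: more than 1.5 * IQR below Q1 or above Q3.
iqr = q3 - q1
outliers = xs[(xs < q1 - 1.5 * iqr) | (xs > q3 + 1.5 * iqr)]
print("outliers:", outliers)   # 97 lies far above the upper fence
```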


Boxplot Example

[Figure: boxplots of sepal length (cm), roughly 4.5 to 8.0, per class setosa, versicolor, virginica of the Iris dataset]


Data Generalization

- Which partitions of the data to aggregate?
  - All data: overall mean and overall variance are too coarse (overgeneralized)
- Different techniques to form groups for aggregation:
  - Binning: histograms, based on value ranges
  - Generalization: abstraction based on generalization hierarchies
  - Clustering (see later): based on object similarity


Binning Techniques: Histograms

[Figure: equi-width histogram of sepal length (mm) for the Iris dataset, bins 40-44, 45-49, ..., 75-79; counts up to about 30]

- Histograms use binning to approximate data distributions
- Divide the data into bins and store a representative (sum, average, median) for each bin
- Popular method for data reduction and analysis
- Related to quantization problems
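A minimal NumPy sketch of this reduction: bin the data and keep one representative per bin (here the bin mean); values and bin count are illustrative:

```python
import numpy as np

values = np.array([5, 7, 8, 8, 9, 11, 13, 14, 17, 19, 23, 26, 32, 39])

counts, edges = np.histogram(values, bins=4)
# Store one representative per bin instead of the raw data.
ids = np.clip(np.digitize(values, edges) - 1, 0, len(counts) - 1)
reps = [values[ids == b].mean() for b in range(len(counts))]
print(counts, reps)
```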


Equi-width Histograms

- Divide the range into N intervals of equal size: uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals is (B - A)/N

Positive

- Most straightforward

Negative

- Outliers may dominate the presentation
- Skewed data is not handled well


Equi-width Histograms

Example

- Sorted data, 10 bins: 5, 7, 8, 8, 9, 11, 13, 13, 14, 14, 14, 15, 17, 17, 17, 18, 19, 23, 24, 25, 26, 26, 26, 27, 28, 32, 34, 36, 37, 38, 39, 97
- Insert 999 (see the sketch after the figure)

[Figure: left, equi-width histogram with bins 0-9, 10-19, ..., 90-99 (counts up to 12); right, after inserting the value 999, bins 0-99, 100-199, ..., 900-999: nearly all values fall into the first bin]
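A sketch reproducing this effect with NumPy on the data above:

```python
import numpy as np

data = [5, 7, 8, 8, 9, 11, 13, 13, 14, 14, 14, 15, 17, 17, 17, 18,
        19, 23, 24, 25, 26, 26, 26, 27, 28, 32, 34, 36, 37, 38, 39, 97]

# Equi-width: N bins of width (B - A) / N over the value range.
counts, edges = np.histogram(data, bins=10, range=(0, 100))
print(counts)            # spread over the bins 0-9, 10-19, ..., 90-99

# A single extreme outlier stretches the range and dominates the picture:
counts2, edges2 = np.histogram(data + [999], bins=10, range=(0, 1000))
print(counts2)           # almost everything collapses into the first bin 0-99
```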


Equi-height Histograms

Divide the range into N intervals, each containing approximately the same number of samples (quantile-based approach).

Positive

- Good data scaling

Negative

- If any single value occurs often, the equal-frequency criterion might not be met (intervals have to be disjoint!)


Equi-height Histograms

Example

- Same data, 4 bins: 5, 7, 8, 8, 9, 11, 13, 13, 14, 14, 14, 15, 17, 17, 17, 18, 19, 23, 24, 25, 26, 26, 26, 27, 28, 32, 34, 36, 37, 38, 39, 97

[Figure: equi-height histogram with bins 5-13, 14-18, 19-27, 28-99, each holding 8 values]

- Median = 50% quantile
- More robust against outliers (cf. value 999 from above)
- The four-bin example is strongly related to the boxplot
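A sketch of the quantile-based construction with NumPy on the same data:

```python
import numpy as np

data = np.array([5, 7, 8, 8, 9, 11, 13, 13, 14, 14, 14, 15, 17, 17, 17, 18,
                 19, 23, 24, 25, 26, 26, 26, 27, 28, 32, 34, 36, 37, 38, 39, 97])

# Equi-height: bin edges are quantiles, so each bin holds ~ the same count.
edges = np.quantile(data, [0.0, 0.25, 0.5, 0.75, 1.0])
counts, _ = np.histogram(data, bins=edges)
print(edges)    # compare: 5-13, 14-18, 19-27, 28-99 on the slide
print(counts)   # roughly 8 values per bin; the outlier 97 no longer dominates
```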


Concept Hierarchies: Examples

No (real) hierarchies

Set grouping hierarchies


Concept Hierarchies: Examples

Schema hierarchies


Concept Hierarchy for Categorical Data

- Concept hierarchies can be specified by experts or just by users
- Heuristically generate a hierarchy for a set of (related) attributes, based on the number of distinct values per attribute in the attribute set
- The attribute with the most distinct values is placed at the lowest level of the hierarchy:

  country: 15 distinct values
  province/state: 65 distinct values
  city: 3,567 distinct values
  street: 674,339 distinct values

- Fails for counterexamples: 20 distinct years, 12 months, and 7 days of the week would yield "year < month < day of week" with the latter on top, which is wrong
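A minimal sketch of this heuristic (the function name is hypothetical):

```python
# Heuristic from above: order (related) attributes by their number of
# distinct values; the most distinct attribute becomes the lowest level.
def build_hierarchy(distinct_counts: dict[str, int]) -> list[str]:
    """Return attribute names from highest (most general) to lowest level."""
    return sorted(distinct_counts, key=distinct_counts.get)

counts = {"country": 15, "province/state": 65, "city": 3567, "street": 674339}
print(build_hierarchy(counts))
# ['country', 'province/state', 'city', 'street'] - but the same heuristic
# would wrongly put 'day of week' (7 values) on top of 'year' (20 values)
```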


Summarization-based Aggregation

Data Generalization

A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.

[Figure: conceptual levels example with five levels: 1 all, 2 federal states, 3 states, 4 countries, 5 cities]

- Approaches:
  - Data-cube approach (OLAP / roll-up): manual
  - Attribute-oriented induction (AOI): automated


Basic OLAP Operations

Roll up

Summarize data by climbing up the hierarchy or by dimension reduction.

Drill down

Reverse of roll-up: from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions.

Slice and dice

Selection on one (slice) or more (dice) dimensions.

Pivot (rotate)

Reorient the cube: a visualization operation, e.g., turning a 3D cube into a series of 2D planes.


Example: Roll up / Drill down

Query

SELECT *
FROM business
GROUP BY country, quarter

Roll-Up

SELECT *
FROM business
GROUP BY continent, quarter

SELECT *
FROM business
GROUP BY country

Drill-Down

SELECT *
FROM business
GROUP BY city, quarter

SELECT *
FROM business
GROUP BY country, quarter, product


Example: Roll up in a Data Cube

[Figure: roll-up of a data cube, aggregating the detailed cube into a coarser one]


Example: Slice Operation

SELECT income
FROM time t, product p, country c
WHERE p.name = 'VCR'

The 'VCR' slice of the product dimension is chosen.


Example: Dice Operation

SELECT income
FROM time t, product p, country c
WHERE (p.name = 'VCR' OR p.name = 'PC') AND t.quarter BETWEEN 2 AND 3

The sub-data cube over PC, VCR and quarters 2 and 3 is extracted. (Note the parentheses: AND binds more strongly than OR, so without them the condition would also select all quarters for 'VCR'.)

[Figure: dice operation extracting the sub-cube]


Example: Pivot (rotate)

year    | 17        | 18        | 19
product | TV PC VCR | TV PC VCR | TV PC VCR
...     | ...       | ...       | ...

↓ Pivot (rotate) ↓

product | TV       | PC       | VCR
year    | 17 18 19 | 17 18 19 | 17 18 19
...     | ...      | ...      | ...
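In pandas terms, pivoting swaps which dimension forms the rows and which the columns; a sketch with a hypothetical fact table:

```python
import pandas as pd

sales = pd.DataFrame({                      # hypothetical fact table
    "year":    [17, 17, 18, 18, 19, 19],
    "product": ["TV", "PC", "TV", "PC", "TV", "VCR"],
    "income":  [10, 12, 11, 15, 13, 2],
})

# One orientation: products as rows, years as columns ...
print(pd.pivot_table(sales, values="income", index="product", columns="year"))

# ... pivot (rotate): swap the roles of the two dimensions.
print(pd.pivot_table(sales, values="income", index="year", columns="product"))
```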


Basic OLAP Operations

Other operations

- Drill across: involving (across) more than one fact table
- Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)


Specifying Generalization by a Star-Net

- Each circle is called a footprint
- Footprints represent the granularities available for OLAP operations

[Figure: star-net with the center "Customer Orders" and radiating dimension lines Time, Location, Organization, Product; example footprints include Segment (Product), Daily (Time), and District (Location)]


Discussion of OLAP-based Generalization

- Strength
  - Efficient implementation of data generalization
  - Computation of various kinds of measures, e.g., count, sum, average, max
  - Generalization (and specialization) can be performed on a data cube by roll-up (and drill-down)
- Limitations
  - Handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values
  - Lack of intelligent analysis: cannot tell which dimensions should be used or what levels the generalization should reach


Attribute-Oriented Induction (AOI)

- Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
- Data focusing: select the task-relevant data, including dimensions; the result is the initial relation
- Generalization plan: perform generalization by either attribute removal or attribute generalization


Attribute-Oriented Induction (AOI)

Attribute Removal

Remove attribute A if:

- there is a large set of distinct values for A but there is no generalization operator (concept hierarchy) on A, or
- A's higher-level concepts are expressed in terms of other attributes (e.g., street is covered by city, state, country).

Attribute Generalization

If there is a large set of distinct values for A, and there exists a set of generalization operators (i.e., a concept hierarchy) on A, then select an operator and generalize A.
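A hedged pandas sketch of one AOI pass combining both rules with the count-based aggregation from above; the threshold value and the hierarchy encoding (a dict mapping each value to its higher-level concept) are assumptions:

```python
import pandas as pd

THRESHOLD = 5   # assumed attribute threshold (see generalization control below)

def aoi_step(df: pd.DataFrame, hierarchies: dict) -> pd.DataFrame:
    """One AOI pass: remove or generalize attributes, then merge with counts."""
    out = df.copy()
    for col in df.columns:
        if out[col].nunique() <= THRESHOLD:
            continue                                    # few values: keep as is
        if col in hierarchies:
            out[col] = out[col].map(hierarchies[col])   # attribute generalization
        else:
            out = out.drop(columns=col)                 # attribute removal
    # Merge identical generalized tuples and accumulate their counts.
    return out.groupby(list(out.columns)).size().reset_index(name="Count")
```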


Attribute Oriented Induction: Example

Name           | Gender | Major   | Birth place           | Birth date | Residence                | Phone    | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-81    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-80    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-75    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
...            | ...    | ...     | ...                   | ...        | ...                      | ...      | ...

- Name: large number of distinct values, no hierarchy → removed
- Gender: only two distinct values → retained
- Major: many values, hierarchy exists → generalized to Sci., Eng., Biz.
- Birth place: many values, hierarchy exists → generalized, e.g., to country
- Birth date: many values → generalized to age (or age range)
- Residence: many streets and numbers → generalized to city
- Phone number: many values, no hierarchy → removed
- Grade point avg (GPA): hierarchy exists → generalized to good, ...
- Count: additional attribute to aggregate base tuples


Attribute Oriented Induction: Example

- Initial relation: as on the previous slide


- Prime generalized relation:

Gender | Major   | Birth region | Age range | Residence | GPA       | Count
M      | Science | Canada       | 20-25     | Richmond  | Very good | 16
F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22
...    | ...     | ...          | ...       | ...       | ...       | ...

- Crosstab for the generalized relation:

      | Canada | Foreign | Total
M     | 16     | 14      | 30
F     | 10     | 22      | 32
Total | 26     | 36      | 62
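Such a crosstab can be computed, e.g., with pandas; a sketch over the two excerpt rows shown above (the slide's full table aggregates more tuples, so some cells remain empty here):

```python
import pandas as pd

prime = pd.DataFrame({                       # prime generalized relation (excerpt)
    "Gender": ["M", "F"],
    "Birth_region": ["Canada", "Foreign"],
    "Count": [16, 22],
})

ct = pd.crosstab(index=prime["Gender"], columns=prime["Birth_region"],
                 values=prime["Count"], aggfunc="sum", margins=True)
print(ct)   # 'All' row/column gives the totals
```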


Attribute Generalization Control

- Problem: How many distinct values should an attribute have?
  - Overgeneralization: values are too high-level
  - Undergeneralization: level not sufficiently high
  - Both yield tuples of poor usefulness
- Two common approaches
  - Attribute-threshold control: default or user-specified, typically 2-8 values
  - Generalized-relation threshold control: control the size of the final relation/rule, e.g., 10-30


Strategies for Selecting the Next Attribute for Generalization

- Aiming at a minimal degree of generalization
  - Choose the attribute that reduces the number of tuples the most
  - Useful heuristic: choose the attribute with the highest number of distinct values
- Aiming at a similar degree of generalization for all attributes
  - Choose the attribute currently having the least degree of generalization
- User-controlled
  - Domain experts may specify appropriate priorities for the selection of attributes
