
Mining Approximate Functional Dependencies (AFDs) as Condensed Representations of Association Rules

Master’s Thesis Defense by Aravind Krishna Kalavagattu

Committee Members: Dr. Subbarao Kambhampati (chair), Dr. Yi Chen, Dr. Huan Liu

Database Systems

• Well-defined schema and method for querying (SQL)

• Query optimization

• Lately, some systems have started supporting IR-style answering of user queries

Data mining

• Discovering useful patterns from data

• Rule learning is a well-researched method for discovering interesting relations between variables in large databases

• Association Rules

Rule mining over databases, with several applications

Introduction to AFDs

• Approximate Functional Dependencies (AFDs) are rules denoting approximate determinations at the attribute level.
• AFDs are of the form (X ~~> Y), where X and Y are sets of attributes.
  • X is the “determining set” and Y is the “dependent set”.
  • Rules with singleton dependent sets are of high interest.
• A classic example of an AFD: (Nationality ~~> Language)
  • Indicates that we can approximately guess the language of a person if we know which country she is from.
• More examples: Make ~~> Model; (Job Title, Experience) ~~> Salary

Introduction (contd.)

• Functional Dependency (FD): given a relation R, a set of attributes X in R is said to functionally determine another attribute Y, also in R (written X → Y), if and only if each X value is associated with precisely one Y value.
• AFDs can be loosely defined as FDs that approximately hold: some exception rows fail to satisfy the dependency over the current relation.
• Example: Make ~~> Model (with error = 0.3) means that 70% of the tuples satisfy the dependency.

Applications of AFDs

• Predicting missing values of attributes in relational tables (QPIAD)
  • Using the values of the attributes in the determining set of an AFD
• Query optimization (CORDS, BHUNT)
  • Maintaining correct selectivity estimates
• Query rewriting (AIMQ, QPIAD, QUIC)
  • Example: Model ~~> BodyStyle. Rewrite a query on Model = “RAV4” to retrieve tuples with BodyStyle = “SUV”.
• Database design (database normalization, efficient storage)
  • Similar to the way FDs are used

FD Mining and Implications

• FD mining aims at finding a minimal cover: the minimum set of FDs from which the entire set of FDs can be generated.
  • Example: if A → B is an FD, then ({A, C} → B) is considered redundant.
• Can we likewise generate only minimal dependencies in the case of AFDs?
• No, because a non-minimal AFD (Z ~~> B), with Z a superset of A, may be interesting for the application, and we may prefer it to A ~~> B.
  • Non-minimal dependencies perform better in QPIAD, QUIC, etc.
  • Example: (JobTitle, Experience) ~~> Salary vs. (JobTitle ~~> Salary)

Performance Concerns

• AFD mining is costly.
  • The pruning strategies of FDs are not applicable in the case of AFDs.
  • For datasets with a large number of attributes, the search space gets worse!
• The method for determining whether a dependency holds or not is costly.
• The way to traverse the search space is tricky.
  • Bottom-up vs. top-down?

Quality Concerns

• Before algorithms for discovering AFDs can be developed, AFDs need better interestingness measures.
  • AFDs used as feature selectors in classification are expected to give good accuracy.
  • AFDs used in query rewriting are expected to give a high throughput per query.
• (VIN ~~> Make) vs. (Model ~~> Make)
  • (VIN ~~> Make) looks good using the error metric.
  • But intuitively (as well as practically), (Model ~~> Make) is a better AFD.

Challenges in AFD Mining

1. Defining right interestingness measures

2. Performing an efficient traversal in the search space of possible rules

3. Employing effective pruning strategies

Agenda/Outline

• Introduction
• Related Work
• Provide a new perspective for AFDs
  • Roll-ups/condensed representations of association rules
• Define measures for AFDs
• Present the AFDMiner algorithm
• Experimental Results
  • Performance
  • Quality

Related Work

• FD mining algorithms
  • Aim at finding a minimal cover
  • DepMiner, FUN, TANE, FD_Mine
  • Do not work well for AFDs
• Existing approximation measures for AFDs
  • Tau, InD metrics
  • Metrics do not seem to matter in practice
  • No accompanying algorithm to mine AFDs
• Grouping association rules
  • Clustering association rules (v1 ~> u, v2 ~> u as (v1 ^ v2 ~> u))
  • No one combines them as AFDs

Existing AFD Miners

• CORDS
  • SoftFDs (C1 => C2)
  • Uses |C1,C2| / (|C1| |C2|) as the approximation measure
  • Restricted to singleton determining sets
  • Works from a sample
  • Measure used is not appropriate
• AIMQ/QPIAD/QUIC
  • TANE
  • Post-processing over TANE
  • Highly inefficient
  • Quality of some AFDs is bad





Condensing Association Rules

• Viewing database relations as transactions
  • Itemsets ≈ attribute-value pairs
• Association rules
  • Usually between itemsets, e.g. Beer ~> Diapers
  • Here, they are between attribute-value pairs
  • Example: (Toyota, Camry) ~> Sedan
• AFDs are rules between attributes
  • Each AFD corresponds to a lot of association rules sharing the same attributes

Rolling up association rules as AFDs

• Honda ~~> Accord, Toyota ~~> Camry, Tata ~~> Maruti800, … roll up to Make ~~> Model

Confidence

• Consider an association rule of the form (α → β).
  • Confidence denotes the conditional probability of β (head) given α (body).
• Similarly, for an AFD (X ~~> A), Confidence should denote the chance of finding the value of A given the values of X.
• Define AFD Confidence in terms of the confidence of association rules.
  • Specifically, pick the best association rule for every distinct value combination of the body and add up the supports of the picked rules.

Confidence

• For the example carDB, Confidence(Make ~~> Model) = Support(Make:Honda ~~> Model:Accord) + Support(Make:Toyota ~~> Model:Camry) = 3/8 + 2/8 = 5/8
• Interestingly, this is equal to (1 - g3).
  • g3 has a natural interpretation as the fraction of tuples with exceptions affecting the dependency.
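The roll-up of association rules into an AFD Confidence can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the table `CAR_DB` and the function name `afd_confidence` are hypothetical, and only the Honda/Accord (3 tuples) and Toyota/Camry (2 tuples) counts come from the slides; the remaining models are made up to fill the 8 rows.

```python
from collections import Counter, defaultdict

# Hypothetical 8-tuple car table. Only the Honda/Accord and Toyota/Camry
# counts are taken from the slides; "Civic" and "Corolla" are filler values.
CAR_DB = [
    {"Make": "Honda", "Model": "Accord"},
    {"Make": "Honda", "Model": "Accord"},
    {"Make": "Honda", "Model": "Accord"},
    {"Make": "Honda", "Model": "Civic"},
    {"Make": "Honda", "Model": "Civic"},
    {"Make": "Toyota", "Model": "Corolla"},
    {"Make": "Toyota", "Model": "Camry"},
    {"Make": "Toyota", "Model": "Camry"},
]


def afd_confidence(rows, lhs, rhs):
    """Sum the support of the best association rule for each lhs value."""
    heads_by_body = defaultdict(Counter)
    for row in rows:
        body = tuple(row[a] for a in lhs)
        heads_by_body[body][row[rhs]] += 1
    # For every distinct body value, keep only its most frequent head.
    best_rule_tuples = sum(max(h.values()) for h in heads_by_body.values())
    return best_rule_tuples / len(rows)


conf_make_model = afd_confidence(CAR_DB, ["Make"], "Model")
conf_model_make = afd_confidence(CAR_DB, ["Model"], "Make")
print(conf_make_model, conf_model_make)
```

With this table, Make ~~> Model picks Honda/Accord (3 tuples) and Toyota/Camry (2 tuples), which reproduces the 5/8 from the slide, while Model ~~> Make has no exception rows.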

Specificity

• For an association rule (α → β), Support is the probability with which the conditioning event (i.e., α) occurs.
  • A rule with high confidence yet low support is a bad rule!
• The presence of a lot of association rules with low support makes the AFD bad.
  • In classification, this affects prediction accuracy.
  • For query rewriting tasks, per-query throughput is lower.

Types of AFDs

1. Model ~~> Make: few branches, uniform distribution
   • Good, and might hold universally
   • Rolls up Accord ~~> Honda, Camry ~~> Toyota, Maruti800 ~~> Tata, …
2. VIN ~~> Make: many branches, uniform distribution
   • Bad: the confidence of each association rule is high, but the supports are low
3. (Model, Location) ~~> Price: many branches, skewed distribution
   • A few association rules with high support and many with low support

Specificity

• The Specificity measure captures our intuition about the different types of AFDs.
• It is based on information entropy.
• The higher the Specificity (above a threshold), the worse the AFD is!
• Shares similar motivations with the way SplitInfo is defined in decision trees while computing Information Gain Ratio.
• Follows monotonicity.
• Normalized with the worst-case Specificity, i.e., the case where X is a key.
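The slides do not give the formula for Specificity, so the following Python sketch is only one plausible reading of the bullets above: the entropy of the value distribution of X, divided by log2(N), which is the entropy reached when X is a key. The function name `specificity` and the sample rows (including the VIN values) are assumptions made for illustration.

```python
import math
from collections import Counter


def specificity(rows, attrs):
    """Assumed form: entropy of the attrs value distribution over log2(N)."""
    n = len(rows)
    if n <= 1:
        return 0.0
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # A key puts every tuple in its own class, giving entropy log2(n).
    return entropy / math.log2(n)


# Hypothetical rows: VIN is unique per tuple, Make has only two values.
MAKES = ["Honda"] * 5 + ["Toyota"] * 3
MODELS = ["Accord"] * 3 + ["Civic"] * 2 + ["Corolla"] + ["Camry"] * 2
ROWS = [
    {"VIN": f"v{i}", "Make": make, "Model": model}
    for i, (make, model) in enumerate(zip(MAKES, MODELS), start=1)
]

spec_vin = specificity(ROWS, ["VIN"])
spec_make = specificity(ROWS, ["Make"])
spec_model = specificity(ROWS, ["Model"])
spec_make_model = specificity(ROWS, ["Make", "Model"])
print(spec_vin, spec_make, spec_model, spec_make_model)
```

Under this reading, VIN (a key) gets the worst value of 1, Make gets a low value, and adding attributes to a determining set never lowers the value, which matches the monotonicity claim.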





AFD Mining Problem

• Good AFDs are the ones within the desired thresholds of the Confidence and Specificity measures.
• Formally, the AFD mining problem can be stated as follows: given a relational table and the thresholds minConf and maxSpecificity, find all AFDs (X ~~> A) with Confidence(X ~~> A) > minConf and Specificity(X) < maxSpecificity.

AFD Mining

• The problem of AFD mining is to learn all AFDs that hold over a given relational table.
• Two costs:
  1. The major cost is the combinatorial cost of traversing the search space.
  2. The cost of visiting the data to validate each rule (to compute the interestingness measures).
• The search process for AFDs is exponential in the number of attributes.

Pruning Strategies

1. Pruning by Specificity
   • Specificity(Y) ≥ Specificity(X), where Y is a superset of X.
   • If Specificity(X) > maxSpecificity, we can prune all AFDs with X or any of its supersets as the determining set.
2. Pruning (applicable to FDs)
   • If (X → A) is an FD, all AFDs of the form (Y ~~> A), where Y is a superset of X, can be pruned.
3. Pruning keys
   • Needed for FDs.
   • But this is subsumed by case 1 in AFDMiner, because Specificity(X) = 1 means that X is a key.

AFDMiner Algorithm

• The search starts from singleton sets of attributes and works its way to larger attribute sets through the set containment lattice, level by level.
• When the algorithm is processing a set X, it tests AFDs of the form ((X \ {A}) ~~> A), where A ∈ X.
• Information from previous levels is captured by maintaining RHS+ candidate sets for each set.

Traversal in the Search Space

• During the bottom-up breadth-first search, the stopping criteria at a node are:
  1. The AFD confidence becomes 1, and thus it is an FD (FD-based pruning).
  2. The Specificity value of X is greater than the maximum value given (Specificity-based pruning).
• Example of FD-based pruning: A → C is an FD. Then C is removed from RHS+(ABC).

Computing Confidence and Specificity

• Methods are based on representing attribute sets by equivalence class partitions of the set of tuples.
• ∏X is the collection of equivalence classes of tuples for attribute set X.
• Example:
  • ∏make = {{1, 2, 3, 4, 5}, {6, 7, 8}}
  • ∏model = {{1, 2, 3}, {4, 5}, {6}, {7, 8}}
  • ∏{make ∪ model} = {{1, 2, 3}, {4, 5}, {6}, {7, 8}}
• A functional dependency (X → A) holds if ∏X = ∏X∪A.
• For the AFD (X ~~> A), Confidence = 1 - g3(X ~~> A).
• In this example, Confidence(Model ~~> Make) = 1 and Confidence(Make ~~> Model) = 5/8.
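The partition view above can be reproduced with a short Python sketch. It is a simplified illustration rather than the optimized partition handling a miner such as TANE would use; the function names and the concrete Make/Model values are hypothetical and chosen only so that the partitions match the slide.

```python
from collections import defaultdict

# Hypothetical values arranged so the partitions match the slide:
# tuples 1-5 share a make, tuples 6-8 share the other make.
MAKES = ["Honda"] * 5 + ["Toyota"] * 3
MODELS = ["Accord"] * 3 + ["Civic"] * 2 + ["Corolla"] + ["Camry"] * 2
ROWS = [{"make": mk, "model": md} for mk, md in zip(MAKES, MODELS)]


def partition(rows, attrs):
    """Equivalence classes of 1-based tuple ids that agree on attrs."""
    classes = defaultdict(set)
    for tid, row in enumerate(rows, start=1):
        classes[tuple(row[a] for a in attrs)].add(tid)
    return {frozenset(c) for c in classes.values()}


def fd_holds(rows, lhs, rhs):
    """X -> A holds exactly when adding A does not refine the partition of X."""
    return partition(rows, lhs) == partition(rows, lhs + [rhs])


def confidence(rows, lhs, rhs):
    """1 - g3: keep the largest class of X ∪ {A} inside each class of X."""
    refined = partition(rows, lhs + [rhs])
    kept = 0
    for x_class in partition(rows, lhs):
        kept += max(len(c) for c in refined if c <= x_class)
    return kept / len(rows)


pi_make = partition(ROWS, ["make"])
pi_model = partition(ROWS, ["model"])
print(fd_holds(ROWS, ["model"], "make"), confidence(ROWS, ["make"], "model"))
```

The tuples that are not kept are exactly the exception rows counted by g3, so Make ~~> Model loses 3 of 8 tuples while Model ~~> Make loses none.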

Algorithms

Algorithm AFDMiner, for each level Ll:

• Computes Confidence
• Applies FD-based pruning
• Computes Specificity and applies pruning
• Computes level Ll+1
  • Ll+1 contains only those attribute sets of size l+1 which have all their subsets of size l in Ll
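The level-wise steps above can be sketched as a small self-contained Python miner. This is a simplification under stated assumptions, not AFDMiner itself: it enumerates determining sets directly instead of maintaining RHS+ candidate sets, it uses an assumed entropy-based Specificity (entropy over log2(N)), and the names `mine_afds`, `max_len` and the sample rows are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def specificity(rows, attrs):
    # Assumed form: entropy of the attrs value distribution over log2(N).
    n = len(rows)
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n) if n > 1 else 0.0


def confidence(rows, lhs, rhs):
    # 1 - g3: per lhs value, keep the tuples of its most frequent rhs value.
    heads = defaultdict(Counter)
    for r in rows:
        heads[tuple(r[a] for a in lhs)][r[rhs]] += 1
    return sum(max(h.values()) for h in heads.values()) / len(rows)


def mine_afds(rows, min_conf, max_spec, max_len=3):
    attrs = sorted(rows[0])
    level = [frozenset([a]) for a in attrs]  # determining sets of size 1
    fds = defaultdict(list)  # rhs -> determining sets that are exact FDs
    results = []
    size = 1
    while level and size <= max_len:
        survivors = set()
        for lhs in level:
            cols = sorted(lhs)
            if specificity(rows, cols) > max_spec:
                continue  # Specificity pruning: drop lhs and its supersets
            survivors.add(lhs)
            for rhs in attrs:
                if rhs in lhs or any(fd <= lhs for fd in fds[rhs]):
                    continue  # FD pruning: a subset already determines rhs
                conf = confidence(rows, cols, rhs)
                if conf == 1.0:
                    fds[rhs].append(lhs)
                if conf > min_conf:
                    results.append((tuple(cols), rhs, conf))
        # Next level: sets of size+1 whose subsets of the current size all survived.
        next_level = set()
        for a, b in combinations(survivors, 2):
            cand = a | b
            if len(cand) == size + 1 and all(
                frozenset(s) in survivors for s in combinations(cand, size)
            ):
                next_level.add(cand)
        level = sorted(next_level, key=sorted)
        size += 1
    return results


MAKES = ["Honda"] * 5 + ["Toyota"] * 3
MODELS = ["Accord"] * 3 + ["Civic"] * 2 + ["Corolla"] + ["Camry"] * 2
ROWS = [
    {"VIN": f"v{i}", "Make": mk, "Model": md}
    for i, (mk, md) in enumerate(zip(MAKES, MODELS), start=1)
]
afds = mine_afds(ROWS, min_conf=0.6, max_spec=0.7)
print(afds)
```

On these hypothetical rows, VIN is pruned as a determining set because it is a key, so rules such as VIN ~~> Make are never reported even though their confidence is 1.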





Empirical Evaluation

Experimental Setup

• Data sets
  • CensusDB (199523 tuples, 30 attributes)
  • MushroomDB (8124 tuples, 23 attributes)
• Parameters for AFDMiner
  • minConf
  • maxSpecificity
  • No. of tuples
  • No. of attributes
  • Max length of determining set
• The aim of the experiments is to show that the dual-measure approach (AFDMiner, using both Confidence and Specificity) outperforms the single-measure approach (No_Specificity, which uses Confidence alone).
• No_Specificity: a modified version of AFDMiner that uses only Confidence, not Specificity, for AFDs. Thus, it generates all AFDs (X ~~> A) with Confidence(X ~~> A) > minConf.

Evaluating Quality

• BestAFD: the highest-confidence AFD among all the AFDs with attribute A as their dependent attribute.
• Classification task:
  • The classifier is run with the determining set of BestAFD as features.
  • Used 10-fold cross-validation and computed the average classification accuracy.
  • Weka toolkit
• Evaluated over CensusDB

Evaluating Quality

[Figure: average classification accuracy over all attributes on CensusDB, No_Specificity vs. AFDMiner; y-axis from 82 to 93; minConf = 0.8, maxSpecificity = 0.4]

• Shows that Specificity is effective in generating better-quality AFDs.

Choosing minConf !

Choosing maxSpecificity

• Classification accuracy (by varying maxSpecificity):
  • Threshold too low => good rules are pruned
  • Threshold too high => bad rules are not pruned
• Classification accuracy approximately forms a double-elbow-shaped curve.

[Figure: two CensusDB panels plotted against maxSpecificity (0 to 0.8); the surviving y-axis label is time taken (ms), 0 to 6000]

Choosing maxSpecificity

• Time to compute AFDs:
  • Increases with increasing maxSpecificity
  • Rate of change varies
• A good threshold value for Specificity (i.e., maxSpecificity) is the value at the first elbow in the graph on quality.

[Figure: time taken (ms), 0 to 6000, vs. maxSpecificity (0 to 0.8), with the best value marked]

Query Throughput

[Figure: number of tuples retrieved (0 to 9000) per attribute (2 to 24), AFDMiner vs. No_Specificity]

• No. of tuples returned for top-10 queries on each distinct determining set (denotes query throughput)

Discussion on TANE

• Primarily designed to generate FDs
  • Modified version for generating approximate dependencies
• Uses the error metric g3 for AFDs
• Bottom-up search in the lattice
• Generates only minimal dependencies
• Pruning applicable to FDs

Comparison (AFDMiner vs. TANE)

• TANENOMINP is a modified version of TANE that does not stop with just minimal dependencies.
• minConf is 0.8 (thus, we set the g3 threshold to 0.2).
• AFDMiner outperforms both approaches, strengthening the argument that AFDs with high Confidence and reasonable Specificity are the best.

Evaluating Performance

• Time varies linearly with the number of tuples.
  • AFDMiner takes less time than No_Specificity.
• Time varies exponentially with the number of attributes.
  • AFDMiner completes much faster than No_Specificity.

[Figure (CensusDB): time taken (ms) vs. number of tuples (0 to 12000), No_Specificity vs. AFDMiner]

[Figure (CensusDB): time taken (ms) vs. number of attributes (0 to 35), No_Specificity vs. AFDMiner]

Evaluating Performance

[Figure: time taken (ms) vs. number of tuples (0 to 10000), No_Specificity vs. AFDMiner]

[Figure: time taken (ms) vs. number of attributes (0 to 25), No_Specificity vs. AFDMiner]

[Figure (CensusDB): number of candidates visited (0 to 160000) vs. length of determining set in each AFD, No_Specificity vs. AFDMiner]

[Figure: time taken (ms) vs. length of determining set in each AFD, No_Specificity vs. AFDMiner; the surviving dataset labels are MushroomDB and CensusDB]

• These experiments show that AFDMiner is fast.

Conclusion

• Introduced a novel perspective for AFDs
  • Condensed roll-ups of association rules
• Two metrics for AFDs
  • Confidence
  • Specificity
• Algorithm AFDMiner
  • Finds all AFDs (Confidence > minConf; Specificity < maxSpecificity)
  • Bottom-up search in a breadth-first manner in the set containment lattice of attributes
  • Pruning based on Specificity
• Experiments: AFDMiner generates high-quality AFDs faster.
  • AFDs with high Confidence and reasonable Specificity
• A version of this thesis is currently under review at ICDE ’09.

Future Direction

• Conditional Functional Dependencies (CFDs)
  • Dependencies of the form ({ZipCode → City} if Country = "England"), i.e., holding true only for certain values of one or more of the other attributes.
• CAFDs are the probabilistic counterpart of CFDs.
• CFDs and CAFDs have recently been applied in data cleaning and value prediction, but mining these conditional rules is unexplored.
• Intuitively, CFDs are intermediate rules between association rules (value level) and FDs (attribute level). So, we believe that our approach can help in generating them!

Questions ?
