DISCOVERING TOPICAL STRUCTURES OF DATABASES Professor: Michalis Petropoulos CSE 705 Megha Ramesh Kumar


Page 1

DISCOVERING TOPICAL STRUCTURES OF DATABASES

Professor: Michalis Petropoulos
CSE 705
Megha Ramesh Kumar

Page 2

Topics to be covered

Introduction
Problem Definition
The iDisc Approach
Handling Complex Aggregations
Finding Cluster Representatives
Empirical Evaluation
Related Work
Conclusion

Page 3

Overview

Discovering topical structures of databases to support semantic browsing and large-scale data integration.

iDisc is a multi-strategy learning framework. It exploits instance values to construct multiple database representations. It employs a set of base clusterers to discover preliminary topical clusters of tables from the database representations & aggregates them into final clusters via meta-clustering.

Page 4

Difficulties in Integrating Databases I

Documentation and metadata for enterprise databases are often scattered throughout the IT department, and they are frequently incomplete, inaccurate, or missing.

The scale of the databases, combined with the lack of documentation, increases the cost of integrating them.

Designers may have left the company without leaving any design documents.

The implementation of the databases might not be consistent with the design.

Page 5

Difficulties in Integrating Databases II

Reverse engineering and integrating the databases becomes difficult.

Key step: identify the semantic correspondences (mappings) among attributes from different databases.

Existing matching solutions attempt to find mappings between every two attributes.

Assumptions made:
Databases are small.
Attributes in one database are potentially relevant to all attributes in another database.

Page 6

Without Topical Structures

Page 7

With Topical Structures

Page 8

Gist

We formally define the problem of discovering topical structures of databases and demonstrate how they can support semantic browsing & large-scale data integration.

We propose a novel multi-strategy discovery framework & describe the iDisc system which realizes this framework.

We propose novel clustering aggregation techniques to address limitations of existing solutions.

We propose a new approach to finding representative tables using a novel measure of table importance.

iDisc is evaluated over real-world databases, and the results indicate that it discovers topical clusters with a high level of accuracy.

Page 9

Problem Definition I

Topical Relationship
Consider a database D with a set of tables, where each table T is associated with a topic p ∈ P, denoted topic(T) = p.

There exists a topical relationship between tables S and T if topic(S) = topic(T), denoted ρ(S, T).

Example: Suppose P = {Invoice, Shipment, Product}. Then ρ(InvoiceItem, InvoiceTerm), since topic(InvoiceItem) = topic(InvoiceTerm).

Properties: ρ is reflexive, symmetric, and transitive, so it is an equivalence relation; it partitions the tables into mutually exclusive equivalence classes.

Page 10

Problem Definition II

Topical Structure
Describes how the tables in a database are grouped based on their topical relationships.

Consider a set of topics P, a database D, and the topical relationship ρ between the tables in D with respect to P.

The topical structure of InvDB with respect to P is {C1, C2, C3}, where
C1 = {InvoiceStatus, InvoiceTerm, Invoice, InvoiceItem}
C2 = {Shipment, ShipmentMethod}
C3 = {Product, ProductCategory, Category}

Problem Definition
Given a database D with a set of tables, discover:
A set of topics P, which the tables in D are about;
The topical structure of D with respect to P, in the form of a partition C = {C1, C2, ..., Ck} over the tables in D, where k = |P|.

Page 11

The iDisc Approach

Page 12

Model Builder

The Model Builder examines database D with a set of tables from a number of perspectives and obtains a variety of representations for D.

It constructs varied representations for the database from its schema information and instance data.

These representations fall into three categories:
Vector-based
Graph-based
Similarity-based

Page 13

Vector-Based Representations

A vector-based representation captures the topical structure of a database via the descriptive information on the tables:
Table → text document
Database → collection of documents
The structures of the tables are ignored.
Different ways of constructing these documents result in different representations of the database.

Suppose the number of unique tokens among the documents for the tables in database D is n. Each document d is then an n-dimensional vector <w1, w2, ..., wn>, where the i-th dimension corresponds to the i-th unique token in D and wi is the weight of that token for document d. Many weighting functions can be used, e.g., the TF * IDF weight.
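As a minimal sketch (not iDisc's actual implementation), such a representation could be built with scikit-learn's TfidfVectorizer; the table documents below are hypothetical stand-ins for InvDB:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per table, e.g. concatenated table and attribute names.
# These document strings are hypothetical, for illustration only.
docs = {
    "Invoice":     "invoice id customer date amount",
    "InvoiceItem": "invoice item id product quantity price",
    "Shipment":    "shipment id method ship date",
    "Product":     "product id name category price",
}

vectorizer = TfidfVectorizer()                      # TF * IDF weighting
vectors = vectorizer.fit_transform(docs.values())  # |D| x n document-term matrix
print(vectorizer.get_feature_names_out())           # the n unique tokens in D
```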

Page 14

InvDB - an invoice management database

Page 15

Graph-Based Representations

A graph-based representation captures the topical structure of a database via the linkage among the tables:
Nodes: tables
Edges: linkages between the tables

iDisc discovers primary keys and then discovers foreign keys. Due to the cost of enforcing constraints, information on keys and foreign keys is often missing from the catalogs.

Consider a key attribute A in T1 and an attribute B in T2. B is taken to reference A if:
|B ∩ A| = |B| (B's values are contained in A's)
|B| > 0.8 |A|
|B| > 2
NameSim(A, B) > 0.5, where NameSim is a measure of the similarity of attribute names.
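A hedged sketch of this foreign-key heuristic; the name_sim stand-in below (difflib's SequenceMatcher ratio) is an assumption, not iDisc's actual NameSim measure:

```python
from difflib import SequenceMatcher

def name_sim(a: str, b: str) -> float:
    """Assumed stand-in for iDisc's NameSim measure on attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def looks_like_foreign_key(a_name, a_values, b_name, b_values) -> bool:
    """Test whether attribute B plausibly references key attribute A."""
    A, B = set(a_values), set(b_values)
    return (len(B & A) == len(B)           # |B ∩ A| = |B|: containment in A
            and len(B) > 0.8 * len(A)      # B covers most of A's values
            and len(B) > 2                 # non-trivial number of values
            and name_sim(a_name, b_name) > 0.5)
```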

Page 16

Similarity-Based Representations I

A similarity-based representation captures the topical structure of a database via the value-based similarity between the tables.

Idea: if two tables are about the same topic, they may have several attributes containing similar values.

The representation is a |D| × |D| matrix M, where |D| is the number of tables in D and M[i, j] stores the similarity between the i-th and j-th tables in D.

Different ways of evaluating similarity result in different representations for the database.

iDisc procedure:
Evaluate value similarity between attributes.
Discover matching attributes.
Evaluate table similarity.

Page 17

Similarity-Based Representations II

Evaluate value similarity between attributes: for every two attributes X and Y, one from each table, compute the similarity as the Jaccard similarity between the sets of values in X and Y.

Discover matching attributes (greedy matching):
1. Let Z = ∅, U = all attributes in T, and V = all attributes in T'.
2. Find u ∈ U and v ∈ V such that they have a maximum (positive) similarity among all pairs of attributes from U and V.
3. Add the attribute pair (u, v) to Z, and remove u from U and v from V.
4. Repeat steps 2 and 3 until no more such pairs can be found.

Example: consider tables T = InvoiceStatus & T' = InvoiceTerm with
J(T.InvoiceID, T'.InvoiceID) = 0.75
J(T.InvoiceID, T'.TermType) = 0.2
J(T.StatusCode, T'.TermType) = 0.15

Page 18

Similarity-Based Representations III

Evaluate table similarity: the similarity between T and T' is

Sim(T, T') = Σ_{(u,v)∈Z} J(u, v) / max(|T|, |T'|)

where |T| is the number of attributes in T and Z is the set of matched attribute pairs.

Example: Sim(InvoiceStatus, InvoiceTerm) = (0.75 + 0.15)/2 = 0.45

iDisc's goal is not to build the best models (which typically do not exist), but to show that it can produce a better solution by building & combining many different (possibly imperfect) models.
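A minimal sketch putting the three steps together; the divisor max(|T|, |T'|) is reconstructed from the definition above (both example tables have two attributes, giving the division by 2):

```python
def jaccard(x, y):
    """Jaccard similarity between two sets of attribute values."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x or y) else 0.0

def greedy_match(T, T2):
    """Greedily pair the most similar remaining attributes of two tables.
    T and T2 map attribute names to their sets of values."""
    U, V, Z = dict(T), dict(T2), []
    while U and V:
        (u, v), s = max((((u, v), jaccard(U[u], V[v]))
                         for u in U for v in V), key=lambda p: p[1])
        if s <= 0:                 # stop when no positive similarity remains
            break
        Z.append(((u, v), s))
        del U[u], V[v]
    return Z

def table_sim(T, T2):
    """Sim(T, T') = sum of matched-pair similarities / max(|T|, |T'|)."""
    matched = greedy_match(T, T2)
    return sum(s for _, s in matched) / max(len(T), len(T2))
```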

Page 19

Base Clusterers

A base clusterer takes a database representation and discovers a preliminary clustering over the tables in the database.

iDisc first implements several generic clustering algorithms and then instantiates them with the database representations:

Generic algorithms:
Similarity-based
Linkage-based

Instantiation:
Vector-based representations
Graph-based representations
Similarity-based representations

Page 20

Generic Similarity-Based Algorithm I

SimClust(T, M, ClsrSim, Q) → C
Input:
T, a set of tables {T1, T2, ..., T|T|}
M, a similarity matrix for the tables in T
ClsrSim, a cluster similarity function
Q, a clustering quality metric
Output: C, a partition of the tables in T

1. Set up initial clusters: let i = 1 and C1 = {{T1}, {T2}, ..., {T|T|}}.
2. Repeat until |Ci| = 1:
   Evaluate the quality of Ci via Q.
   Evaluate the similarities of the clusters in Ci via ClsrSim.
   Find Cx, Cy ∈ Ci with a maximum similarity.
   Merge clusters Cx and Cy; i ← i + 1.
3. Return the Ci with a maximum Q value.
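A compact Python rendering of SIMCLUST (a sketch, not the paper's code); M is assumed to be an indexable table-similarity matrix, and single_link is one possible ClsrSim:

```python
def sim_clust(tables, M, clsr_sim, quality):
    """Agglomerative SIMCLUST sketch: start from singleton clusters,
    repeatedly merge the two most similar clusters, and return the
    clustering with the best quality Q seen along the way."""
    clustering = [[i] for i in range(len(tables))]          # C1: singletons
    best, best_q = [list(c) for c in clustering], quality(clustering, M)
    while len(clustering) > 1:
        # find the pair of clusters Cx, Cy with maximum similarity
        x, y = max(((a, b) for a in range(len(clustering))
                    for b in range(a + 1, len(clustering))),
                   key=lambda p: clsr_sim(M, clustering[p[0]], clustering[p[1]]))
        merged = clustering[x] + clustering[y]              # merge Cx and Cy
        clustering = [c for k, c in enumerate(clustering) if k not in (x, y)]
        clustering.append(merged)
        q = quality(clustering, M)
        if q > best_q:
            best, best_q = [list(c) for c in clustering], q
    return [[tables[i] for i in c] for c in best]

def single_link(M, cx, cy):
    """Single-link ClsrSim: similarity of the two closest members."""
    return max(M[i][j] for i in cx for j in cy)
```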

Page 21

Generic Similarity-Based Algorithm II

ClsrSim is a cluster similarity function which takes the similarity matrix M and two clusters of tables, Cx and Cy, and computes a similarity value between Cx and Cy.

Implementations of ClsrSim:
Single-link: merge in each step the two clusters whose two closest members have the smallest distance (i.e., the two clusters with the smallest minimum pairwise distance).
Complete-link: merge in each step the two clusters whose merger has the smallest diameter (i.e., the two clusters with the smallest maximum pairwise distance).
Average-link: a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains.

Page 22

Generic Similarity-Based Algorithm III

Q is a metric for evaluating the quality of clusterings. Candidates include the elbow criterion, gap statistics, and cross-validation, but there is no single best solution. The default metric is:

Q(C) = Σ_{Ci∈C} |Ci|/N ∗ (IntraSim(Ci) − InterSim(Ci))

where:
N is the total number of tables in the database.
|Ci| is the number of tables in cluster Ci ∈ C.
IntraSim(Ci) is the average similarity of the tables within the cluster Ci.
InterSim(Ci) is the maximum similarity of Ci with any other cluster in C.
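A sketch of this default metric; computing InterSim single-link style and scoring singleton clusters with IntraSim = 1.0 are assumptions:

```python
def default_q(clustering, M):
    """Q(C) = sum over Ci of |Ci|/N * (IntraSim(Ci) - InterSim(Ci))."""
    n = sum(len(c) for c in clustering)                     # N: total tables
    total = 0.0
    for ci in clustering:
        pairs = [(a, b) for a in ci for b in ci if a < b]
        # average within-cluster similarity (1.0 for singletons, by assumption)
        intra = sum(M[a][b] for a, b in pairs) / len(pairs) if pairs else 1.0
        # maximum similarity of Ci with any other cluster (single-link style)
        inter = max((M[a][b] for cj in clustering if cj is not ci
                     for a in ci for b in cj), default=0.0)
        total += len(ci) / n * (intra - inter)
    return total
```

With the earlier sketch, the slides' instantiation corresponds to sim_clust(tables, M, single_link, default_q).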

Page 23

Generic Linkage-Based Algorithm

LinkClust(T, G, EdgeDel, Q') → C
Input:
T, a set of tables
G, a linkage graph for the tables in T
EdgeDel, a function that suggests edges to be removed
Q', a clustering quality metric
Output: C, a partition of the tables in T

1. Let i = 1.
2. Repeat until G has no edges:
   Let Ci = the connected components of G.
   Evaluate the quality of Ci via Q'.
   Let Ec = EdgeDel(G) and remove the edges in Ec from G.
   i ← i + 1.
3. Return the Ci with a maximum Q' value.
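A networkx sketch of LINKCLUST; sp_edge_del implements the shortest-path-betweenness EdgeDel strategy described on the next slide:

```python
import networkx as nx

def link_clust(G0, edge_del, quality):
    """Divisive LINKCLUST sketch: repeatedly remove the suggested edges
    and keep the connected-component partition with the best Q' score."""
    G = G0.to_undirected()                  # ignore edge directions
    best, best_q = None, float("-inf")
    while G.number_of_edges() > 0:
        parts = [set(c) for c in nx.connected_components(G)]
        q = quality(parts, G0)              # evaluate Q' against the original graph
        if q > best_q:
            best, best_q = parts, q
        G.remove_edges_from(edge_del(G))    # delete the suggested edge-cut
    return best

def sp_edge_del(G):
    """SP strategy: return the edge with the maximum betweenness beta(e)."""
    beta = nx.edge_betweenness_centrality(G)
    return [max(beta, key=beta.get)]
```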

Page 24

Shortest-Path Betweenness (SP)
First find the shortest paths between vertices, then measure the betweenness β(e) of an edge by the fraction of the shortest paths that contain the edge:

β(e) = Σ_{s,t∈V, s≠t} σst(e) / σst

where σst is the number of distinct shortest paths between vertices s and t, and σst(e) is the number of distinct shortest paths between s and t that contain the edge e. EdgeDel(G) then returns an edge with a maximum β value.

Spectral Graph Partitioning (SPC)
EdgeDel returns an edge-cut of G, which comprises a set of edges that likely lie between two clusters. Consider G's Laplacian matrix LG = DG − AG, where:
DG is a diagonal matrix whose entry D[i, i] is the degree of the i-th vertex in G.
AG is G's adjacency matrix.

Page 25

Then it can be shown that finding a minimum edge-cut of G corresponds to finding the smallest positive eigenvalue λ2 of LG.

The eigenvector for λ2 suggests a possible bi-partitioning of the vertices in G, where the vertices with positive values are placed in one cluster and the vertices with negative values in the other cluster.
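A numpy sketch of this spectral bisection, given a (dense) adjacency matrix:

```python
import numpy as np

def spectral_bisect(A):
    """Split vertices by the sign of the eigenvector for the
    second-smallest Laplacian eigenvalue (lambda_2)."""
    D = np.diag(A.sum(axis=1))        # degree matrix D_G
    L = D - A                         # Laplacian L_G = D_G - A_G
    vals, vecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    fiedler = vecs[:, 1]              # eigenvector for lambda_2
    left = np.where(fiedler >= 0)[0]  # vertices with non-negative entries
    right = np.where(fiedler < 0)[0]  # vertices with negative entries
    return left, right
```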

Page 26

Metric Q'
Q' measures the quality of the clusterings in LINKCLUST. It captures the intuition that in a good partition of the network, the nodes within a community are well-connected while only a few edges connect different communities:

Q'(C) = Σ_{Ci∈C} [ |Eii|/|E| − (|Ei|/|E|)² ]

where:
|E| is the total number of edges in the graph.
|Eii| is the number of edges connecting two vertices both in Ci.
|Ei| is the number of edges incident to at least one vertex in Ci.
|Eii|/|E| is the observed probability that an edge falls into cluster Ci.
(|Ei|/|E|)² is the expected probability under the assumption that the connections between the vertices are random.
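A direct transcription of Q' (the observed-minus-expected combination is reconstructed from the quantities just defined); the signature matches the quality argument of the link_clust sketch above, so the pair can be used as link_clust(G, sp_edge_del, default_q_prime):

```python
def default_q_prime(parts, G0):
    """Q'(C) = sum over Ci of |Eii|/|E| - (|Ei|/|E|)^2."""
    E = G0.number_of_edges()
    q = 0.0
    for ci in parts:
        eii = sum(1 for u, v in G0.edges() if u in ci and v in ci)  # |Eii|
        ei = sum(1 for u, v in G0.edges() if u in ci or v in ci)    # |Ei|
        q += eii / E - (ei / E) ** 2
    return q
```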

Page 27

Generating Base Clusters

For graph-based representations, iDisc generates base clusterers by instantiating LINKCLUST. If the input is a directed graph, it first transforms it into an undirected graph by ignoring the direction of the edges.

For example: LinkClust(T, G, SP, default_Q')

Page 28

Generating Base Clusters

For vector-based representations, iDisc generates base clusterers by instantiating SIM-CLUST. Consider a database D with tables T = {T1, T2, ..., T|T|} and denote the token vector for table Ti as T̂i.

For every two tables, we evaluate the similarity by a variety of methods, e.g., the cosine function:

cos(T̂i, T̂j) = (T̂i · T̂j) / (‖T̂i‖ ‖T̂j‖)

For example: SimClust(T, M, single-link, default_Q)

For similarity-based representations, iDisc likewise generates base clusterers by instantiating SIM-CLUST; the difference is that the similarity matrix in the representation is directly used for the instantiation.

Page 29

Meta-Clusterer I

Given a set of preliminary clusterings C from the base clusterers, the goal of the meta-clusterer is to find a clustering C' that agrees with the clusterings in C as much as possible. The disagreement between C and C' is denoted d(C, C').

Example base clusterers:
B1: vector-based representation, complete-link
B2: vector-based representation, single-link
B3: linkage-based representation

Page 30

Meta-Clusterer II

The problem of finding the best aggregated clustering can be shown to be NP-complete, and most approximation algorithms are based on a majority-vote scheme.

iDisc's meta-clustering algorithm is also based on the voting scheme, but with a key difference: it does not assume an explicit number of clusters. Instead, the algorithm automatically determines an appropriate number of clusters in the aggregated clustering based on the particular votes from the input clusterers.

Page 31

Meta-Clusterer III

The algorithm has two phases: vote-based similarity evaluation and re-clustering.

Vote-based similarity evaluation:
Consider two tables T, T' and a clustering Ci. A vote V_{T,T'}(Ci) takes the value 1 if T and T' are placed in the same cluster of Ci, and 0 otherwise.

Based on the votes from the base clusterers, the vote-based similarity between two tables T, T' ∈ T is computed as:

sim_v(T, T') = 1/m · Σ_{i=1..m} V_{T,T'}(Ci)

where m is the number of base clusterers.
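A sketch of the vote-based similarity matrix Mv; tables are indexed 0..n-1 and each base clustering is given as a list of clusters of table indices:

```python
import numpy as np

def vote_similarity(clusterings, n_tables):
    """Mv[t, t'] = fraction of the m base clusterings that place
    tables t and t' in the same cluster."""
    m = len(clusterings)
    Mv = np.zeros((n_tables, n_tables))
    for clustering in clusterings:
        for cluster in clustering:
            for t in cluster:
                for t2 in cluster:
                    Mv[t, t2] += 1.0 / m   # a vote of 1 from this clusterer
    return Mv
```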

Page 32

Meta-Clusterer IV

Re-clustering: a similarity matrix Mv is constructed from the previous step, and iDisc generates the meta-clustering as

SimClust(T, Mv, single-link, default_Q)

Prior research, however, focused mostly on combining different clustering algorithms (single-link vs. complete-link) and not different representation models.

Page 33

Handling Complex Aggregations I

The meta-clusterer has to identify and remove the errors in the input clusterers and combine the strengths of different clusterers to produce better clusters.

All input clusterers are treated as equally good by the meta-clusterer. However, the performance of clusterers may vary greatly depending on the characteristics of the particular data set, i.e., the same clusterer might perform well on one data set but poorly on another.

Thus, we dynamically adjust the weights of the clusterers so that the better-performing clusterers are weighted more.

Page 34

Handling Complex Aggregations II

We use a clusterer boosting approach which first estimates the performance of a clusterer by comparing it to other clusterers which are likely to be accurate. The results from the clusterers are then re-aggregated based on the new weights.

Clusterer boosting involves the following steps:
1. Determining a pseudo-solution.
2. Ranking the input clusterers.
3. Adjusting the weights.

Page 35

Handling Complex Aggregations III

Aggregation tree H: the level of aggregation is the depth of the deepest internal node in H.

Single-level clustering: the base clusterings are aggregated at once by a single meta-clusterer.

Multi-level clustering: multiple meta-clusterers, some of which take as input the aggregated clusterings from previous meta-clusterers.

Page 36

Handling Complex Aggregations IV

Similarity levels:
1. Clusterers which use the same representation (e.g., a vector-based representation) but employ different clustering algorithms (e.g., single-link vs. complete-link versions of the similarity-based algorithm).
2. Clusterers which use the same kind of representation (e.g., a vector-based representation constructed from table names vs. a vector-based representation constructed from both table & attribute names).
3. Clusterers which use different kinds of representations (e.g., a vector-based vs. a graph-based representation).

Furthermore, if one of the clusterers is a meta-clusterer, their similarity level is given by the least similarity level among all its base clusterers.

Page 37

Handling Complex Aggregations V

Tree construction (see the sketch below):
1. Initialize a set W of current clusterers with all the base clusterers.
2. Determine the maximum similarity level l among all the clusterers in W.
3. Find the set S of all clusterers with similarity level l.
4. Aggregate the clusterers in S using a meta-clusterer M, remove them from W, and add M into W.
5. Repeat steps 2-4 until there is only one clusterer left in W, which is the root meta-clusterer.
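One hedged reading of these five steps; it assumes level(a, b) encodes the similarity level of two clusterers with larger values meaning more similar, and that a meta-clusterer's level is the least level among its inputs:

```python
def build_aggregation_tree(base_clusterers, level):
    """Aggregation-tree construction sketch (an interpretation, not the
    paper's code).  Returns the root meta-clusterer as a nested tuple."""
    W = list(base_clusterers)
    while len(W) > 1:
        # step 2: the maximum similarity level among all pairs in W
        l = max(level(a, b) for i, a in enumerate(W) for b in W[i + 1:])
        # step 3: all clusterers participating in some pair at level l
        S = [a for a in W if any(b is not a and level(a, b) == l for b in W)]
        # step 4: aggregate S with a meta-clusterer and put it back into W
        meta = ("meta", tuple(S))
        W = [c for c in W if c not in S] + [meta]
    return W[0]  # step 5: the root meta-clusterer
```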

Page 38

Finding Cluster Representatives I

There may be a large number of tables on the same topic, so we need to identify the important tables. These tables are the cluster representatives: they serve as entry points to a cluster and give users a general idea of what the cluster is about.

iDisc's Representative Finder discovers representative tables on the basis of their importance.

Observation: a table that is important should be at a focal point in the linkage graph for the cluster. Hence, iDisc measures the importance of a table by its centrality in the linkage graph.

Page 39

Finding Cluster Representatives II

Given a linkage graph G(V, E), the centrality of a vertex v ∈ V, denoted ζ(v), is computed as:

ζ(v) = Σ_{s,t∈V, s≠v≠t} σst(v) / σst

where:
σst is the number of distinct shortest paths between vertices s and t.
σst(v) is the number of shortest paths between s and t that pass through the vertex v.

Page 40

Finding Cluster Representatives III

Representative Discovery (REPDISC)
Input:
A clustering C = {C1, C2, ..., Ck} over database D.
The linkage graph G of D.
A desired number r of representative tables per cluster.
Output: the top r representative tables of each cluster Ci ∈ C.

For each cluster Ci:
1. Obtain the linkage graph GCi, the subgraph of G induced by the set of tables in Ci.
2. Evaluate the centrality scores of the tables.
3. Rank the tables in descending order of their centrality scores and return the top r tables in the ranked list.
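A networkx sketch of REPDISC, using the library's betweenness centrality for ζ:

```python
import networkx as nx

def rep_disc(clustering, G, r):
    """For each cluster, rank tables by centrality on the induced
    linkage subgraph and return the top r as representatives."""
    reps = {}
    for i, ci in enumerate(clustering):
        sub = G.subgraph(ci)                   # induced linkage graph G_Ci
        zeta = nx.betweenness_centrality(sub)  # centrality scores (steps 1-2)
        ranked = sorted(ci, key=lambda t: zeta[t], reverse=True)
        reps[i] = ranked[:r]                   # step 3: top r tables
    return reps
```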

Page 41

Finding Cluster Representatives IV

Complexity of REPDISC
Consider a cluster Ci ∈ C and denote the induced graph for Ci as G(Vr, Er), where Vr is the set of tables in Ci and Er is the set of linkage edges between the tables in Ci. The time to create the graph is O(|Vr| + |Er|).

1. For every two tables in Vr, determine if there is an edge between them. Suppose G is implemented with an adjacency matrix; this can be done in O(|Vr|²). Thus, the overall complexity of step 1 is O(|Vr|²).
2. The complexity can be shown to be O(|Vr| ∗ |Er|).
3. The complexity is O(|Vr|).

So the overall complexity for steps 1-3 is O(|Vr| ∗ |Er|), with the dominant factor being the time for step 2.

Page 42

Empirical Evaluation: Experiment Setup

Data sets: HR1 - engagement management, HR2 - skill development, HR3 - invoice tracking.

For each database, a domain expert determined:
The set of topics in the database.
Which topic each table in the database is about.
These were used as the gold standard for the experiments.

Performance metrics (see the sketch below):
Precision (P): the percentage of table pairs determined by iDisc to be on the same topic that are on the same topic according to the gold standard.
Recall (R): the percentage of table pairs determined by the domain expert to be on the same topic that are discovered by iDisc.
F-measure (F1): used when precision P and recall R are equally weighted, i.e., F1 = 2PR/(R + P).
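A sketch of these pair-counting metrics, taking iDisc's clustering and the gold standard as lists of table clusters:

```python
from itertools import combinations

def same_topic_pairs(clustering):
    """All unordered table pairs placed in the same cluster."""
    return {frozenset(p) for c in clustering for p in combinations(c, 2)}

def precision_recall_f1(predicted, gold):
    pred, truth = same_topic_pairs(predicted), same_topic_pairs(gold)
    p = len(pred & truth) / len(pred)    # precision
    r = len(pred & truth) / len(truth)   # recall
    return p, r, 2 * p * r / (p + r)     # F1 = 2PR/(P + R)
```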

Page 43

Empirical Evaluation: Experiment Setup

Experiments:
The utility of the various database representations and the accuracy of the individual base clusterers.
The aggregation accuracy of the baseline meta-clustering algorithm.
The impact of the proposed complex aggregation techniques.

For all the base clusterers and meta-clusterers, the default Q and Q' were employed. Vector-based representations were constructed from table & attribute names, and the cosine function was employed for computing vector similarities. Since the databases contain a huge number of rows, a sample of 4K values per attribute was created and used for discovering foreign keys and attribute matches.

Page 44

Results & Observations

Base clusterers employing a complete-link algorithm (CL) tend to have higher precision and lower recall than ones based on a single-link algorithm (SL): CL-based clusterers typically produce a large number of small clusters, while SL-based ones produce a small number of large clusters.

The precision of base clusterers using graph-based representations is relatively low in HR1 & HR3.

Page 45

The base clusterers utilizing vector-based representations perform consistently well over all three databases. This is due to the fact that similar tables in these databases tend to have many common words, e.g., Emp, Emp Resume, and Emp Photo.

Base clusterers utilizing similarity-based representations performed poorly on HR2, since a large number of tables in HR2 have several similar timestamp-like columns, e.g., create_dt and del_dt for table creation and deletion datetimes. Thus many tables are falsely determined to be similar to each other.

Page 46

Meta-Clusterers
Observations:
The effects of "bad" base clusterers can be cancelled out: in HR1, the precision of Meta-Vec (87.6%) is much higher than that of Vec-SL (44.8%).
Meta-All is far more accurate than Vec-SL, Graph-SP, and Graph-SPC, and its F1 is higher than that of all the base clusterers.

Page 47

Number of Topics
The numbers of topics were compared: (a) plots the total number of topics versus the databases; (b) plots the total number of topics with at least two tables versus the databases.

Observation: the number of topics discovered by iDisc is very close to the numbers given by the gold standard.

Page 48

Empirical Evaluation: Discussion

iDisc may disagree with the domain expert on the granularity of partitioning and the number of subject areas in a database.

iDisc and the domain expert may also disagree on the assignment of the tables to the clusters, particularly for those “boundary” tables that connect several related entities.

Some databases in our experiments contain reference tables such as country (with attributes like name, region and ISO code) and language (with attributes like name and code). These tables are often referred to from multiple subject areas.

Page 49

Related Work

Mining database structures: Bellman discovers join relationships (join-able attributes, i.e., attributes which are semantically similar) among the tables in a database. Similar attributes are found using set-resemblance functions similar to the Jaccard function. However, two tables connected via a join relationship may not be on the same topic; a goal of our work is to identify such inter-topic links and partition the tables accordingly.

Data modeling products like Erwin & RDA facilitate modular development during a top-down modeling process. They enable users to create and maintain databases, streamline the design process, and synchronize the model with the database design. Our solution complements these functions by enabling users to reverse-engineer subject areas from a large-scale physical database during a bottom-up modeling process.

Page 50

Related Work

Information integration & complexity issues: a key problem today. Fragment-oriented approaches match large schemas so as to reduce the matching complexity: they decompose a large schema into several sub-schemas or fragments (either an XML schema segment, a relational table, or a manually specified fragment) and perform fragment-wise matching. Our work provides an automatic approach to partitioning a large schema into semantically meaningful fragments.

Multi-strategy learning & clustering aggregation: LSD (a system that matches source schemas against a mediated schema for data integration) employs a set of base learners, and the predictions from these base learners are combined via a meta-learner. LSD requires training data; in contrast, the base clusterers and meta-clusterers in iDisc do not require training, i.e., iDisc is unsupervised.

Page 51

Future Work

Two directions to extend iDisc:
Develop soft-clustering and meta-clustering techniques and incorporate them in iDisc to examine their impact on its performance.
Extend iDisc to produce a hierarchical topical structure, where each topic may be further divided into sub-topics.

Advantages:
Enables directory-style semantic browsing.
Further supports the divide-and-conquer approach to schema matching.
Reduces the complexity of large-scale integration.

Page 52

Conclusions

iDisc is unique in that:
It examines the database from varied perspectives to construct multiple representations.
It employs a multi-strategy framework to effectively combine evidence through meta-clustering.
It employs novel multi-level aggregation and clusterer boosting techniques to handle complex aggregations.
It employs a novel measure of table importance to effectively discover cluster representatives.

Experiments over several large real-world databases indicate that iDisc is highly effective, with an accuracy of up to 87%.

Page 53

Thank you