

Minor Thesis

A scalable schema matching framework for relational databases

Student: Ahmed Saimon Adam
ID: 110022478
Award: MSc (Computer & Information Science)
Date: 17th September 2010
Supervisor: Dr. Jixue Liu

Field of thesis
•Schema matching
•Relational database integration

INTRODUCTION

•What is a database schema?
▫The structure of a database, describing how its concepts, their relationships and constraints are arranged
•What is schema matching?
▫The process of identifying semantic correspondences between elements of database schemas


Schema matching applications
▫Critical task in any data-sharing process
▫Data warehousing
  Consolidation of multiple transaction processing databases
▫Database integration processes
  E.g. two companies merge and integrate their employee, inventory and financial databases
▫Cooperation between government agencies and various institutions
  E.g. police and transport departments, immigration and universities

Importance of the research
•Currently done manually or semi-automatically
•Doing it manually is tedious, error-prone and costly
•No fully automatic system is available; existing systems require user interaction
•Semantic query processing, mobile web, e-commerce collaboration in enterprises
•Demand for more scalable, accurate and efficient schema matching technology is increasing

Research objectives
•Propose a framework that
▫adopts a scalable architecture
▫offers a library of schema matching algorithms that exploit various information for better accuracy
▫is independent of any specific application domain

Methodology
•Build a framework by adopting a composite architecture
•Create a library of matchers at different levels
•Build a prototype and perform empirical evaluation on it to test accuracy, scalability and efficiency

Schema Matching Architecture
•Input
▫Represented in SQL DDL format

…..
CREATE TABLE StudentDB.Student(
    studentId INT,
    studentName VARCHAR(100),
    studentPhone VARCHAR(50),
    PRIMARY KEY (studentId)
);
…..

Schema Matching Architecture
•Input
▫Currently supports versions after Oracle9 and SQL Server 2000
  Uses a data type conversion table if the schemas come from different DBMSs
▫Input processor extracts schema information
  E.g. element names, data types, keys
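
The input step above could be sketched roughly as follows. This is only an illustration in Python, not the thesis prototype (which was written in C# .NET): the names `TYPE_MAP`, `split_top_level` and `parse_create_table` are hypothetical, the conversion table entries are example mappings chosen for the sketch, and the parser assumes a single simple `CREATE TABLE` statement with column definitions and an optional `PRIMARY KEY` clause.

```python
import re

# Hypothetical data type conversion table (vendor type -> common type).
# The entries are illustrative assumptions, not the table used in the thesis.
TYPE_MAP = {
    "VARCHAR2": "VARCHAR",   # Oracle
    "NVARCHAR": "VARCHAR",   # SQL Server
    "NUMBER": "NUMERIC",     # Oracle
    "INTEGER": "INT",
}

TABLE_RE = re.compile(r"CREATE\s+TABLE\s+([\w.]+)\s*\((.*)\)\s*;",
                      re.IGNORECASE | re.DOTALL)
COLUMN_RE = re.compile(r"^(\w+)\s+(\w+)(?:\s*\(\s*(\d+)[^)]*\))?")
PK_RE = re.compile(r"^PRIMARY\s+KEY\s*\(([^)]*)\)", re.IGNORECASE)


def split_top_level(body):
    """Split a column list on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_create_table(ddl):
    """Extract table name, columns (name, type, length) and primary key."""
    match = TABLE_RE.search(ddl)
    if match is None:
        raise ValueError("no CREATE TABLE statement found")
    table = {"name": match.group(1), "columns": [], "primary_key": []}
    for item in split_top_level(match.group(2)):
        pk = PK_RE.match(item)
        if pk:  # table-level key clause, not a column
            table["primary_key"] = [c.strip() for c in pk.group(1).split(",")]
            continue
        col = COLUMN_RE.match(item)
        if col:
            name, raw_type, length = col.groups()
            raw_type = raw_type.upper()
            table["columns"].append({
                "name": name,
                "type": TYPE_MAP.get(raw_type, raw_type),  # normalize across DBMSs
                "length": int(length) if length else None,
            })
    return table
```

Other table-level clauses (foreign keys, checks) are not handled in this sketch and would need their own patterns.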

Schema Matching Architecture
•Process (schema matching)
▫Implements multiple matching algorithms (matchers)
•Schema level
▫Element name similarity algorithms
  Prefix, suffix, n-gram
  Tech = Technology (prefix matching)
  Phone = telephone (suffix matching)
  Context: Con, ont, nte, tex, ext (n-gram)
▫Structural similarities
  Data type, field length, etc.
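
The three name matchers listed above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the slides do not say how n-gram overlap is turned into a score; the Dice coefficient used here is an assumption made for the example.

```python
def prefix_match(a, b):
    """True if the shorter name is a prefix of the longer one (case-insensitive)."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    return bool(short) and long_.startswith(short)


def suffix_match(a, b):
    """True if the shorter name is a suffix of the longer one (case-insensitive)."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    return bool(short) and long_.endswith(short)


def ngrams(name, n=3):
    """All overlapping character n-grams of a name, lower-cased."""
    name = name.lower()
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def ngram_similarity(a, b, n=3):
    """Dice coefficient over the n-gram sets (assumed scoring), in [0, 1]."""
    grams_a, grams_b = set(ngrams(a, n)), set(ngrams(b, n))
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

With these definitions, `prefix_match("Tech", "Technology")` and `suffix_match("Phone", "telephone")` both hold, and `ngrams("Context")` yields the five trigrams from the slide in lower case.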

Schema Matching Architecture
•Instance level
▫Statistical data
  Statistics obtained: e.g. range, % of alphanumeric characters, statistical properties (e.g. mean, std. dev.), distinct values, etc.
▫Discovering complex correspondences
  Mining actual values
  Match different data types (gender: M, F = 1, 2)
  Ambiguity issues: Jaguar (car or animal)?
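
As a rough illustration of the instance-level statistics above, the sketch below profiles a column's values and compares two profiles. The names `profile_column` and `profile_similarity`, the choice of statistics, and the averaging formula are all assumptions for the example rather than the thesis's actual method.

```python
import statistics


def profile_column(values):
    """Summarize a column by simple statistics over its values as text."""
    texts = [str(v) for v in values if v is not None]
    lengths = [len(t) for t in texts]
    total_chars = sum(lengths)
    alnum_chars = sum(ch.isalnum() for t in texts for ch in t)
    return {
        "count": len(texts),
        "distinct": len(set(texts)),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "mean_length": statistics.mean(lengths) if lengths else 0.0,
        "stdev_length": statistics.pstdev(lengths) if lengths else 0.0,
        "alnum_ratio": alnum_chars / total_chars if total_chars else 0.0,
    }


def profile_similarity(p, q):
    """Average per-statistic closeness of two profiles, in [0, 1] (assumed formula)."""
    keys = ("distinct", "mean_length", "stdev_length", "alnum_ratio")
    scores = []
    for key in keys:
        a, b = p[key], q[key]
        top = max(a, b)
        # Both values zero means the statistic agrees exactly.
        scores.append(1.0 if top == 0 else 1.0 - abs(a - b) / top)
    return sum(scores) / len(scores)
```

Under these assumptions, a gender column stored as `M`/`F` and one stored as `1`/`2` produce matching profiles even though their data types differ, which is the kind of correspondence the slide describes. Statistics alone cannot resolve ambiguities such as Jaguar (car or animal).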

Schema Matching Architecture
•Output
▫Similarity score between attributes obtained from each matching algorithm
  All scores normalized to the range 0 to 1
▫Match results stored in a similarity cube
  Attribute-level, table-level and schema-level similarities can be generated
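
One way to picture the similarity cube described above is as a mapping keyed by (matcher, source attribute, target attribute). The Python sketch below is an assumption-laden illustration: the cube layout, the function names, the clamping of scores to 0..1 as a safeguard, and the use of plain averaging to combine matchers and to roll attribute scores up to table level are not taken from the thesis.

```python
def build_cube(source_attrs, target_attrs, matchers):
    """Run every matcher on every attribute pair and store the score."""
    cube = {}
    for matcher_name, matcher in matchers.items():
        for s in source_attrs:
            for t in target_attrs:
                score = float(matcher(s, t))
                # Safeguard: keep each stored score within 0..1.
                cube[(matcher_name, s, t)] = min(1.0, max(0.0, score))
    return cube


def attribute_similarity(cube, s, t):
    """Combine all matcher scores for one attribute pair (assumed: mean)."""
    scores = [v for (_, a, b), v in cube.items() if a == s and b == t]
    return sum(scores) / len(scores) if scores else 0.0


def table_similarity(cube, source_attrs, target_attrs):
    """Assumed roll-up: mean of each source attribute's best match."""
    if not source_attrs or not target_attrs:
        return 0.0
    best = [
        max(attribute_similarity(cube, s, t) for t in target_attrs)
        for s in source_attrs
    ]
    return sum(best) / len(best)
```

Because each matcher writes to its own slice of the cube, a new matcher can be added by extending the `matchers` dictionary without touching the others. Schema-level similarity could be obtained by applying the same kind of roll-up over tables.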

Methodology

•Schema matching prototype in C# .NET

Experimental Evaluation

•Accuracy
▫Tested on 2 small schemas of 10 tables each with 2-10 attributes
▫Checked results against a manually derived result
▫Accuracy degrades as schema size increases
▫55-60% true matching
▫Tested on a schema with 140 tables and 1360 attributes
  20-40% true matching
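
Checking matcher output against a manually derived result, as described above, could look like the following Python sketch. The slides do not define "true matching" precisely, so reporting both precision and recall here is an assumption, as are the function name `evaluate_matches` and the representation of a match as a (source attribute, target attribute) pair.

```python
def evaluate_matches(predicted, reference):
    """Compare predicted attribute pairs with a manually derived reference set."""
    predicted, reference = set(predicted), set(reference)
    true_matches = predicted & reference
    return {
        "true_matches": len(true_matches),
        # Share of proposed matches that are correct.
        "precision": len(true_matches) / len(predicted) if predicted else 0.0,
        # Share of reference matches that were found.
        "recall": len(true_matches) / len(reference) if reference else 0.0,
    }
```

A call such as `evaluate_matches(found_pairs, manual_pairs)` would return the counts and ratios for one test run; `found_pairs` and `manual_pairs` are placeholder names for the two sets of pairs.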

Experimental Evaluation

•Efficiency
▫Drastic fall in efficiency as schema size increases

Conclusion
•A basic framework for schema matching is proposed
•Matching functions are performed independently for higher scalability, so that additional algorithms can be integrated easily
•Efficiency needs improvement, by deploying hybrid matching algorithms
•A variety of algorithms is required to assess similarities from different views and increase accuracy

END