conference svm classifier

Upload: adhiyaman-pownraj

Post on 03-Apr-2018


  • 7/28/2019 Conference SVM Classifier


    A Novel two-step Approach for Record Pair Classification in

    Record Linkage Process

    K. Gomathi

    PG Scholar, Department of CSE, Anna University Chennai

    Archana Institute of Technology, Krishnagiri

    [email protected]

    Abstract: The aim of record linkage is to match the records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of the candidate record pairs, generated during the indexing step, into matches, non-matches and possible matches. Traditional classification methods are based on manually set thresholds, which is a cumbersome and time-consuming process. In this paper we propose a two-step approach that classifies candidate record pairs automatically. In the first step, a training set is automatically selected from the compared candidate record pairs using a weight vector classifier. In the second step, a Support Vector Machine classifier is used to improve upon the training set selected in the first step. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.

    Keywords: Indexing, record linkage, training set, candidate record, classifier.

    I. INTRODUCTION

    The task of linking databases is an important step in an increasing number of data mining projects, such as fraud and crime detection, national security, and bioinformatics, because linked data contain information that is not otherwise available, or would require the time-consuming and expensive collection of specific data. The records to be matched frequently correspond to entities that refer to people, such as clients or customers, patients, employees, tax payers, students, or travelers. Record linkage is now commonly used to improve data quality and integrity, to allow re-use of existing data sources for new studies, and to reduce costs and efforts in data acquisition.

    1.1 Record Linkage Process

    Figure 1 shows the general steps involved in the linking of two databases. Most real data is dirty and contains noisy or incomplete information.

    Figure 1. Steps involved in the record linkage process: input datasets, cleaning (data standardization), indexing (sorted block indexing), comparison (edit distance), pair classification (the proposed novel approach), and evaluation.


    The main task of data cleaning and standardization is the conversion of the raw input data into well-defined, consistent forms, as well as the resolution of inconsistencies in the way information is represented and encoded. The second step, indexing, generates pairs of candidate records that are compared in detail in the comparison step, using a variety of comparison functions appropriate to the content of the record fields (attributes). The next step in the record linkage process is to classify the compared candidate record pairs into matches, non-matches, and possible matches, depending upon the decision model used. If record pairs are classified as possible matches, a clerical review process is required where these pairs are manually assessed and classified into matches or non-matches. This is usually a time-consuming, cumbersome and error-prone process, especially when large databases are being linked or deduplicated. Measuring and evaluating the quality and complexity of a record linkage project is the final step in the record linkage process.

    II INDEXING FOR RECORD LINKAGE

    The aim of the indexing step is to reduce the large number of potential comparisons by removing as many record pairs as possible that correspond to non-matches. Traditional record linkage has employed an indexing technique called blocking [2], which splits the database into non-overlapping blocks. A blocking criterion, called the blocking key, is used; it is a single record field (attribute) or the concatenation of values from several fields.

    Because real-world data is dirty and contains errors, an important criterion of a good blocking key is that it groups similar values into the same block. Similarity can refer to similar-sounding or similar-looking values based on phonetic characteristics. For strings containing personal names, phonetic similarity can be obtained by using phonetic encoding functions such as Soundex or Double Metaphone.

    TABLE 1
    Example records with surnames and the Soundex encodings used as the blocking key (BK)

    Identifier  Surname  BK (Soundex encoding)
    R1          Smith    S530
    R2          Miller   M460
    R3          Peters   P362
    R4          Myler    M460
    R5          Smyth    S530
    R6          Millar   M460
    R7          Smyth    S530
    R8          Miller   M460

    Table 1 illustrates a small data set with the Soundex encoding scheme. For example, for the block S530 in Table 1, the pairs (R1, R5), (R5, R7) and (R1, R7) are generated. These pairs are called candidate record pairs, and they are compared in the comparison step using various string comparison functions.
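As a sketch of how the blocking key of Table 1 can be produced, the following Python snippet implements a basic variant of the Soundex encoding (real toolkits may use slightly different rules) and groups records into blocks to generate candidate pairs:

```python
from itertools import combinations

def soundex(name):
    """Basic Soundex: first letter plus up to three digits from consonant classes."""
    mapping = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            mapping[ch] = digit
    name = name.lower()
    code = name[0].upper()
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":            # h and w do not separate duplicate codes
            continue
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit              # vowels reset prev, separating duplicates
    return (code + "000")[:4]

def candidate_pairs(records):
    """Group record ids by the Soundex blocking key; pair up records per block."""
    blocks = {}
    for rid, surname in records:
        blocks.setdefault(soundex(surname), []).append(rid)
    return {bk: list(combinations(rids, 2)) for bk, rids in blocks.items()}

records = [("R1", "Smith"), ("R2", "Miller"), ("R3", "Peters"), ("R4", "Myler"),
           ("R5", "Smyth"), ("R6", "Millar"), ("R7", "Smyth"), ("R8", "Miller")]
pairs = candidate_pairs(records)
print(pairs["S530"])  # [('R1', 'R5'), ('R1', 'R7'), ('R5', 'R7')]
```

Running this on the records of Table 1 reproduces its blocking keys and the three candidate pairs of block S530.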

    2.1 Hamming Distance Comparison function

    The Hamming distance is used primarily for numerical fixed-size fields like zip code or SSN. It counts the number of position-wise mismatches between two values. For example, the Hamming distance between the zip codes 47905 and 46901 is 2, since the two values differ in two positions.
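The mismatch count described above can be sketched in a few lines of Python (the function name is illustrative; it assumes both values have the same fixed length, as with zip codes):

```python
def hamming_distance(a, b):
    """Count position-wise mismatches between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length values")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("47905", "46901"))  # 2 (the second and fifth digits differ)
```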

    2.2 Edit Distance Comparison function

    The edit distance between two strings is the minimum cost to convert one of them into the other by a sequence of character insertions, deletions and replacements. Each of these modifications is assigned a cost value; for insertion and deletion the cost is equal to 1, and replacement is assigned its own cost. The edit distance can be computed with the Smith-Waterman algorithm, which uses a dynamic programming technique. An edit distance based function is more accurate than the Hamming distance, since the Hamming distance cannot account for insertions and deletions when measuring the similarity between two strings.
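The section names the Smith-Waterman algorithm; as a minimal sketch, the closely related standard dynamic-programming recurrence for edit distance is shown below, with unit cost assumed for all three operations (the paper states unit costs only for insertion and deletion):

```python
def edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """Dynamic-programming edit distance; unit costs are assumed in this sketch."""
    m, n = len(s), len(t)
    # dp[i][j] = minimum cost to convert s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * del_cost
    for j in range(1, n + 1):
        dp[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j] + del_cost,   # delete s[i-1]
                           dp[i][j - 1] + ins_cost,   # insert t[j-1]
                           dp[i - 1][j - 1] + sub)    # match or replace
    return dp[m][n]

print(edit_distance("Smith", "Smyth"))  # 1 (one replacement: 'i' -> 'y')
```

Note that the Hamming distance of "Smiths" and "mithsX" would be 6, while the edit distance is only 2, which illustrates why edit distance handles insertions and deletions better.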


    III COMPARISON OF RECORDS USING

    WEIGHT VECTORS

    The two records in a candidate pair generated during indexing are compared using similarity functions applied to selected record attributes. These functions can range from exact string or numerical comparisons to functions that take typographical variations into account. There are also various approaches to learn such similarity functions from training data [5]. Each similarity function returns a numerical matching weight, usually normalized such that 1 corresponds to an exact match and 0 corresponds to no match. Similar, but not identical, values receive a matching weight somewhere between 0 and 1.

    TABLE 2
    Comparison of records using weight vectors

    Record  Name       Surname  Street No
    R1      Christine  Smith    42
    R2      Christina  Smith    42
    R3      Bob        OBrain   11
    R4      Robert     Byree    12

    WV(R1, R2): [0.9, 1.0, 1.0]
    WV(R1, R3): [0.0, 0.0, 0.0]
    WV(R1, R4): [0.0, 0.0, 0.5]
    WV(R2, R3): [0.0, 0.0, 0.0]
    WV(R2, R4): [0.0, 0.0, 0.5]
    WV(R3, R4): [0.7, 0.3, 0.5]

    As illustrated in the table, for each compared record pair a weight vector is formed that contains the matching weights calculated for that pair. Using these vectors, the pairs are classified as matches, possible matches, and non-matches, depending upon the decision model used.
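The construction of such weight vectors can be sketched as follows. The paper does not specify which similarity functions produce the weights in Table 2, so this sketch uses Python's difflib ratio as a stand-in; a real system might use Jaro-Winkler or an edit-distance based similarity instead:

```python
from difflib import SequenceMatcher

def sim(a, b):
    """Normalized similarity in [0, 1]: 1.0 for an exact match, lower otherwise.
    difflib's ratio is only a stand-in for the paper's similarity functions."""
    return round(SequenceMatcher(None, a, b).ratio(), 2)

def weight_vector(rec_a, rec_b):
    """Compare two records attribute by attribute into a weight vector."""
    return [sim(x, y) for x, y in zip(rec_a, rec_b)]

r1 = ("Christine", "Smith", "42")
r2 = ("Christina", "Smith", "42")
print(weight_vector(r1, r2))
```

For R1 and R2 this yields weights close to the [0.9, 1.0, 1.0] row of Table 2: the surnames and street numbers match exactly, while the given names are similar but not identical.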

    III TWO-STEP CLASSIFICATION

    The idea of the two-step record pair classification is based on the following assumptions. First, weight vectors that contain exact or high similarity values in all their matching weights were with high likelihood generated when two records that refer to the same entity were compared. Second, weight vectors that contain mostly low similarity values were with high likelihood generated when two records that refer to different entities were compared. As a result, selecting such weight vectors for generating training data in a first step, and training a classifier using these weight vectors in a second step, should enable automatic, efficient and accurate record pair classification.

    3.1 Step 1: Selection of Training Data

    The aim of the first step is to select, from the set W of all weight vectors generated in the comparison step, the vectors that correspond to true matches and true non-matches, and to insert them into the match training set WM and the non-match training set WN, respectively. There are two approaches to selecting the training sets: distance-threshold based and nearest-neighbourhood based. We select the nearest-neighbour based approach since it outperforms the distance-threshold approach.

    In this approach, weight vectors are sorted according to their Euclidean distances from the vector containing only exact similarities and from the vector containing only total dissimilarities; the vectors nearest to each of these are selected for the match and non-match training sets, respectively. An estimate of the ratio r of true matches to true non-matches can be calculated using the number of records in the two databases to be linked, A and B, and the number of weight vectors W:

    r = min(|A|, |B|) / (|W| - min(|A|, |B|)),

    where |.| denotes the number of elements in a set. The problem with a balanced training set is that weight vectors that likely do not correspond to true matches will be selected into WM.
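The ratio estimate can be sketched as follows, using min(|A|, |B|) as an upper bound on the number of true matches and treating the remaining weight vectors as non-matches; the resulting values reproduce the "Ratio r" column of Table 3:

```python
def match_ratio(num_a, num_b, num_weight_vectors):
    """Estimate the ratio r of true matches to true non-matches.
    min(|A|, |B|) bounds the possible true matches; the remaining
    weight vectors are assumed to be non-matches."""
    matches = min(num_a, num_b)
    return matches / (num_weight_vectors - matches)

# Deduplication of a single dataset: |A| = |B| = number of records.
for name, records, num_wv in [("DS-A", 1000, 2475), ("DS-B", 2500, 9878),
                              ("DS-C", 5000, 35491), ("DS-D", 10000, 132532)]:
    r = match_ratio(records, records, num_wv)
    print(f"{name}: r = 1/{1 / r:.2f}")
```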

    3.2 Step 2: Classification of Record Pairs

    Once the training sets for matches and non-matches have been generated, they can be used to train any binary classifier. In this paper the nearest-neighbourhood classifier and an iterative SVM classifier are investigated. In the following sections, the set of weight vectors not selected into the training sets is denoted by WU, with WU = W \ (WM U WN).

    3.2.1 Nearest Neighbor classification

    The basic idea of this classifier is to iteratively add unclassified weight vectors from WU into the training sets until all weight vectors are classified. In each iteration, the unclassified weight vector closest to k already classified weight vectors is classified according to the majority vote of its classified neighbours (i.e. whether the majority are matches or non-matches). Using training example sets, the nearest-neighbour based approach can be implemented efficiently, as shown in Figure 2.

    Figure 2: Example of the nearest neighbour classifier with 2-dimensional weight vectors and k=1.
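The iterative scheme above can be sketched in Python as follows. This is a minimal illustration on 2-dimensional weight vectors with k=1, as in Figure 2; the seed vectors and labels are hypothetical:

```python
import math

def nn_classify(seeds, unclassified, k=1):
    """Iteratively label the unclassified vector closest to the labelled set,
    by majority vote of its k nearest labelled neighbours."""
    labelled = list(seeds)          # list of (vector, label), label in {"M", "N"}
    remaining = list(unclassified)
    results = {}
    while remaining:
        # pick the unclassified vector with the smallest distance to any labelled one
        best = min(remaining,
                   key=lambda v: min(math.dist(v, w) for w, _ in labelled))
        neighbours = sorted(labelled, key=lambda lw: math.dist(best, lw[0]))[:k]
        votes = [lbl for _, lbl in neighbours]
        label = max(set(votes), key=votes.count)   # majority vote
        labelled.append((best, label))
        results[best] = label
        remaining.remove(best)
    return results

seeds = [((1.0, 1.0), "M"), ((0.0, 0.0), "N")]  # match / non-match seed vectors
labels = nn_classify(seeds, [(0.9, 0.8), (0.1, 0.2), (0.8, 0.9)])
print(labels)
```

Because newly labelled vectors immediately join the labelled set, clusters of similar weight vectors are absorbed step by step toward the nearest seed region.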

    3.2.2 Iterative SVM Classification

    The iterative SVM classifier first trains an initial SVM using the training example sets WM and WN, and then iteratively adds the strongest positive and negative vectors from WU into the training sets of subsequent SVMs.
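The iterative loop can be sketched as below. To keep the example self-contained, a tiny linear SVM trained by batch sub-gradient descent on the hinge loss stands in for a full SVM library; the seed vectors are hypothetical, and "strongest" here means the largest-magnitude decision score:

```python
def train_linear_svm(xs, ys, epochs=200, lam=0.01, lr=0.1):
    """Tiny linear SVM via batch sub-gradient descent on the regularized
    hinge loss; a stand-in for a real SVM library. ys must be +1/-1."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for x, y in zip(xs, ys):
            if y * (sum(wj * xj for wj, xj in zip(w, x)) + b) < 1:
                grad_w = [gj - y * xj for gj, xj in zip(grad_w, x)]
                grad_b -= y
        w = [wj - lr * gj / len(xs) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

def decision(w, b, x):
    """SVM score; positive leans toward 'match'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def iterative_svm(wm, wn, wu):
    """Train an SVM on the seed sets WM/WN, then repeatedly move the
    strongest positive and negative vectors from WU into the training
    sets of the next SVM, until WU is exhausted."""
    wm, wn, wu = list(wm), list(wn), list(wu)
    while wu:
        xs = wm + wn
        ys = [1] * len(wm) + [-1] * len(wn)
        w, b = train_linear_svm(xs, ys)
        ranked = sorted(wu, key=lambda v: decision(w, b, v))
        wn.append(ranked[0])        # strongest negative joins WN
        wu.remove(ranked[0])
        if wu:
            wm.append(ranked[-1])   # strongest positive joins WM
            wu.remove(ranked[-1])
    return wm, wn

seed_m = [(1.0, 1.0), (0.9, 1.0)]   # weight vectors assumed matches
seed_n = [(0.0, 0.0), (0.1, 0.0)]   # weight vectors assumed non-matches
final_m, final_n = iterative_svm(seed_m, seed_n, [(1.0, 0.9), (0.0, 0.1)])
print(final_m, final_n)
```

Retraining on the growing sets lets each subsequent SVM refine its decision boundary with the most confidently classified vectors first.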

    3.3 Measurement Tools

    The following subsections introduce the metrics using the following notation. Let nM and nU be the total number of matched and non-matched record pairs in the entire dataset. Let s be the size of the reduced comparison space generated by the searching method, and let sM and sU be the total number of matched and non-matched record pairs in the reduced comparison space. Finally, let c(a,d) be the number of record pairs whose actual matching status is a and whose predicted matching status is d, where a is either M or U, and d is either M, U or P, with M, U and P representing matched, unmatched and possibly matched respectively.

    3.3.1 Reduction Ratio

    The reduction ratio metric is defined as RR = 1 - s/(nM + nU). It measures the relative reduction in the size of the comparison space accomplished by a searching method.

    3.3.2 Pairs Completeness

    A searching method can be evaluated based on the number of actual matched record pairs contained in its reduced comparison space. We define the pairs completeness metric as the ratio of the matched record pairs found in the reduced comparison space to the total number of matched pairs in the entire comparison space. Formally, the pairs completeness metric is defined as PC = sM/nM.

    3.3.3 Accuracy

    The accuracy metric tests how accurate a decision model is. The accuracy of a decision model is defined as the percentage of correctly classified record pairs. Formally, the accuracy metric can be defined as AC = (c(M,M) + c(U,U))/s.

    IV EXPERIMENTAL EVALUATION

    The two record pair classifiers presented above were evaluated and compared with two other classifiers. The first classifier has access to the true match status of all weight vectors; nine parameter variations were evaluated. The second method is based on a hybrid approach implemented in the TAILOR toolbox [6]. It first employs k-means clustering and then uses the match and non-match clusters to train an SVM. The Euclidean distance function was evaluated for the k-means step, while for the classifier step nine parameter variations were used.

    TABLE 3

    Datasets used in experiments

    Dataset  Number of records  PC     RR     No. of weight vectors  Ratio r
    DS-A     1,000              0.957  0.995  2,475                  1/1.48
    DS-B     2,500              0.940  0.997  9,878                  1/2.95
    DS-C     5,000              0.953  0.997  35,491                 1/6.10
    DS-D     10,000             0.948  0.997  132,532                1/12.25

    Experiments were conducted using synthetic data, as summarized in Table 3. Four synthetic datasets of various sizes were created using the Febrl dataset generator. This synthetic data contains name and address attributes that are based on real-world frequency tables, and includes 60% original records and 40% duplicate records. The duplicates were generated by randomly modifying record attributes. All classifiers are implemented using the Febrl [3] record linkage system.



    TABLE 4

    Quality of the nearest-neighbourhood classifier

    Seed size  DS-A    DS-B    DS-C    DS-D
    1%         100%    99.0%   100%    100%
    5%         96.7%   98.4%   99.8%   99.8%
    10%        95.5%   98.3%   99.5%   99.7%

    Table 4 shows the quality of the seed training sets generated in the first step of the proposed classification approach, given as the percentage of correctly selected vectors.

    4.1 Results and Discussion

    As seen in Table 4, the training example sets selected in the first step of the two-step classification approach are mostly of very high quality with respect to the match training set. While the 1% training set selection contains the highest percentage of correctly selected vectors, its size is very small, so classifiers based on the 1% seed perform worse than those based on the 5% and 10% seeds. Therefore the 1% classifier is not included in the F-score results presented in Figure 3.

    Figure 3.1 F-score measure for DS-B (2,500 records)

    The F-score measure for dataset DS-A is not included, since with a 1% seed size the number of records is too small. Weight vectors generated when attribute values with a mixed distribution of matches and non-matches are compared are hard to classify without knowing the true match status of these vectors. This can be seen in Table 4, where with a 10% seed size the training set contains 9% non-matches.

    Figure 3.2 F-score measure for DS-C (5,000 records)

    Figure 3.3 F-score measure for DS-D (10,000 records)

    The F-score results for the synthetic datasets are shown in Figure 3. As expected, the supervised SVM classifier performs best on all datasets. The TAILOR classifier has the lowest performance on most of the datasets. The nearest-neighbour based classifier performs better than the iterative SVM for all synthetic datasets. These experiments showed the limitation of unsupervised classification such as the TAILOR hybrid approach, which works based only on pair-wise attribute similarities, compared to supervised classification.

    [Figure 3 bar charts: F-score (0 to 1) per dataset for the classifiers SVM, TAILOR, nearest based, SVM 0-0 and SVM 25-50.]


    V CONCLUSION AND FUTURE WORK

    This paper presented a novel unsupervised two-step approach to record pair classification that is aimed at automating the record linkage process. The approach combines automatic selection of training data with supervised classification of the remaining record pairs. The two classifiers, nearest-neighbour based and iterative SVM, achieve improved record pair classification results compared with other classifiers such as the TAILOR approach.

    Future work will include conducting more experiments using different datasets, including runtime tests on datasets of various sizes, in order to experimentally obtain scalability results. The overall efficiency of the proposed classifier can be further improved using data reduction and fast searching and indexing techniques.

    REFERENCES

    [1] P. Christen, "A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication," IEEE Trans. Knowledge and Data Eng.

    [2] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, "Duplicate Record Detection: A Survey," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 1, pp. 1-16, Jan. 2007.

    [3] A. Aizawa and K. Oyama, "A Fast Linkage Detection Scheme for Multi-Source Information Integration," Proc. Int'l Workshop Challenges in Web Information Retrieval and Integration (WIRI '05), 2005.

    [4] M.A. Hernandez and S.J. Stolfo, "Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem," Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 9-37, 1998.

    [5] P. Christen, "Febrl: An Open Source Data Cleaning, Deduplication and Record Linkage System With a Graphical User Interface," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 1065-1068, 2008.

    [6] P. Christen, "Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 151-159, 2008.

    [7] R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage," Proc. ACM Workshop Data Cleaning, Record Linkage and Object Consolidation (SIGKDD '03), pp. 25-27, 2003.

    [8] S. Sarawagi and A. Bhamidipaty, "Interactive Deduplication Using Active Learning," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.

    [9] S. Tejada, C.A. Knoblock, and S. Minton, "Learning Domain Independent String Transformation Weights for High Accuracy Object Identification," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.

    [10] T. Churches, P. Christen, K. Lim, and J.X. Zhu, "Preparation of Name and Address Data for Record Linkage Using Hidden Markov Models," BioMed Central Medical Informatics and Decision Making, vol. 2, no. 9, 2002.