A Novel Two-Step Approach for Record Pair Classification in the
Record Linkage Process
K. Gomathi
PG Scholar,
Department of CSE, Anna University Chennai
Archana Institute of Technology, Krishnagiri
Mathikrishnan55@gmail.com
Abstract—The aim of record linkage is to match the records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of the candidate record pairs, generated during the indexing step, into matches, non-matches and possible matches. Traditional classification methods are based on setting threshold levels manually, which is a cumbersome and time-consuming process. In this paper we propose a two-step approach that classifies the candidate record pairs automatically. In the first step, a training set is automatically selected from the compared candidate record pairs using a weight vector classifier. In the second step, a Support Vector Machine (SVM) classifier is trained on, and iteratively improves, the training set produced in the first step. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
Keywords—Indexing, record linkage, training set, candidate record, classifier.
I. INTRODUCTION
The task of linking databases is an important step in an increasing number of data mining projects such as fraud and crime detection, national security and bioinformatics, because linked data contain information that is not available otherwise, or that would require a time-consuming and expensive collection of specific data. The records to be matched frequently correspond to entities that refer to people, such as clients or customers, patients, employees, tax payers, students, or travelers. Record linkage is now commonly used to improve data quality and integrity, to allow the re-use of existing data sources for new studies, and to reduce the costs and effort of data acquisition.
1.1 Record Linkage Process
The general steps involved in linking two databases are shown in Figure 1. Most real data is dirty and contains noisy or incomplete information.
Figure 1. Steps involved in the record linkage process: input datasets → data cleaning and standardization → indexing (sorted block indexing) → data comparison (edit distance) → pair classification (the novel approach) → evaluation.
The main task of data cleaning and standardization is the conversion of the raw input data into well-defined, consistent forms, as well as the resolution of inconsistencies in the way information is represented and encoded. The second step, indexing, generates pairs of candidate records that are compared in detail in the comparison step using a variety of comparison functions appropriate to the content of the record fields (attributes). The next step in the record linkage process is to classify the compared candidate record pairs into matches, non-matches, and possible matches, depending upon the decision model used. If record pairs are classified as possible matches, a clerical review process is required in which these pairs are manually assessed and classified into matches or non-matches. This is usually a time-consuming, cumbersome and error-prone process, especially when large databases are being linked or deduplicated. Measuring and evaluating the quality and complexity of a record linkage project is the final step in the record linkage process.
II. INDEXING FOR RECORD LINKAGE
The aim of the indexing step is to reduce the large number of potential comparisons by removing as many record pairs as possible that correspond to non-matches. Traditional record linkage has employed an indexing technique called blocking [2], which splits the databases into non-overlapping blocks. A blocking criterion, called a blocking key, is used; it is either a single record field (attribute) or the concatenation of values from several fields.
Because real-world data is dirty and contains errors, an important criterion for a good blocking key is that it groups similar values into the same block. Similarity can refer to similar-sounding or similar-looking values based on phonetic characteristics. For strings containing personal names, phonetic similarity can be obtained by using phonetic encoding functions such as Soundex or Double Metaphone.
TABLE 1
Example records with surnames and the Soundex encodings used as blocking keys

Identifier   Surname   BK (Soundex encoding)
R1           Smith     S530
R2           Miller    M460
R3           Peters    P362
R4           Myler     M460
R5           Smyth     S530
R6           Millar    M460
R7           Smyth     S530
R8           Miller    M460
Table 1 illustrates a small data set with the Soundex encoding scheme. For example, the block with key S530 in Table 1 generates the pairs (R1, R5), (R1, R7) and (R5, R7). These pairs are called candidate record pairs; they are compared in the comparison step using various string comparison functions.
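To make the blocking step concrete, the following Python sketch (illustrative code, not part of the original paper) builds blocks from the Table 1 records using a simplified Soundex encoding that ignores the H/W separator rule of the full algorithm, and then generates candidate pairs only within each block:

```python
from itertools import combinations
from collections import defaultdict

# Simplified Soundex letter-to-digit codes (H/W separator rule omitted).
CODES = {c: d for d, letters in enumerate(
    ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in letters}

def soundex(name: str) -> str:
    """Encode a name as its first letter plus up to three digits."""
    name = name.upper()
    first, prev = name[0], CODES.get(name[0], 0)
    digits = []
    for c in name[1:]:
        code = CODES.get(c, 0)          # vowels and H, W, Y map to 0
        if code and code != prev:       # skip repeated adjacent codes
            digits.append(str(code))
        prev = code
    return (first + "".join(digits) + "000")[:4]

# Records from Table 1; block on the Soundex encoding of the surname.
records = {"R1": "Smith", "R2": "Miller", "R3": "Peters", "R4": "Myler",
           "R5": "Smyth", "R6": "Millar", "R7": "Smyth", "R8": "Miller"}

blocks = defaultdict(list)
for rid, surname in records.items():
    blocks[soundex(surname)].append(rid)

# Candidate record pairs are generated only within each block.
for key, rids in sorted(blocks.items()):
    print(key, list(combinations(rids, 2)))
# e.g. S530 -> [('R1', 'R5'), ('R1', 'R7'), ('R5', 'R7')]
```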
2.1 Hamming Distance Comparison Function
The Hamming distance is used primarily for numerical fixed-size fields like zip codes or SSNs. It counts the number of positions at which two values disagree. For example, the Hamming distance between the zip codes 47905 and 46901 is 2, since they differ in two positions.
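A minimal illustrative sketch of this comparison function:

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two fixed-length values disagree."""
    assert len(a) == len(b), "Hamming distance requires equal-length inputs"
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("47905", "46901"))  # 2: the 2nd and 5th digits differ
```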
2.2 Edit Distance Comparison Function
The edit distance between two strings is the minimum cost to convert one of them into the other by a sequence of character insertions, deletions and replacements. Each of these modifications is assigned a cost value; for insertion and deletion the cost is equal to 1, and a cost is likewise assigned to replacement (also 1 in the classic Levenshtein formulation). The edit distance can be computed with a dynamic programming algorithm such as Smith-Waterman. Edit-distance-based functions are more accurate than the Hamming distance, which only counts positional mismatches and cannot account for insertions or deletions.
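A minimal dynamic-programming sketch with unit costs (the classic Levenshtein/Wagner-Fischer formulation, shown here instead of Smith-Waterman's alignment-based variant):

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and replacements
    turning s into t, computed by dynamic programming (unit costs)."""
    m, n = len(s), len(t)
    # dist[i][j] = edit distance between prefixes s[:i] and t[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                        # delete all of s[:i]
    for j in range(n + 1):
        dist[0][j] = j                        # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,            # deletion
                             dist[i][j - 1] + 1,            # insertion
                             dist[i - 1][j - 1] + replace)  # replace/match
    return dist[m][n]

print(edit_distance("Smith", "Smyth"))  # 1: one replacement (i -> y)
```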
III. COMPARISON OF RECORDS USING WEIGHT VECTORS
The two records in a candidate pair generated during indexing are compared using similarity functions applied to selected record attributes. These functions can be exact string or numerical comparisons, or approximate comparisons that take typographical variations into account. There are also various approaches to learning such similarity functions from training data [5]. Each similarity function returns a numerical matching weight that is usually normalized such that 1 corresponds to an exact match and 0 corresponds to no match. Similar, but not identical, values receive a matching weight somewhere between 0 and 1.
TABLE 2
Comparison of records using weight vectors

Record   Name        Surname   Street No
R1       Christine   Smith     42
R2       Christina   Smith     42
R3       Bob         OBrain    11
R4       Robert      Byree     12

WV(R1, R2): [0.9, 1.0, 1.0]
WV(R1, R3): [0.0, 0.0, 0.0]
WV(R1, R4): [0.0, 0.0, 0.5]
WV(R2, R3): [0.0, 0.0, 0.0]
WV(R2, R4): [0.0, 0.0, 0.5]
WV(R3, R4): [0.7, 0.3, 0.5]
As illustrated in the table, a weight vector is formed for each compared record pair, containing the matching weights calculated for that pair. Using these vectors, the pairs are classified as matches, possible matches, and non-matches, depending upon the decision model used.
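The following sketch shows how such weight vectors might be computed. The choice of similarity function is an assumption, since the paper does not specify which functions produced Table 2; here difflib's sequence-matching ratio from the Python standard library is used as a stand-in:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalised matching weight: 1.0 = exact match, 0.0 = no match."""
    return SequenceMatcher(None, a, b).ratio()

def weight_vector(rec_a, rec_b):
    """Compare two records field by field (name, surname, street number)."""
    return [round(similarity(x, y), 1) for x, y in zip(rec_a, rec_b)]

r1 = ("Christine", "Smith", "42")
r2 = ("Christina", "Smith", "42")
print(weight_vector(r1, r2))   # [0.9, 1.0, 1.0], as WV(R1, R2) in Table 2
```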
IV. TWO-STEP CLASSIFICATION
The idea of the two-step record pair classification is based on the following assumptions. First, weight vectors that contain exact or high similarity values in all their matching weights were, with high likelihood, generated when two records that refer to the same entity were compared. Second, weight vectors that contain mostly low similarity values were, with high likelihood, generated when two records that refer to different entities were compared. As a result, selecting such weight vectors to generate training data in a first step, and training a classifier on these weight vectors in a second step, should enable automatic, efficient and accurate record pair classification.
4.1 Step 1: Selection of Training Data
The aim of the first step is to select, from the set W of all weight vectors generated in the comparison step, those vectors that correspond to true matches and true non-matches, and to insert them into the match training set WM and the non-match training set WN, respectively. There are two approaches to selecting the training sets: distance-threshold based and nearest-neighbourhood based. We select the nearest-based approach since it outperforms the distance-threshold approach.
In this approach, weight vectors are sorted according to their Euclidean distances from the vector containing only exact similarities (all 1s) and from the vector containing only total dissimilarities (all 0s), and the vectors nearest to these two extremes are selected for the training sets WM and WN. An estimate of the ratio r of true matches to true non-matches can be calculated from the number of records in the two databases A and B to be linked and the number of weight vectors in W:

r = min(|A|, |B|) / (|W| - min(|A|, |B|)),

where |.| denotes the number of elements in a set (each record in the smaller database can match at most one record in the other, so the number of true matches is at most min(|A|, |B|)). The problem with a balanced training set is that weight vectors that likely do not refer to true matches will be selected into WM.
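A minimal sketch of this selection step, assuming the training set sizes n_match and n_nonmatch have already been fixed (for example, from the ratio r and a chosen seed size); the weight vectors are those of Table 2:

```python
import numpy as np

def select_training_sets(W, n_match, n_nonmatch):
    """Step 1: select the weight vectors nearest (by Euclidean distance)
    to the exact-match vector (all 1s) and to the total-dissimilarity
    vector (all 0s) as the match (WM) and non-match (WN) training sets."""
    d_match = np.linalg.norm(W - np.ones(W.shape[1]), axis=1)
    d_nonmatch = np.linalg.norm(W - np.zeros(W.shape[1]), axis=1)
    wm = np.argsort(d_match)[:n_match]         # closest to [1, ..., 1]
    wn = np.argsort(d_nonmatch)[:n_nonmatch]   # closest to [0, ..., 0]
    return wm, wn

# Weight vectors from Table 2, in the order (R1,R2), (R1,R3), (R1,R4),
# (R2,R3), (R2,R4), (R3,R4).
W = np.array([[0.9, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0], [0.0, 0.0, 0.5], [0.7, 0.3, 0.5]])
wm, wn = select_training_sets(W, n_match=1, n_nonmatch=2)
print(wm, wn)   # [0] [1 3]: WM = {(R1,R2)}, WN = {(R1,R3), (R2,R3)}
```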
4.2 Step 2: Classification of Record Pairs
Once the training sets for matches and non-matches have been generated, they can be used to train any binary classifier. In this paper the nearest neighbourhood classifier and an iterative SVM classifier are investigated. In the following sections, the set of weight vectors not selected into the training sets is denoted by WU, with WU = W \ (WM ∪ WN).
4.2.1 Nearest Neighbour Classification
The basic idea of this classifier is to iteratively add unclassified weight vectors from WU into the training sets until all weight vectors are classified. In each iteration, the unclassified weight vector closest to the already classified weight vectors is classified according to the majority vote of its k nearest classified neighbours (i.e. whether the majority are matches or non-matches). Using the training example sets, the nearest-based classifier can be implemented efficiently, as illustrated in Figure 2 and sketched below.
Figure 2. Example of the nearest neighbour classifier with 2-dimensional weight vectors and k = 1.
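A small illustrative implementation of this iterative scheme (a naive quadratic-time version; the paper's efficient implementation is not reproduced here), seeded with the training sets from the previous sketch:

```python
import numpy as np

def nn_classify(W, seeds, k=1):
    """Iteratively label unclassified weight vectors: in each iteration
    the unclassified vector closest to any classified vector is labelled
    by a majority vote of its k nearest classified neighbours (k odd)."""
    labels = dict(seeds)                 # index -> +1 (match) / -1 (non-match)
    todo = set(range(len(W))) - set(labels)
    while todo:
        cls = np.array(sorted(labels))   # indices of classified vectors
        # Euclidean distances from each unclassified vector to classified ones
        dists = {u: np.linalg.norm(W[cls] - W[u], axis=1) for u in todo}
        u = min(todo, key=lambda i: dists[i].min())
        vote = sum(labels[cls[j]] for j in np.argsort(dists[u])[:k])
        labels[u] = 1 if vote > 0 else -1
        todo.remove(u)
    return labels

# Seeds from Step 1: vector 0 is a match; vectors 1 and 3 are non-matches.
W = np.array([[0.9, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0], [0.0, 0.0, 0.5], [0.7, 0.3, 0.5]])
print(nn_classify(W, {0: 1, 1: -1, 3: -1}, k=1))
```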
4.2.2 Iterative SVM Classification
The iterative SVM classifier first trains an initial SVM using the training example sets WM and WN, and then iteratively adds the strongest positive and negative vectors from WU into the training sets of subsequent SVMs.
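A sketch of this iterative scheme using scikit-learn's SVC; the linear kernel, the number of iterations and the number of vectors added per iteration (n_add) are illustrative assumptions rather than the parameter settings evaluated in the paper:

```python
import numpy as np
from sklearn.svm import SVC

def iterative_svm(W, wm_idx, wn_idx, n_add=10, iterations=5):
    """Train an initial SVM on the seed sets WM and WN, then repeatedly
    move the unclassified vectors the current SVM is most confident
    about (largest positive / most negative decision values) into the
    training sets and retrain on the enlarged sets."""
    train = {i: 1 for i in wm_idx}               # +1 = match
    train.update({i: -1 for i in wn_idx})        # -1 = non-match
    for _ in range(iterations):
        idx = sorted(train)
        svm = SVC(kernel="linear").fit(W[idx], [train[i] for i in idx])
        todo = np.array(sorted(set(range(len(W))) - set(train)))
        if todo.size == 0:
            break
        conf = svm.decision_function(W[todo])
        order = np.argsort(conf)
        for j in order[:n_add]:                  # strongest negatives
            if conf[j] < 0:
                train[todo[j]] = -1
        for j in order[-n_add:]:                 # strongest positives
            if conf[j] > 0:
                train[todo[j]] = 1
    return svm.predict(W)                        # final match/non-match labels

# Example usage with the Step 1 seeds: labels = iterative_svm(W, wm, wn)
```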
4.3 Measurement Tools
The following subsections introduce the evaluation metrics using the following notation. Let nM and nU be the total numbers of matched and non-matched record pairs in the entire dataset. Let s be the size of the reduced comparison space generated by the searching method, and let sM and sU be the total numbers of matched and non-matched record pairs in the reduced comparison space. Finally, let c(a,d) be the number of record pairs whose actual matching status is a and whose predicted matching status is d, where a is either M or U, and d is either M, U or P; M, U and P represent matched, unmatched and possibly matched, respectively.
4.3.1 Reduction Ratio
The reduction ratio metric is defined as RR = 1 - s/(nM + nU). It measures the relative reduction in the size of the comparison space accomplished by a searching method.
4.3.2 Pairs Completeness
A searching method can be evaluated based on the number of actual matched record pairs contained in its reduced comparison space. We define the pairs completeness metric as the ratio of the matched record pairs found in the reduced comparison space to the total number of matched pairs in the entire comparison space. Formally, the pairs completeness metric is defined as PC = sM/nM.
4.3.3 Accuracy
The accuracy metric tests how accurate a decision model is. The accuracy of a decision model is defined as the percentage of correctly classified record pairs. Formally, the accuracy metric is defined as AC = (c(M,M) + c(U,U))/s.
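These three metrics are straightforward to compute. The sketch below illustrates them using DS-A's figures from Table 3 for the reduction ratio (1,000 records give 1000 x 999 / 2 = 499,500 record pairs in total); the split of this total into nM and nU is a hypothetical example, since only their sum enters RR:

```python
def reduction_ratio(s: int, n_m: int, n_u: int) -> float:
    """RR = 1 - s / (nM + nU): relative reduction of the comparison space."""
    return 1.0 - s / (n_m + n_u)

def pairs_completeness(s_m: int, n_m: int) -> float:
    """PC = sM / nM: fraction of true matches kept in the reduced space."""
    return s_m / n_m

def accuracy(c_mm: int, c_uu: int, s: int) -> float:
    """AC = (c(M,M) + c(U,U)) / s: fraction of correctly classified pairs."""
    return (c_mm + c_uu) / s

# DS-A: 1,000 records -> 499,500 possible pairs; 2,475 candidate pairs.
# The nM/nU split below is hypothetical; only nM + nU enters RR.
print(round(reduction_ratio(s=2475, n_m=1000, n_u=498500), 3))   # 0.995
```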
V. EXPERIMENTAL EVALUATION
The two record pair classifiers presented above were evaluated and compared with two other classifiers. The first classifier has access to the true match status of all weight vectors; nine parameter variations of it were evaluated. The second method is based on the hybrid approach implemented in the TAILOR toolbox [6]. It first employs k-means clustering and then uses the match and non-match clusters to train an SVM. The Euclidean distance function was used for the k-means step, while nine parameter variations were used for the classification step.
TABLE 3
Datasets used in the experiments

Dataset   Number of records   PC      RR      No. of weight vectors   Ratio r
DS-A      1,000               0.957   0.995   2,475                   1/1.48
DS-B      2,500               0.940   0.997   9,878                   1/2.95
DS-C      5,000               0.953   0.997   35,491                  1/6.10
DS-D      10,000              0.948   0.997   132,532                 1/12.25
Experiments were conducted using synthetic data, as summarized in Table 3. The four synthetic datasets of various sizes were created using the Febrl dataset generator. This synthetic data contains name and address attributes that are based on real-world frequency tables, and includes 60% original records and 40% duplicate records. The duplicates were randomly generated through modification of record attributes. All classifiers were implemented using the Febrl [5] record linkage system.
TABLE 4
Quality of the training sets selected by the nearest neighbourhood approach

Seed size   DS-A     DS-B     DS-C     DS-D
1%          100%     99.0%    100%     100%
5%          96.7%    98.4%    99.8%    99.8%
10%         95.5%    98.3%    99.5%    99.7%

Table 4 shows the quality of the seed training sets generated in the first step of the proposed classification approach, given as the percentage of correctly selected vectors.
5.1 Results and Discussion
As seen in Table 4, the training example sets selected in the first step of the two-step classification approach are mostly of very high quality, particularly the match training sets. While the 1% training set selection contains the highest percentage of correctly selected vectors, its size is very small, so classifiers based on the 1% seed perform worse than those based on the 5% and 10% seeds. The 1% classifier is therefore not included in the F-score results presented in Figure 3.
Figure 3.1. F-score measure for DS-B (2,500 records)
The F-score measure for dataset DS-A is not included since, at a 1% seed size, the number of records is too small. The weight vectors generated when attribute values are only partially similar have a mixed distribution of matches and non-matches that is hard to classify without knowing the true match status of these vectors. This can be seen in Table 4, where with a 10% seed size the training set contains 9% non-matches.
Figure 3.2. F-score measure for DS-C (5,000 records)
Figure 3.3. F-score measure for DS-D (10,000 records)
The F-score results for the synthetic datasets are shown in Figure 3. As expected, the supervised SVM classifier performs best on all datasets, while the TAILOR classifier has the lowest performance on most of them. The nearest-based classifier performs better than the iterative SVM on all synthetic datasets. These experiments show the limitation of unsupervised classification approaches such as the TAILOR hybrid approach, which works based only on pair-wise attribute similarities, compared to supervised classification.
[Figures 3.1-3.3 show F-score bar charts (scale 0 to 1) comparing the supervised SVM, TAILOR, nearest-based, SVM 0-0 and SVM 25-50 classifiers.]
VI. CONCLUSION AND FUTURE WORK
This paper presented a novel unsupervised two-step approach to record pair classification that is aimed at automating the record linkage process. The approach combines the automatic selection of training data with the training of a binary classifier. The two classifiers investigated, nearest-based and iterative SVM, achieve improved record pair classification results compared with other classifiers such as the TAILOR approach.
Future work will include conducting more experiments using different datasets, including runtime tests on datasets of various sizes, in order to experimentally assess scalability. The overall efficiency of the proposed classifier can be further improved using data reduction and fast searching and indexing techniques.
REFERENCES
[1] P. Christen, "A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication," IEEE Trans. Knowledge and Data Eng.
[2] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, "Duplicate Record Detection: A Survey," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 1, pp. 1-16, Jan. 2007.
[3] A. Aizawa and K. Oyama, "A Fast Linkage Detection Scheme for Multi-Source Information Integration," Proc. Int'l Workshop Challenges in Web Information Retrieval and Integration (WIRI '05), 2005.
[4] M.A. Hernandez and S.J. Stolfo, "Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem," Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 9-37, 1998.
[5] P. Christen, "Febrl: An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 1065-1068, 2008.
[6] P. Christen, "Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 151-159, 2008.
[7] R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage," Proc. ACM Workshop Data Cleaning, Record Linkage and Object Consolidation (SIGKDD '03), pp. 25-27, 2003.
[8] S. Sarawagi and A. Bhamidipaty, "Interactive Deduplication Using Active Learning," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.
[9] S. Tejada, C.A. Knoblock, and S. Minton, "Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.
[10] T. Churches, P. Christen, K. Lim, and J.X. Zhu, "Preparation of Name and Address Data for Record Linkage Using Hidden Markov Models," BioMed Central Medical Informatics and Decision Making, vol. 2, no. 9, 2002.