conference svm classifier

Upload: adhiyaman-pownraj

Post on 03-Apr-2018


  • 7/28/2019 Conference SVM Classifier


    A Novel two-step Approach for Record Pair Classification in

    Record Linkage Process

    K. Gomathi

    PG Scholar, Department of CSE, Anna University Chennai

    Archana Institute of Technology, Krishnagiri

    [email protected]

    Abstract: The aim of record linkage is to match the records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of the candidate record pairs, generated during the indexing step, into matches, non-matches and possible matches. Traditional classification methods are based on manually set thresholds, which is a cumbersome and time-consuming process. In this paper we propose a two-step approach that classifies candidate record pairs automatically. In the first step, a training set is automatically selected from the compared candidate record pairs using a weight vector classifier. In the second step, a Support Vector Machine classifier is used to improve upon the training set selected in the first step. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.

    Keywords: Indexing, record linkage, training set, candidate record, classifier.

    I. INTRODUCTION

    The task of linking databases is an important step in an increasing number of data mining projects, such as fraud and crime detection, national security, and bioinformatics, because linked data contain information that is not otherwise available, or would require the time-consuming and expensive collection of specific data. The records to be matched frequently correspond to entities that refer to people, such as clients or customers, patients, employees, tax payers, students, or travelers. Record linkage is now commonly used to improve data quality and integrity, to allow re-use of existing data sources for new studies, and to reduce costs and efforts in data acquisition.

    1.1 Record Linkage Process

    Figure 1 shows the general steps involved in the linking of two databases. Most real data is dirty and contains noisy or incomplete information.

    Figure 1. Steps involved in the record linkage process: input datasets, cleaning (data standardization), indexing (sorted block indexing), comparison (edit distance), pair classification (the proposed novel approach), and evaluation.


    The main task of data cleaning and standardization is the conversion of the raw input data into well-defined, consistent forms, as well as the resolution of inconsistencies in the way information is represented and encoded. The second step, indexing, generates pairs of candidate records that are compared in detail in the comparison step, using a variety of comparison functions appropriate to the content of the record fields (attributes). The next step in the record linkage process is to classify the compared candidate record pairs into matches, non-matches, and possible matches, depending upon the decision model used. If record pairs are classified as possible matches, a clerical review process is required where these pairs are manually assessed and classified into matches or non-matches. This is usually a time-consuming, cumbersome and error-prone process, especially when large databases are being linked or deduplicated. Measuring and evaluating the quality and complexity of a record linkage project is the final step in the record linkage process.

    II INDEXING FOR RECORD LINKAGE

    The aim of the indexing step is to reduce the large number of potential comparisons by removing as many record pairs as possible that correspond to non-matches. Traditional record linkage has employed an indexing technique called blocking [2], which splits the database into non-overlapping blocks. A blocking criterion, called the blocking key, is used; it is a single record field (attribute) or the concatenation of values from several fields.

    Because real-world data is dirty and contains errors, an important criterion of a good blocking key is that it groups similar values into the same block. Similarity can refer to similar-sounding or similar-looking values based on phonetic characteristics. For strings containing personal names, phonetic similarity can be obtained by using phonetic encoding functions such as Soundex or Double Metaphone.

    TABLE 1
    Example records with surnames and the Soundex encodings used as the blocking key (BK)

    Identifier  Surname  BK (Soundex encoding)
    R1          Smith    S530
    R2          Miller   M460
    R3          Peters   P362
    R4          Myler    M460
    R5          Smyth    S530
    R6          Millar   M460
    R7          Smyth    S530
    R8          Miller   M460

    Table 1 illustrates a small data set with the Soundex encoding scheme. For example, for the block S530 in Table 1, the pairs (R1, R5), (R5, R7) and (R1, R7) are generated. These pairs are called candidate record pairs, and they are compared in the comparison step using various string comparison functions.
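As a sketch of how the blocking key of Table 1 can be produced, the following Python snippet implements a basic variant of the Soundex encoding (real toolkits may use slightly different rules) and groups records into blocks to generate candidate pairs:

```python
from itertools import combinations

def soundex(name):
    """Basic Soundex: first letter plus up to three digits from consonant classes."""
    mapping = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            mapping[ch] = digit
    name = name.lower()
    code = name[0].upper()
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":            # h and w do not separate duplicate codes
            continue
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit              # vowels reset prev, separating duplicates
    return (code + "000")[:4]

def candidate_pairs(records):
    """Group record ids by the Soundex blocking key; pair up records per block."""
    blocks = {}
    for rid, surname in records:
        blocks.setdefault(soundex(surname), []).append(rid)
    return {bk: list(combinations(rids, 2)) for bk, rids in blocks.items()}

records = [("R1", "Smith"), ("R2", "Miller"), ("R3", "Peters"), ("R4", "Myler"),
           ("R5", "Smyth"), ("R6", "Millar"), ("R7", "Smyth"), ("R8", "Miller")]
pairs = candidate_pairs(records)
print(pairs["S530"])  # [('R1', 'R5'), ('R1', 'R7'), ('R5', 'R7')]
```

Running this on the records of Table 1 reproduces its blocking keys and the three candidate pairs of block S530.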

    2.1 Hamming Distance Comparison function

    The Hamming distance is used primarily for numerical fixed-size fields like zip code or SSN. It counts the number of position-wise mismatches between two values. For example, the Hamming distance between the zip codes 47905 and 46901 is 2, since the two values differ in two positions.
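The mismatch count described above can be sketched in a few lines of Python (the function name is illustrative; it assumes both values have the same fixed length, as with zip codes):

```python
def hamming_distance(a, b):
    """Count position-wise mismatches between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length values")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("47905", "46901"))  # 2 (the second and fifth digits differ)
```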

    2.2 Edit Distance Comparison function

    The edit distance between two strings is the minimum cost to convert one of them into the other by a sequence of character insertions, deletions and replacements. Each of these modifications is assigned a cost value; for insertion and deletion the cost is equal to 1, and replacement is assigned its own cost. The edit distance can be computed with the Smith-Waterman algorithm, which uses a dynamic programming technique. An edit distance based function is more accurate than the Hamming distance, since the Hamming distance cannot account for insertions and deletions when measuring the similarity between two strings.
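The section names the Smith-Waterman algorithm; as a minimal sketch, the closely related standard dynamic-programming recurrence for edit distance is shown below, with unit cost assumed for all three operations (the paper states unit costs only for insertion and deletion):

```python
def edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """Dynamic-programming edit distance; unit costs are assumed in this sketch."""
    m, n = len(s), len(t)
    # dp[i][j] = minimum cost to convert s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * del_cost
    for j in range(1, n + 1):
        dp[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j] + del_cost,   # delete s[i-1]
                           dp[i][j - 1] + ins_cost,   # insert t[j-1]
                           dp[i - 1][j - 1] + sub)    # match or replace
    return dp[m][n]

print(edit_distance("Smith", "Smyth"))  # 1 (one replacement: 'i' -> 'y')
```

Note that the Hamming distance of "Smiths" and "mithsX" would be 6, while the edit distance is only 2, which illustrates why edit distance handles insertions and deletions better.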


    III COMPARISON OF RECORDS USING

    WEIGHT VECTORS

    The two records in a candidate pair generated during indexing are compared using similarity functions applied to selected record attributes. These functions can range from exact string or numerical comparisons to functions that take typographical variations into account. There are also various approaches to learn such similarity functions from training data [5]. Each similarity function returns a numerical matching weight, usually normalized such that 1 corresponds to an exact match and 0 corresponds to no match. Similar, but not identical, values receive a matching weight somewhere between 0 and 1.

    TABLE 2
    Comparison of records using weight vectors

    Record  Name       Surname  Street No
    R1      Christine  Smith    42
    R2      Christina  Smith    42
    R3      Bob        OBrain   11
    R4      Robert     Byree    12

    WV(R1, R2): [0.9, 1.0, 1.0]
    WV(R1, R3): [0.0, 0.0, 0.0]
    WV(R1, R4): [0.0, 0.0, 0.5]
    WV(R2, R3): [0.0, 0.0, 0.0]
    WV(R2, R4): [0.0, 0.0, 0.5]
    WV(R3, R4): [0.7, 0.3, 0.5]

    As illustrated in the table, for each compared record pair a weight vector is formed that contains the matching weights calculated for that pair. Using these vectors, the pairs are classified as matches, possible matches, and non-matches, depending upon the decision model used.
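The construction of such weight vectors can be sketched as follows. The paper does not specify which similarity functions produce the weights in Table 2, so this sketch uses Python's difflib ratio as a stand-in; a real system might use Jaro-Winkler or an edit-distance based similarity instead:

```python
from difflib import SequenceMatcher

def sim(a, b):
    """Normalized similarity in [0, 1]: 1.0 for an exact match, lower otherwise.
    difflib's ratio is only a stand-in for the paper's similarity functions."""
    return round(SequenceMatcher(None, a, b).ratio(), 2)

def weight_vector(rec_a, rec_b):
    """Compare two records attribute by attribute into a weight vector."""
    return [sim(x, y) for x, y in zip(rec_a, rec_b)]

r1 = ("Christine", "Smith", "42")
r2 = ("Christina", "Smith", "42")
print(weight_vector(r1, r2))
```

For R1 and R2 this yields weights close to the [0.9, 1.0, 1.0] row of Table 2: the surnames and street numbers match exactly, while the given names are similar but not identical.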

    III TWO-STEP CLASSIFICATION

    The idea of the two-step record pair classification is based on the following assumptions. First, weight vectors that contain exact or high similarity values in all their matching weights were with high likelihood generated when two records that refer to the same entity were compared. Second, weight vectors that contain mostly low similarity values were with high likelihood generated when two records that refer to different entities were compared. As a result, selecting such weight vectors for generating training data in a first step, and training a classifier using these weight vectors in a second step, should enable automatic, efficient and accurate record pair classification.

    3.1 Step 1: Selection of Training Data

    The aim of the first step is to select, from the set W of all weight vectors generated in the comparison step, the vectors that correspond to true matches and true non-matches, and to insert them into the match training set WM and the non-match training set WN, respectively. There are two approaches to selecting the training sets: distance-threshold based and nearest-neighbourhood based. We select the nearest-neighbour based approach since it outperforms the distance-threshold approach.

    In this approach, weight vectors are sorted according to their Euclidean distances from the vector containing only exact similarities and from the vector containing only total dissimilarities; the vectors nearest to each of these are selected for the match and non-match training sets, respectively. An estimate of the ratio r of true matches to true non-matches can be calculated using the number of records in the two databases to be linked, A and B, and the number of weight vectors W:

    r = min(|A|, |B|) / (|W| - min(|A|, |B|)),

    where |.| denotes the number of elements in a set. The problem with a balanced training set is that weight vectors that likely do not correspond to true matches will be selected into WM.
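The ratio estimate can be sketched as follows, using min(|A|, |B|) as an upper bound on the number of true matches and treating the remaining weight vectors as non-matches; the resulting values reproduce the "Ratio r" column of Table 3:

```python
def match_ratio(num_a, num_b, num_weight_vectors):
    """Estimate the ratio r of true matches to true non-matches.
    min(|A|, |B|) bounds the possible true matches; the remaining
    weight vectors are assumed to be non-matches."""
    matches = min(num_a, num_b)
    return matches / (num_weight_vectors - matches)

# Deduplication of a single dataset: |A| = |B| = number of records.
for name, records, num_wv in [("DS-A", 1000, 2475), ("DS-B", 2500, 9878),
                              ("DS-C", 5000, 35491), ("DS-D", 10000, 132532)]:
    r = match_ratio(records, records, num_wv)
    print(f"{name}: r = 1/{1 / r:.2f}")
```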

    3.2 Step 2: Classification of Record Pairs

    Once the training sets for matches and non-matches have been generated, they can be used to train any binary classifier. In this paper the nearest-neighbourhood classifier and an iterative SVM classifier are investigated. In the following sections, the set of weight vectors not selected into the training sets is denoted by WU, with WU = W \ (WM U WN).

    3.2.1 Nearest Neighbor classification

    The basic idea of this classifier is to iteratively add unclassified weight vectors from WU into the training sets until all weight vectors are classified. In each iteration, the unclassified weight vector closest to k already classified weight vectors is classified according to the majority vote of its classified neighbours (i.e. whether the majority are matches or non-matches). Using training example sets, the nearest-neighbour based approach can be implemented efficiently, as shown in Figure 2.

    Figure 2: Example of the nearest neighbour classifier with 2-dimensional weight vectors and k=1.
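The iterative scheme above can be sketched in Python as follows. This is a minimal illustration on 2-dimensional weight vectors with k=1, as in Figure 2; the seed vectors and labels are hypothetical:

```python
import math

def nn_classify(seeds, unclassified, k=1):
    """Iteratively label the unclassified vector closest to the labelled set,
    by majority vote of its k nearest labelled neighbours."""
    labelled = list(seeds)          # list of (vector, label), label in {"M", "N"}
    remaining = list(unclassified)
    results = {}
    while remaining:
        # pick the unclassified vector with the smallest distance to any labelled one
        best = min(remaining,
                   key=lambda v: min(math.dist(v, w) for w, _ in labelled))
        neighbours = sorted(labelled, key=lambda lw: math.dist(best, lw[0]))[:k]
        votes = [lbl for _, lbl in neighbours]
        label = max(set(votes), key=votes.count)   # majority vote
        labelled.append((best, label))
        results[best] = label
        remaining.remove(best)
    return results

seeds = [((1.0, 1.0), "M"), ((0.0, 0.0), "N")]  # match / non-match seed vectors
labels = nn_classify(seeds, [(0.9, 0.8), (0.1, 0.2), (0.8, 0.9)])
print(labels)
```

Because newly labelled vectors immediately join the labelled set, clusters of similar weight vectors are absorbed step by step toward the nearest seed region.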

    3.2.2 Iterative SVM Classification

    The iterative SVM classifier first trains an initial SVM using the training example sets WM and WN, and then iteratively adds the strongest positive and negative vectors from WU into the training sets of subsequent SVMs.
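The iterative loop can be sketched as below. To keep the example self-contained, a tiny linear SVM trained by batch sub-gradient descent on the hinge loss stands in for a full SVM library; the seed vectors are hypothetical, and "strongest" here means the largest-magnitude decision score:

```python
def train_linear_svm(xs, ys, epochs=200, lam=0.01, lr=0.1):
    """Tiny linear SVM via batch sub-gradient descent on the regularized
    hinge loss; a stand-in for a real SVM library. ys must be +1/-1."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for x, y in zip(xs, ys):
            if y * (sum(wj * xj for wj, xj in zip(w, x)) + b) < 1:
                grad_w = [gj - y * xj for gj, xj in zip(grad_w, x)]
                grad_b -= y
        w = [wj - lr * gj / len(xs) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

def decision(w, b, x):
    """SVM score; positive leans toward 'match'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def iterative_svm(wm, wn, wu):
    """Train an SVM on the seed sets WM/WN, then repeatedly move the
    strongest positive and negative vectors from WU into the training
    sets of the next SVM, until WU is exhausted."""
    wm, wn, wu = list(wm), list(wn), list(wu)
    while wu:
        xs = wm + wn
        ys = [1] * len(wm) + [-1] * len(wn)
        w, b = train_linear_svm(xs, ys)
        ranked = sorted(wu, key=lambda v: decision(w, b, v))
        wn.append(ranked[0])        # strongest negative joins WN
        wu.remove(ranked[0])
        if wu:
            wm.append(ranked[-1])   # strongest positive joins WM
            wu.remove(ranked[-1])
    return wm, wn

seed_m = [(1.0, 1.0), (0.9, 1.0)]   # weight vectors assumed matches
seed_n = [(0.0, 0.0), (0.1, 0.0)]   # weight vectors assumed non-matches
final_m, final_n = iterative_svm(seed_m, seed_n, [(1.0, 0.9), (0.0, 0.1)])
print(final_m, final_n)
```

Retraining on the growing sets lets each subsequent SVM refine its decision boundary with the most confidently classified vectors first.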

    3.3 Measurement Tools

    The following subsections introduce the metrics using the following notation. Let nM and nU be the total number of matched and non-matched record pairs in the entire dataset. Let s be the size of the reduced comparison space generated by the searching method, and let sM and sU be the total number of matched and non-matched record pairs in the reduced comparison space. Finally, let c(a,d) be the number of record pairs whose actual matching status is a and whose predicted matching status is d, where a is either M or U, and d is either M, U or P, with M, U and P representing matched, unmatched and possibly matched respectively.

    3.3.1 Reduction Ratio

    The reduction ratio metric is defined as RR = 1 - s/(nM + nU). It measures the relative reduction in the size of the comparison space accomplished by a searching method.

    3.3.2 Pairs Completeness

    A searching method can be evaluated based on the number of actual matched record pairs contained in its reduced comparison space. We define the pairs completeness metric as the ratio of the matched record pairs found in the reduced comparison space to the total number of matched pairs in the entire comparison space. Formally, the pairs completeness metric is defined as PC = sM/nM.

    3.3.3 Accuracy

    The accuracy metric tests how accurate a decision model is. The accuracy of a decision model is defined as the percentage of correctly classified record pairs. Formally, the accuracy metric can be defined as AC = (c(M,M) + c(U,U))/s.

    IV EXPERIMENTAL EVALUATION

    The two record pair classifiers presented above were evaluated and compared with two other classifiers. The first classifier has access to the true match status of all weight vectors; nine parameter variations were evaluated. The second method is based on a hybrid approach implemented in the TAILOR toolbox [6]. It first employs k-means clustering and then uses the match and non-match clusters to train an SVM. The Euclidean distance function was evaluated for the k-means step, while for the classifier step nine parameter variations were used.

    TABLE 3

    Datasets used in experiments

    Dataset  Number of records  PC     RR     No. of weight vectors  Ratio r
    DS-A     1,000              0.957  0.995  2,475                  1/1.48
    DS-B     2,500              0.940  0.997  9,878                  1/2.95
    DS-C     5,000              0.953  0.997  35,491                 1/6.10
    DS-D     10,000             0.948  0.997  132,532                1/12.25

    Experiments were conducted using synthetic data, as summarized in Table 3. Four synthetic datasets of various sizes were created using the Febrl dataset generator. This synthetic data contains name and address attributes that are based on real-world frequency tables, and includes 60% original records and 40% duplicate records. The duplicates were generated by randomly modifying record attributes. All classifiers are implemented using the Febrl [3] record linkage system.



    TABLE 4

    Quality of the nearest-neighbourhood classifier

    Seed size  DS-A    DS-B    DS-C    DS-D
    1%         100%    99.0%   100%    100%
    5%         96.7%   98.4%   99.8%   99.8%
    10%        95.5%   98.3%   99.5%   99.7%

    Table 4 shows the quality of the seed training sets generated in the first step of the proposed classification approach, given as the percentage of correctly selected vectors.

    4.1 Results and Discussion

    As seen in Table 4, the training example sets selected in the first step of the two-step classification approach are mostly of very high quality with respect to the match training set. While the 1% training set selection contains the highest percentage of correctly selected vectors, its size is very small, so classifiers based on the 1% seed perform worse than those based on the 5% and 10% seeds. Therefore the 1% classifier is not included in the F-score results presented in Figure 3.

    Figure 3.1 F-score measure for DS-B (2,500 records)

    The F-score measure for dataset DS-A is not included, since with a 1% seed size the number of records is too small. Weight vectors generated when attribute values with a mixed distribution of matches and non-matches are compared are hard to classify without knowing the true match status of these vectors. This can be seen in Table 4, where with a 10% seed size the training set contains 9% non-matches.

    Figure 3.2 F-score measure for DS-C (5,000 records)

    Figure 3.3 F-score measure for DS-D (10,000 records)

    The F-score results for the synthetic datasets are shown in Figure 3. As expected, the supervised SVM classifier performs best on all datasets. The TAILOR classifier has the lowest performance on most of the datasets. The nearest-neighbour based classifier performs better than the iterative SVM for all synthetic datasets. These experiments showed the limitation of unsupervised classification such as the TAILOR hybrid approach, which works based only on pair-wise attribute similarities, compared to supervised classification.

    [Figure 3 bar charts: F-score (0 to 1) per dataset for the classifiers SVM, TAILOR, nearest based, SVM 0-0 and SVM 25-50.]


    V CONCLUSION AND FUTURE WORK

    This paper presented a novel unsupervised two-step approach to record pair classification that is aimed at automating the record linkage process. The approach combines automatic selection of training data with supervised classification of the remaining record pairs. The two classifiers, nearest-neighbour based and iterative SVM, achieve improved record pair classification results compared with other classifiers such as the TAILOR approach.

    Future work will include conducting more experiments using different datasets, including runtime tests on datasets of various sizes, in order to experimentally obtain scalability results. The overall efficiency of the proposed classifier can be further improved using data reduction and fast searching and indexing techniques.

    REFERENCES

    [1] P. Christen, "A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication," IEEE Trans. Knowledge and Data Eng.

    [2] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, "Duplicate Record Detection: A Survey," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 1, pp. 1-16, Jan. 2007.

    [3] A. Aizawa and K. Oyama, "A Fast Linkage Detection Scheme for Multi-Source Information Integration," Proc. Int'l Workshop Challenges in Web Information Retrieval and Integration (WIRI '05), 2005.

    [4] M.A. Hernandez and S.J. Stolfo, "Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem," Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 9-37, 1998.

    [5] P. Christen, "Febrl: An Open Source Data Cleaning, Deduplication and Record Linkage System With a Graphical User Interface," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 1065-1068, 2008.

    [6] P. Christen, "Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 151-159, 2008.

    [7] R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage," Proc. ACM Workshop Data Cleaning, Record Linkage and Object Consolidation (SIGKDD '03), pp. 25-27, 2003.

    [8] S. Sarawagi and A. Bhamidipaty, "Interactive Deduplication Using Active Learning," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.

    [9] S. Tejada, C.A. Knoblock, and S. Minton, "Learning Domain Independent String Transformation Weights for High Accuracy Object Identification," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.

    [10] T. Churches, P. Christen, K. Lim, and J.X. Zhu, "Preparation of Name and Address Data for Record Linkage Using Hidden Markov Models," BioMed Central Medical Informatics and Decision Making, vol. 2, no. 9, 2002.