SOUNDEX y FUZZY VLOOKUP FOR VBA EXCEL 201107
VBA]Function Str_Comp(st1 As String, st2 As String) As Double‘‘ returns a number showing % comparison between two names‘‘ i.e. =Str_Comp(A1, B1)‘‘ Format cell as Percentage to make it look pretty!!‘‘Dim MtchTbl(100, 100)Dim MyMax As Double, ThisMax As DoubleDim i As Integer, j As Integer, ii As Integer, jj As Integer
With WorksheetFunctionst1$ = Trim$(.Proper(st1$))st2$ = Trim$(.Proper(st2$))End With
MyMax# = 0
For i% = Len(st1$) To 1 Step -1For j% = Len(st2$) To 1 Step -1If Mid$(st1$, i%, 1) = Mid$(st2$, j%, 1) ThenThisMax# = 0For ii% = (i% + 1) To Len(st1$)For jj% = (j% + 1) To Len(st2$)If MtchTbl(ii%, jj%) > ThisMax# ThenThisMax# = MtchTbl(ii%, jj%)End IfNext jj%Next ii%MtchTbl(i%, j%) = ThisMax# + 1If (ThisMax# + 1) > ThisMax# ThenMyMax# = ThisMax# + 1End IfEnd IfNext j%Next i%Str_Comp = MyMax# / ((Len(st1$) + Len(st2$)) / 2)
End Function[/VBA]
Here's the details:
Public Function VFuzzyLookup(Lookup_Value As String, Table_Array As Variant)
Find the best match for a given string in column 1 of an array of data obtained from an Excel rangeThis is functionally similar to VLookup, but it returns the best match, not the first exact matchThis function is not case-sensitive.
If your data quality is poor, you are advised to display the retrieved index value from column 1 and use the MatchWord function on this index value to reveal the fuzzy-matching 'score' and discard all results below a threshold score.
Use VFuzzyLookup_Phrase if you are trying to match a phrase or sentence (a sequence of words separated by spaces) as that function is faster, and has additional logic for the word order.
Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional logic to normalise the common abbreviations and word-order conventions used in British addresses.
Yes, you read that correctly: you have a best-match lookup, vFuzzyLookup_Address() for lists of names and addresses. It could do with some performance-tuning (I wouldn't try it on lists exceeding 1024 members if you're in any kind of hurry) but it's usable.
The functions which will be of most interest to other programmers are probably MatchWord(), the simple string-comparison function which returns a percentage score based on Levenshtein edit distance; and MatchPhrase(), which uses MatchWord to create a comparison grid for the words of two sentences, and compares the constructed sequence of 'best match' words in the first sentence with their actual occurrence in the second.
This is the full list of functions , with internal page links:
VfuzzyLookup() VFuzzyLookup_Phrase VFuzzyLookup_Address MatchPhrase MatchWord Levenshtein NormaliseAddress StripChars Substitute
In another post I have coded up an alternative approach to the Levenshtein edit distance: a 'sum of common strings' score that may give better results for longer phrases and passages of text.
Option Explicit
Public Function VFuzzyLookup(Lookup_Value As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)
' Find the best match for a given string in column 1 of an array of data obtained from an Excel range' This is functionally similar to VLookup, but it returns the best match, not the first exact match' This function is not case-sensitive.
' If your data quality is poor, you are advised to display the retrieved index value from column 1
' and use the MatchWord() function on this index value to reveal the fuzzy-matching 'score' and' discard all results below a threshold score.
' Use VFuzzyLookup_Phrases if you are trying to match a phrase or sentence (a sequence of words' separated by spaces) as that function is faster, and has additional logic for the word order.
' Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional' logic to normalise the common abbreviations and word-order conventions used in British addresses.
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim dblBestMatch As Double
Dim iRowBest As IntegerDim dblMatch As DoubleDim iRow As IntegerDim strTest As StringDim strInput As String
Dim iStartCol As IntegerDim iEndCol As IntegerDim iOffset As Integer
If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then 'Table_Array is not an array VFuzzyLookup = "#VALUE" Exit FunctionEnd If
If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then Table_Array = Table_Array.ValueEnd If
' If you get a subscript-out-of-bounds error here, you're using a vector instead' of the 2-dimensional array that is the default 'Value' property of an Excel range.
iStartCol = LBound(Table_Array, 2)iEndCol = UBound(Table_Array, 2)iOffset = 1 - iStartCol
Col_Index_Num = Col_Index_Num - iOffset
If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then 'Out-of-bounds VFuzzyLookup = "#VALUE" Exit Function
End If
strInput = UCase(Lookup_Value)
iRowBest = -1 dblBestMatch = 0
For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)
strTest = "" strTest = Table_Array(iRow, iStartCol)
dblMatch = 0 dblMatch = MatchWord(strInput, strTest)
If dblMatch = 1 Then ' Bail out on finding an exact match iRowBest = iRow Exit For End If
If dblMatch > dblBestMatch Then dblBestMatch = dblMatch iRowBest = iRow End If
Next iRow
If iRowBest = -1 Then VFuzzyLookup = "#NO MATCH" Exit Function End If
VFuzzyLookup = Table_Array(iRowBest, Col_Index_Num)
End Function
Public Function MatchWord(ByVal str1 As String, ByVal str2 As String, Optional Compare As VbCompareMethod = vbTextCompare) As Double
' Returns a percentage estimate of how closely word 1 matches word 2' Edit distances exceeding the length of str1 are discarded, returning a percentage match of zero
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim maxLen As Integer
Dim minLen As Integer
If Compare = vbTextCompare Then str1 = UCase(str1) str2 = UCase(str2)End If
If str1 = str2 Then MatchWord = 1 Exit Function End If
If Len(str1) > Len(str2) Then maxLen = Len(str1) minLen = Len(str2) Else maxLen = Len(str2) minLen = Len(str1) End If
MatchWord = 0 MatchWord = Levenshtein(str1, str2)
If MatchWord >= minLen Then MatchWord = 0 Else MatchWord = (maxLen - MatchWord) / maxLen End If
End Function
Public Function VFuzzyLookup_Phrase(Lookup_Phrase As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)
' Find the best match for a given phrase in column 1 of an array of data obtained from an Excel range
' Use this function to match a sentence (a sequence of words separated by spaces). Returns a score' based on matching word order weighted by the string-matching score of each individual word.
' This is functionally similar to VLookup, but it returns the best match, not the first exact match' This function is not case-sensitive.
' If your data quality is poor, you are advised to display the retrieved index value from column 1' and use the MatchPhrase() function on this index value to reveal the fuzzy-matching 'score'; consider' discarding all results below a threshold score.
' Use VFuzzyLookup for simple string comparisons.
' Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional' logic to normalise the common abbreviations and word-order conventions used in British addresses.
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim dblBestMatch As Double
Dim iRowBest As IntegerDim dblMatch As DoubleDim iRow As IntegerDim strTest As StringDim strInput As String
Dim iStartCol As IntegerDim iEndCol As IntegerDim iOffset As Integer
If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then 'Table_Array is not an array VFuzzyLookup_Phrase = "#VALUE" Exit FunctionEnd If
If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then Table_Array = Table_Array.ValueEnd If
' If you get a subscript-out-of-bounds error here, you're using a vector instead' of the 2-dimensional array that is the default 'Value' property of an Excel range.
iStartCol = LBound(Table_Array, 2)iEndCol = UBound(Table_Array, 2)iOffset = 1 - iStartCol
Col_Index_Num = Col_Index_Num - iOffset
If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then 'Out-of-bounds VFuzzyLookup_Phrase = "#VALUE" Exit FunctionEnd If
strInput = UCase(Lookup_Phrase)
iRowBest = -1
dblBestMatch = 0
For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)
strTest = "" strTest = Table_Array(iRow, iStartCol)
dblMatch = 0 dblMatch = MatchPhrase(strInput, strTest) ' Consider coding up a MatchPhrase_Express() function, with the preprocessing ' (StripChars, Split) of strInput done here, rather than repeatedly
If dblMatch = 1 Then ' Bail out on finding an exact match iRowBest = iRow Exit For End If
If dblMatch > dblBestMatch Then dblBestMatch = dblMatch iRowBest = iRow End If
Next iRow
If iRowBest = -1 Then VFuzzyLookup_Phrase = "#NO MATCH" Exit Function End If
VFuzzyLookup_Phrase = Table_Array(iRowBest, Col_Index_Num)
End Function
Public Function VFuzzyLookup_Address(Lookup_Address As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)
' Find the best match for a given postal address in column 1 of a table of addresses obtained from an Excel range
' This is functionally similar to VLookup, but it returns the best match, not the first exact match' This function is not case-sensitive and ignores common abbreviations eg: 'St' for 'Street'
' If your data quality is poor, you are advised to display the retrieved index value from column 1' and use the MatchPhrase() function on this index value to reveal the fuzzy-matching 'score'; consider' discarding all results below a threshold score.
' Use VFuzzyLookup for simple string comparisons.
' Use vFuzzyLookup_Phrase if you are looking up phrases or sentences that are not addresses: this address lookup' function discards a lot of common words like 'Street' and expands abbreviations like 'Ave' and 'Blvd'
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim dblBestMatch As Double
Dim iRowBest As IntegerDim dblMatch As DoubleDim iRow As IntegerDim strTest As StringDim strInput As String
Dim iStartCol As IntegerDim iEndCol As IntegerDim iOffset As Integer
If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then 'Table_Array is not an array VFuzzyLookup_Address = "#VALUE" Exit FunctionEnd If
If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then Table_Array = Table_Array.ValueEnd If
' If you get a subscript-out-of-bounds error here, you're using a vector instead' of the 2-dimensional array that is the default 'Value' property of an Excel range.
iStartCol = LBound(Table_Array, 2)iEndCol = UBound(Table_Array, 2)iOffset = 1 - iStartCol
Col_Index_Num = Col_Index_Num - iOffset
If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then 'Out-of-bounds VFuzzyLookup_Address = "#VALUE" Exit FunctionEnd If
strInput = Lookup_Address strInput = NormaliseAddress(strInput)
If strInput = "" Then 'Out-of-bounds VFuzzyLookup_Address = "#CANNOT READ ADDRESS" Exit FunctionEnd If
iRowBest = -1 dblBestMatch = 0
For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)
strTest = "" strTest = Table_Array(iRow, iStartCol) strTest = NormaliseAddress(strTest)
If strTest <> "" Then
dblMatch = 0 dblMatch = MatchPhrase(strInput, strTest) ' Consider coding up a MatchPhrase_Express() function, with the preprocessing ' (StripChars, Split) of strInput done here, rather than repeatedly
If dblMatch = 1 Then ' Bail out on finding an exact match iRowBest = iRow Exit For End If
If dblMatch > dblBestMatch Then dblBestMatch = dblMatch iRowBest = iRow End If
End If ' strTest <> ""
Next iRow
If iRowBest = -1 Then VFuzzyLookup_Address = "#NO MATCH" Exit Function End If
VFuzzyLookup_Address = Table_Array(iRowBest, Col_Index_Num)
End Function
Public Function MatchPhrase(ByVal Phrase1 As String, ByVal Phrase2 As String, Optional Compare As VbCompareMethod = vbTextCompare) As Double
' Function to compare two sentences. A version of this will be released to cater for the' specific needs of matching addresses, where we can make some assumptions about common' word-substitutions and abbreviations.
' THIS CODE IS IN THE PUBLIC DOMAIN
' This function consists of six processes:
' 1 Break out the phrases into arrays of words using the space character as the delimiter' 2 Populate a grid of word-matching scores for each word in Phrase 1 against Phrase 2;' 3 For each word in Phrase 1, identify the 'best match' from the words in Phrase 2' 4 Resolve 'collisions' - two or more words in phrase 1 matching the same word in phrase 2' 5 Compare the actual sequence of words in P1 with the positions of the matched words in P2' 6 Weight this comparison by the degree of matching measured at the level of individual words
' Process 4, resolving collisions, is an iterative loop inside process 3' Process 1 has an addditional step to check for deleted spaces
Dim arr1() As String ' Phrase 1, broken out into individual wordsDim arr2() As String
Dim arrScores() As Double ' an array of percentage matches of each word in p1 against each word in p2
' These two vectors are redundant in the sense that they hold information which ' can be extracted from arrScores(). However, using them saves a lot of looping:
Dim arrPositions() As Integer ' For each word in p1, the position of the best-matching word in p2Dim arrSequence() As Double ' For each word in p1, a score for its concordance with a constructed sequence of matching words in P2
Dim n As Double ' should be an integer, but it will be used in floating-point ' division and I prefer to avoid casting in VBADim s1 As StringDim s2 As String
Dim i As IntegerDim j As IntegerDim k As Integer
Dim iOffset As IntegerDim iShift As IntegerDim iDelete As Integer
Dim iPos As IntegerDim jPos As IntegerDim kPos As Integer
Dim iTotalLen As Integer
Dim dScore As DoubleDim dBest As DoubleDim dPenalty As Double
Dim d1 As DoubleDim d2 As Double
If Compare = vbTextCompare Then Phrase1 = UCase(Phrase1) Phrase2 = UCase(Phrase2)End If
If Phrase1 = Phrase2 Then MatchPhrase = 1 Exit FunctionEnd If
' The line labels SplitSpace1 and SplitSpace2 are resynchronisation points for' restarting the process after restoring a deleted space in Phase1 or Phrase2.
Phrase1 = StripChars(Phrase1, " ")SplitSpace1:arr1 = Split(Phrase1, " ")
Phrase2 = StripChars(Phrase2, " ")SplitSpace2:arr2 = Split(Phrase2, " ")
ReDim arrScores(LBound(arr1) To UBound(arr1), LBound(arr2) To UBound(arr2))ReDim arrPositions(LBound(arr1) To UBound(arr1))ReDim arrSequence(LBound(arr1) To UBound(arr1))
' Test for deleted spaces. This is a lot of work, but a missing space is a' common error and the effects are out of all proportion to the size of the' error: so much so that I'm prepared to risk the occasional 'false alarm'.' It may even be worth repeating these two loops using fuzzy-matching with' Levenshtein scores rather than the simple string-comparisons shown below:
For i = LBound(arr1) To UBound(arr1) - 1
If arr1(i) <> "" And arr1(i + 1) <> "" Then
s1 = arr1(i) & arr1(i + 1)
For j = LBound(arr2) To UBound(arr2) If UCase(arr2(j)) = UCase(s1) Then Phrase2 = Substitute(Phrase2, arr2(j), arr1(i) & " " & arr1(i + 1), 1, Compare) GoTo SplitSpace2 End If Next j
End If ' arr(i) = "" Or arr(i + 1) = "" Then
Next i
For j = LBound(arr2) To UBound(arr2) - 1
If arr2(j) <> "" And arr2(j + 1) <> "" Then
s2 = arr2(j) & arr2(j + 1)
For i = LBound(arr1) To UBound(arr1) If UCase(arr1(i)) = UCase(s2) Then Phrase1 = Substitute(Phrase1, arr1(i), arr2(j) & " " & arr2(j + 1), 1, Compare) GoTo SplitSpace1 End If Next i
End If
Next j
' Initialise the positions array with a negative value denoting 'not found'
For i = LBound(arr1) To UBound(arr1) arrPositions(i) = -1 iTotalLen = iTotalLen + Len(arr1(i))Next i
' For each word in Phrase 1, identify the closest matching in Phrase 2 and record its position.
For i = LBound(arr1) To UBound(arr1)
s1 = arr1(i) dBest = 0 iPos = -1
For j = LBound(arr2) To UBound(arr2)
s2 = arr2(j) dScore = 0
dScore = MatchWord(s1, s2, Compare)
arrScores(i, j) = dScore If dScore > dBest Then dBest = dScore iPos = j End If
Next j
If iPos >= 0 Then arrPositions(i) = iPos End If
Next i
' Resolve collisions - two or more words in P1 that have 'best match' scores on the same word in p2' In theory this could be done without using the positions vector, as the information is in arrScores' In practice, arrPositions saves processing steps
For i = LBound(arrPositions) To UBound(arrPositions)
iPos = arrPositions(i)
For j = i + 1 To UBound(arrPositions)
If iPos = arrPositions(j) And iPos >= 0 Then ' Collision detected: which word has the best score? d1 = arrScores(i, iPos) d2 = arrScores(j, iPos)
If d2 > d1 Then
'discard this recorded 'best match' position: arrScores(i, iPos) = -1
'find the second-best score for d1 dBest = 0 kPos = -1 For k = LBound(arrScores, 2) To UBound(arrScores, 2) dScore = 0 dScore = arrScores(i, k) If dScore > dBest Then dBest = dScore kPos = k End If Next k ' reset this conflicting position as word (i)'s match in phrase 2: arrPositions(j) = kPos ' There is now a possibility that we have caused ' a collision with a previous word in Phrase 1:
If k < i Then For k = LBound(arrPositions) To k - 1 If arrPositions(k) = kPos Then 'restart the loop at the colliding value i = k j = UBound(arr1) + 1 Exit For End If Next k End If ' k<1
Else
' discard this recorded 'best match' position: arrScores(j, iPos) = -1
'find the second-best score for d2 *after* the current position dBest = 0 kPos = -1 For k = j + 1 To UBound(arr2) dScore = 0 dScore = arrScores(j, k) If dScore > dBest Then dBest = dScore kPos = k End If Next k
arrPositions(j) = kPos
End If ' d2 > d1
End If
Next j
Next i
' Constructing a sequence-matching score:
' If we were scoring jumbled sentences of unaltered words, we'd use an edit distance algorithm;' several are available, including replicating the Levenshtein distance at the word level. I've' chosen a crude single-pass algorithm with a forward bias, that 'expects' the word sequence to' resynchronise after each out-of-sequence word. It's quick, and the bias is valid - word-order' is not neutral in real-life examples, and the heavy penalty for word transpositions reflects' my belief that this is a more significant 'edit' than character transpositions in a word. A' more rigorous treatment would venture into the realms of natural-language
processing; that is' out-of-scope for this application and far too ambitious for a self-contained function in VBA.
' Worked example:
' Compare two Phrases:' "ABC DEF GHI JKL MNO PQR STU VWX", "ABC DEF JKL STU MNO PQR VWX"
' Variable arrPositions records the placement of each word in phrase 1 in phrase 2:
' Phrase 1 "ABC DEF GHI JKL MNO PQR STU VWX"' Expected positions: 0 1 2 3 4 5 6 7' Actual position in p2 0 1 -1 2 4 5 3 6
' The variable arrSequence will capture the scores ' Run the sequence-scoring loop:
' ABC expected in position 0 found in 0 Score 1/8' DEF expected in position 1 found in 1 Score 1/8' GHI expected in position 2 DELETION * frame shift -1 * Score NIL' JKL expected in position 3-1 found in 2 Score 1/8' MNO expected in position 4-1 found in 4 * frame shift +1 * Score 1/8 * 7/8' PQR expected in position 5 found in 5 Score 1/8' STU expected in position 6 found in 3 * frame shift -3 * Score 1/8 * (7/8)^3' VWX expected in position 7-3 found in 6 * frame shift +2 * Score 1/8 * (7/8)^2
' Edit distance is 7: the out-of-sequence penalty of 7/8 will be applied seven times
' However, we do not deal with perfectly-matched words in real life, so we cannot apply' these penalties at the level of the entire phrase; we apply them at the level of the' individual word, where we can apply a weighting based on each word's Levenshtein score
' The exception is deleted words; we could consider the 'word match' weighting of zero' to be sufficient penalty but a more consistent result is obtained by applying a penalty' to the entire phrase
' Sanity check; run the function in reverse, testing Phrase 2 against phrase 1:
' Phrase 2 "ABC DEF JKL STU MNO PQR VWX"' Expected positions: 0 1 2 3 4 5 6' Actual position in p1 0 1 3 6 4 5 7
' ABC expected in position 0 found in 0 Score 1/8' DEF expected in position 1 found in 1 Score 1/8' JKL expected in position 2 found in 3 * frame shift +1 * Score 1/8 * 7/8' STU expected in position 3+1 found in 6 * frame shift +2 * Score 1/8 *
(7/8)^2' MNO expected in position 4+3 found in 4 * frame shift -3 * Score 1/8 * (7/8)^3' PQR expected in position 5 found in 5 Score 1/8' VWX expected in position 6 found in 7 * frame shift -1 * Score 1/8 * 7/8
' Edit distance is 7: the out-of-sequence penalty of 7/8 will be applied seven times
' "But wasn't there an insertion, too? Phrase 1 has an extra word that isn't in Phrase 2!"
' Note that our choice of denominator (8, the longer of the two wordcounts) has the effect of' imputing a score of zero to the inserted word and applying a penalty of 7/8 to the entire phrase.
' A note on identifying the 'inserted word': actually, it's the word in Phrase 1 which didn't' score as 'best match' against any word in Phrase 2. It could've come a close second to any or' all of them.
If UBound(arr1) >= UBound(arr2) Then n = UBound(arr1) + 1Else n = UBound(arr2) + 1End If
dPenalty = 1 - (1 / n)iShift = 0 ' Sequence distance for out-of-place wordsiOffset = 0 ' Running total of this 'shift' variableiDelete = 0 ' Count the number of deletions
For i = LBound(arrPositions) To UBound(arrPositions)
s1 = arr1(i)
iPos = arrPositions(i) iShift = iPos - i - iOffset
Select Case iPos Case Is < 0 'DELETION: no matching word was found in S2
iShift = -1 arrSequence(i) = 0 iDelete = iDelete + 1
Case Is = i + iOffset ' matched word is in the expected position
iShift = 0 arrSequence(i) = 1 / n
Case Else
arrSequence(i) = (dPenalty ^ Abs(iShift)) / n
End Select
iOffset = iOffset + iShift
Next i
MatchPhrase = 0
For i = LBound(arrPositions) To UBound(arrPositions) dScore = 0 If arrPositions(i) > -1 Then dScore = arrScores(i, arrPositions(i)) dScore = dScore * arrSequence(i) Else 'apply a deletion penalty - this isn't as arbitrary as it might seem: it is a equivalent to the ' effect of an insertion, which acts by increasing the denominator dScore = -Len(arr1(i)) / iTotalLen / n End If MatchPhrase = MatchPhrase + dScoreNext i
ExitFunction:
Erase arrScores Erase arrSequence Erase arr1 Erase arr2
End Function
Private Function Minimum(ByVal a As Integer, _ ByVal b As Integer, _ ByVal c As Integer) As IntegerDim min As Integer
min = a
If b < min Then min = b End If
If c < min Then min = c End If
Minimum = min
End Function
Private Function Levenshtein(ByVal s1 As String, ByVal s2 As String) As Integer
' Levenshtein Distance - edit distance between two strings
' THIS CODE IS IN THE PUBLIC DOMAIN
Dim arr() As Integer ' Scoring matrixDim n As Integer ' length of s1Dim m As Integer ' length of s2Dim i As Integer ' iterates through s1Dim j As Integer ' iterates through s2Dim s1_i As String ' ith character of s1Dim s2_j As String ' jth character of s2Dim cost As Integer ' cost
n = Len(s1)m = Len(s2)
If n = 0 Then Levenshtein = m Exit FunctionEnd If
If m = 0 Then Levenshtein = n Exit FunctionEnd If
ReDim arr(0 To n, 0 To m) As Integer
For i = 0 To n arr(i, 0) = iNext i
For j = 0 To m arr(0, j) = jNext j
For i = 1 To n
s1_i = Mid$(s1, i, 1)
For j = 1 To m
s2_j = Mid$(s2, j, 1)
If s1_i = s2_j Then cost = 0 Else cost = 1 End If
arr(i, j) = Minimum(arr(i - 1, j) + 1, arr(i, j - 1) + 1, arr(i - 1, j - 1) + cost)
Next j
Next i
' Step 7
Levenshtein = arr(n, m)
ExitSub: Erase arrEnd Function
Public Function NormaliseAddress(ByVal strAddress As String) As String
' This function is intended to remove or standardise common phrases' and abbreviations used in British postal addresses, allowing the use' of string-comparison algorithms in lists of names and addresses.
' Developers in other countries should review the word list used here,' as conventions probably differ in your local language or dialect.
strAddress = " " & UCase(strAddress) & " "
strAddress = Substitute(strAddress, ",", " ")strAddress = Substitute(strAddress, ".", " ")strAddress = Substitute(strAddress, "-", " ")strAddress = Substitute(strAddress, vbCrLf, " ")strAddress = Substitute(strAddress, " BLVD ", " BOULEVARD ")strAddress = Substitute(strAddress, " BVD ", " BOULEVARD ")strAddress = Substitute(strAddress, " AV ", " AVENUE ")strAddress = Substitute(strAddress, " AVE ", " AVENUE ")strAddress = Substitute(strAddress, " RD ", " ROAD ")strAddress = Substitute(strAddress, " WY ", " WAY ")strAddress = Substitute(strAddress, " EST ", " ESTATE ")strAddress = Substitute(strAddress, " PL ", " PLACE ")strAddress = Substitute(strAddress, " PK ", " PARK ")strAddress = Substitute(strAddress, " HSE ", " HOUSE ")strAddress = Substitute(strAddress, " H0 ", " HOUSE ")strAddress = Substitute(strAddress, " GDNS ", " GARDENS ")
strAddress = Substitute(strAddress, "&", "AND")strAddress = Substitute(strAddress, " LIMITED ", " LTD ")strAddress = Substitute(strAddress, " COMPANY ", " CO ")strAddress = Substitute(strAddress, " CORPORATION ", " CORP ")strAddress = Substitute(strAddress, " T/A ", " TA ")
strAddress = Substitute(strAddress, " TRADING AS ", " TA ")
' Common personal titles: these are often applied inconsistently or' omitted, and must therefore be removed. Specific applications may' require additional titles and their abbreviations - military rank,' academic titles and degrees, courtesy titles of the aristocracy,' knighthoods and honours (particularly for lists of civil servants)
strAddress = Substitute(strAddress, " ESQ ", " ")strAddress = Substitute(strAddress, " MR ", " ")strAddress = Substitute(strAddress, " MRS ", " ")strAddress = Substitute(strAddress, " MISS ", " ")strAddress = Substitute(strAddress, " MS ", " ")strAddress = Substitute(strAddress, " MESSRS ", " ")strAddress = Substitute(strAddress, " SIR ", " ")strAddress = Substitute(strAddress, " OF ", " ")strAddress = Substitute(strAddress, " DR ", " ")strAddress = Substitute(strAddress, " OR ", " ")strAddress = Substitute(strAddress, " IN ", " ")strAddress = Substitute(strAddress, " THE ", " ")strAddress = Substitute(strAddress, " REVEREND ", " REV ")strAddress = Substitute(strAddress, " REVERENT ", " REV ")strAddress = Substitute(strAddress, " HONOURABLE ", " HON ")strAddress = Substitute(strAddress, " BROS ", " BROTHERS ")strAddress = Substitute(strAddress, " ASSOC ", " ASSOCIATION ")strAddress = Substitute(strAddress, " ASSN ", " ASSOCIATION ")
' Standardising 'St.', 'St', and 'Street'. Note that there are over 40 English' towns and place names that contain or consist entirely of the word 'Street'.' In addition, 'St' is a common abbreviation for 'Saint' in addresses.
' I have never seen a list of addresses where 'Street' and 'St' were used in a' consistent way, and the only workable solution is to delete them all:
strAddress = Substitute(strAddress, " STREET ", " ")strAddress = Substitute(strAddress, " ST ", " ")strAddress = Substitute(strAddress, " STR ", " ")
Do While InStr(strAddress, " ") > 0 strAddress = Substitute(strAddress, " ", " ")Loop
strAddress = Trim(strAddress)
NormaliseAddress = strAddress
End Function
Public Function StripChars(myString As String, ParamArray Exceptions()) As String
' Strip out all non-alphanumeric characters from a string in a single pass' Exceptions parameters allow you to retain specific characters (eg: spaces)
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim i As IntegerDim iLen As IntegerDim chrA As String * 1Dim intA As IntegerDim j As IntegerDim iStart As IntegerDim iEnd As Integer
If Not IsEmpty(Exceptions()) Then iStart = LBound(Exceptions) iEnd = UBound(Exceptions)End If
iLen = Len(myString)
For i = 1 To iLen chrA = Mid(myString, i, 1) intA = Asc(chrA) Select Case intA Case 48 To 57, 65 To 90, 97 To 122 StripChars = StripChars & chrA Case Else If Not IsEmpty(Exceptions()) Then For j = iStart To iEnd If chrA = Exceptions(j) Then StripChars = StripChars & chrA Exit For ' j End If Next j End If End SelectNext i
End Function
Private Function Substitute(ByVal Text As String, _ ByVal Old_Text As String, _ ByVal New_Text As String, _ Optional Instance As Integer = 0, _ Optional Compare As VbCompareMethod = vbTextCompare _ ) As String
'Replace all instances (or the nth instance ) of 'Old' text with 'New'' Unlike VB.Mid$ this method is not sensitive to length and can replace ALL instances' This is not exposed as a Public function because there is an Excel Worksheet function called Substitute()
' THIS CODE IS IN THE PUBLIC DOMAIN
Dim iStart As IntegerDim iEnd As IntegerDim iLen As IntegerDim iInstance As IntegerDim strOut As String
iLen = Len(Old_Text)
If iLen = 0 Then Substitute = Text Exit FunctionEnd If
iEnd = 0iStart = 1
iEnd = InStr(iStart, Text, Old_Text, Compare)
If iEnd = 0 Then Substitute = Text Exit FunctionEnd If
strOut = ""
Do Until iEnd = 0
strOut = strOut & Mid$(Text, iStart, iEnd - iStart) iInstance = iInstance + 1
If Instance = 0 Or Instance = iInstance Then strOut = strOut & New_Text Else strOut = strOut & Mid$(Text, iEnd, Len(Old_Text)) End If
iStart = iEnd + iLen iEnd = InStr(iStart, Text, Old_Text, Compare)
Loop
iLen = Len(Text)strOut = strOut & Mid$(Text, iStart, iLen - iEnd)
Substitute = strOut
End Function
Public Function Contains(ByVal MainString As String, ParamArray SeekString()) As Boolean
' A version of Instr() that returns TRUE if any 'seek' string is a substring of the
main string
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim i As IntegerDim j As IntegerDim arrX As VariantDim strTest As StringContains = False
If MainString = "" Then Exit FunctionEnd If
For i = LBound(SeekString) To UBound(SeekString)
If IsArray(SeekString(i)) Then arrX = SeekString(i) For j = LBound(arrX) To UBound(arrX)
strTest = "" strTest = Trim(arrX(j))
If Len(strTest) <= Len(MainString) And Len(strTest) > 0 Then
If InStr(1, MainString, strTest, vbTextCompare) > 0 Then Contains = True Exit Function End If
End If
Next j Erase arrX
Else
strTest = "" strTest = Trim(SeekString(i))
If Len(strTest) > 0 Then
If Len(strTest) < Len(MainString) And Len(strTest) > 0 Then
If InStr(1, MainString, strTest, vbTextCompare) > 0 Then Contains = True Exit Function End If
End If
End If
End If
Next i
End Function
There's a particular point about the Phrase-matching function that could probably be improved:
' For each word in Phrase 1, identify the closest matching in Phrase 2 and record its position.
Access can incorporate globally-declared VBA functions in queries, so try this out:
SELECT TOP 3
"Buckingham Palace, LONDON SW1" AS SeekAddress, MatchPhrase( NormaliseAddress("Buckingham Palace, LONDON SW1"), NormaliseAddress([tblAddress].[FullAddress]) ) AS MatchScore, tblAddress.*
FROM tblAddress
ORDER BY MatchPhrase( NormaliseAddress("Buckingham Palace, LONDON SW1"), NormaliseAddress([tblAddress].[FullAddress]) ) DESC
Function SOUNDEX(Surname As String) As String' The Custom Function "SOUNDEX" used in this macro was Developed by Richard J. Yanco.' This function follows the Soundex rules given at' http://home.utah-inter.net/kinsearch/Soundex.html' The VB Application "Soundex_Codes; By: Joe Was & John Walkenbach.
Dim Result As String, c As String * 1Dim Location As Integer
Surname = UCase(Surname)
' First character must be a letterIf Asc(Left(Surname, 1)) < 65 Or Asc(Left(Surname, 1)) > 90 ThenSOUNDEX = ""Exit FunctionElse' St. is converted to SaintIf Left(Surname, 3) = "ST." ThenSurname = "SAINT" & Mid(Surname, 4)End If
' Convert to Soundex: letters to their appropriate digit,' A,E,I,O,U,Y ("slash letters") to slashes' H,W, and everything else to zero-length string
Result = Left(Surname, 1)For Location = 2 To Len(Surname)Result = Result & Category(Mid(Surname, Location, 1))Next Location
' Remove double lettersLocation = 2Do While Location < Len(Result)If Mid(Result, Location, 1) = Mid(Result, Location + 1, 1) ThenResult = Left(Result, Location) & Mid(Result, Location + 2)ElseLocation = Location + 1End IfLoop
' If category of 1st letter equals 2nd character, remove 2nd characterIf Category(Left(Result, 1)) = Mid(Result, 2, 1) ThenResult = Left(Result, 1) & Mid(Result, 3)End If
' Remove slashesFor Location = 2 To Len(Result)If Mid(Result, Location, 1) = "/" ThenResult = Left(Result, Location - 1) & Mid(Result, Location + 1)End IfNext
' Trim or pad with zeroes as necessarySelect Case Len(Result)Case 4SOUNDEX = ResultCase Is < 4SOUNDEX = Result & String(4 - Len(Result), "0")Case Is > 4SOUNDEX = Left(Result, 4)End SelectEnd IfEnd Function
Private Function Category(c) As String' Returns a Soundex code for a letterSelect Case TrueCase c Like "[AEIOUY]"Category = "/"Case c Like "[BPFV]"Category = "1"Case c Like "[CSKGJQXZ]"Category = "2"Case c Like "[DT]"Category = "3"Case c = "L"Category = "4"Case c Like "[MN]"Category = "5"Case c = "R"Category = "6"Case Else 'This includes H and W, spaces, punctuation, etc.Category = ""End SelectEnd Function