2003 Exchange
PROGRESS
BP1110: Close Enough: Indexed Record Retrieval in Progress Using Sound-alikes and Near Matches
Steve Southwell ([email protected])
Senior Consultant
BravePoint, Inc.
CopyLeft 2003
BP1110: Close Enough
Simplify your business
Steve Southwell
Employee of BravePoint
Consultant specializing in Progress web-enablement
Business systems analyst
Dallas, Texas based
Just my day job until I get my record contract. Yeah, Baby!!!
The Problem - User Perspective
Users expect intuitive text searches.
Google and other consumer-oriented web sites have raised the bar.
Find what I'm looking for - not what I typed.
It's not my problem if I'm a bad speller.
Oh yeah... Put the most interesting results at the top of the list.
The Problem – User Perspective
Users do not know “contains” syntax.
More users know about quotes and the use of “and” or “or”.
Scope of this Talk
Various tools for making searches work better
General Techniques
Examples
Specific code
Technical Analysis
Disclaimers!
There is no “one-size-fits-all”.
You may trade performance for results.
Some techniques are incompatible with each other.
It all depends on the nature of the data.
Disclaimers!
This talk is more about theory and methods.
Your mileage may vary.
Batteries not included.
Do not remove this tag under penalty of law.
Questions?
• Feel free to ask questions anytime.
Types of Searches Where Close Counts
Product Searches
Target Smart Searching Example
User Can't Spell!
Amazon Smart Searching Example
User Can't Spell!
Types of Searches Where Close Counts
Product Searches
Searches for Proper Names
Yellow Pages Smart Searching
User Can't Spell!
Google Smart Searching Example
User Can't Spell!
Types of Searches Where Close Counts
Product Searches
Searches for Proper Names
Full-text Searches
AltaVista Smart Searching Example
User Can't Spell!
The Problem – Developer Perspective
Internal users need quick results. Time is money.
If customers want to buy, I'll help them find it.
If they can't spell it, we still sell it.
A widget by any other name... It's still for sale.
List the good stuff first.
Technical Issues
How can Progress store what a word sounds like?
How do I search for sound-alikes or similar words?
How can I rank search results?
Determining What a Word Sounds Like
Soundex
Used to index US census records from 1880 onward
Intended to index surnames
Codes only the starting letter and 3 sounds
Had to be simple enough to do by hand.
1 = B, P, F, V
2 = C, S, K, G, J, Q, X, Z
3 = D, T
4 = L
5 = M, N
6 = R
Soundex Examples
Last Name: Southwell
Soundex: S340
First letter = S
Next consonant = T = 3
H & W not represented.
Next consonant = L = 4
Next L is a double – skip
Pad with 0
1 = B, P, F, V
2 = C, S, K, G, J, Q, X, Z
3 = D, T
4 = L
5 = M, N
6 = R
Other S340 Names:
Seidl, Steele, Staley, Stahl, Stahley, Seidel, Settle, Shadle, Shotwell, Shuttle, Sidwell, Southall, Stall, Steel, Steely, Stell, Still, Stoll, Stowell, Stull, Sudlow, Suttle
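The hand-coding steps above can be sketched in a few lines. This is a minimal illustration in Python (chosen only so it runs anywhere, not because the talk uses it); the names `soundex` and `CODES` are made up for this sketch. It treats H and W like vowels and never compares the first letter's own code with the letter that follows, which is simpler than stricter Soundex variants.

```python
# Digit for each letter A..Z; "0" marks letters that are not coded
# (vowels plus H, W, and Y in this simplified variant).
CODES = "01230120022455012623010202"


def soundex(name):
    """Return a 4-character code: first letter plus up to 3 digits."""
    name = name.upper()
    if not name:
        return ""
    code = name[0]
    last = ""  # digit of the previous letter, used to skip doubles
    for ch in name[1:]:
        if "A" <= ch <= "Z":
            digit = CODES[ord(ch) - ord("A")]
            if digit != last and digit != "0":
                code += digit
                if len(code) > 3:
                    break
            last = digit
    return (code + "000")[:4]  # pad short codes with zeros


print(soundex("Southwell"))
print(soundex("Shotwell"))
```

Both calls print the same code, which is the point: names that sound alike collapse to one indexable value.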
src/samples/soundex.p

DEFINE INPUT PARAMETER name AS CHARACTER NO-UNDO.
DEFINE OUTPUT PARAMETER code AS CHARACTER NO-UNDO.

DEFINE VARIABLE e AS INTEGER NO-UNDO.
DEFINE VARIABLE i AS INTEGER NO-UNDO.
DEFINE VARIABLE k AS CHARACTER NO-UNDO.
DEFINE VARIABLE l AS CHARACTER NO-UNDO.

ASSIGN l = ""
       name = CAPS(name)
       code = SUBSTRING(name,1,1).
DO i = 2 TO LENGTH(name):
  e = ASC(SUBSTRING(name,i,1)) - 64.
  IF e >= 1 AND e <= 26 THEN DO:
    k = SUBSTRING("01230120022455012623010202",e,1).
    IF k <> l AND k <> "0" THEN code = code + k.
    IF LENGTH(code) > 3 THEN LEAVE.
  END.
  l = k.
END.
code = SUBSTRING(code + "000",1,4).
RETURN.
Soundey
More sound codes
Indexes vowel positions
Codes the entire word
Makes phonetic substitutions
0 = aehiouwy
1 = bp
2 = ckqx
3 = dt
4 = l
5 = mn
6 = r
7 = fv
8 = gj
9 = sz
Soundey – Continued
Soundeylib.i available free at www.FreeFrameWork.org
More sophisticated than Soundex
Steps in Soundey Conversion
Pre-token
Mark word boundaries
“Anywhere” translations
“Ends” translations
“Begins” translations
Eliminate silent E
Unmark word boundaries
Translate characters to digits
Eliminate double digits
Soundey Example
Word: Telephone
Soundey: 3040705
Replace 'ph' with 'f': telefone
Eliminate silent 'e' on the end: telefon
Translate characters to digits:
T = 3, E = 0, L = 4, E = 0, F = 7, O = 0, N = 5
3040705
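The Telephone walk-through can be reproduced with a small sketch. This is not Soundeylib.i: it is a Python illustration covering only the steps shown in the example (one "anywhere" translation, a silent final 'e', the digit table, and doubled digits). The function name `soundey_sketch` and the rule that a final 'e' is only dropped from words longer than three letters are assumptions made for this sketch.

```python
# Digit table from the slides: each group of letters shares one sound code.
GROUPS = {
    "0": "aehiouwy", "1": "bp", "2": "ckqx", "3": "dt", "4": "l",
    "5": "mn", "6": "r", "7": "fv", "8": "gj", "9": "sz",
}
DIGIT = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundey_sketch(word):
    """Encode one word following the steps of the Telephone example."""
    word = word.lower()
    word = word.replace("ph", "f")  # "anywhere" translation
    # Assumed silent-e rule: drop a final 'e' from words over 3 letters.
    if len(word) > 3 and word.endswith("e"):
        word = word[:-1]
    digits = [DIGIT[ch] for ch in word if ch in DIGIT]
    collapsed = []
    for d in digits:  # eliminate doubled digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    return "".join(collapsed)


print(soundey_sketch("Telephone"))
print(soundey_sketch("telefon"))
```

Both spellings produce the same digit string, so a misspelled search term can still hit the stored code.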
Soundey – Disadvantages
Not as good as Metaphone
Causes problems when the search target or search string can itself contain digits, because the codes are made of digits too.
Metaphone
Published in 1990 by Lawrence Philips
Reduces alphabet to 16 consonant sounds
0 = th sound
b = b
x = ch, sh sounds
s = s, some c
k = k, some c, g
j = j, some g
t = t, d
f = f, v
h = h*
l = l
m = m
n = n
p = p
r = r
w = w*
y = y*
* mostly silent
Metaphone – Continued
Less fuzzy than Soundex or Soundey
Uses many English spelling heuristics to convert odd spellings to correct sounds.
Progress version available at http://www.freeframework.org/downloads/new/wordnet/
Not a strict standard
Have a look at metaphonerules.d
Technical Issues
How can Progress determine what a word sounds like?
How do I search for sound-alikes or similar words?
How can I rank search results?
Storing the Sound-like Value
Add field(s) to your target table – one or two per target field
For example: If searching against Item.ItemName
Add Item.MetaphoneCode.
Add Item.MetaphoneFragments.
Both word-indexed.
Metaphone and Fragments
You can use triggers to keep your fragment list up-to-date.
WordChop() fragments single words.
SuperWordChop() does sentences.
Searching for “ball*” would now find both baseball and balloon.
Storing fragments in metaphone allows for fuzzy partial matches!
Fragments?
Standard Progress word-indexing only matches against the beginning of words.
Contains “*ball*” is a syntax error.
How would you match “ball” with “baseball”?
Fragment field contains this:
Baseball aseball seball eball ball
Don't store fragments under 4 characters.
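The fragment idea can be sketched as follows. This is a Python illustration, not the WordChop() and SuperWordChop() functions from the talk's source code; `word_chop` and `super_word_chop` are hypothetical stand-ins written from the description on this slide. The sketch assumes that words shorter than the minimum length produce no fragments at all.

```python
def word_chop(word, min_len=4):
    """Return every trailing fragment of a word that has at least min_len characters."""
    word = word.lower()
    return [word[i:] for i in range(len(word) - min_len + 1)]


def super_word_chop(text, min_len=4):
    """Chop each word of a sentence and join all fragments with spaces."""
    fragments = []
    for word in text.split():
        fragments.extend(word_chop(word, min_len))
    return " ".join(fragments)


fragments = word_chop("Baseball")
print(fragments)
# A begins-with match on any fragment now behaves like a "contains" match.
print(any(f.startswith("ball") for f in fragments))
```

Because every fragment starts at a different position in the word, a word index that only matches beginnings can still find text in the middle or at the end of a word.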
Populating Metaphone Fields in DB
{lib/metaphone.i}
...
FOR EACH ITEM EXCLUSIVE-LOCK:
  ASSIGN ITEM.MetaphoneCode = toMetaphone(ITEM.ItemName + " " + ITEM.CatDescription).
  Item.MetaphoneFragments = superWordChop(Item.MetaphoneCode).
END.
...
Using Metaphone in 4gl Queries
mySearch = toMetaphone(mySearch).
FOR EACH ITEM WHERE ITEM.MetaphoneCode CONTAINS mySearch NO-LOCK:
  ...
END.
Metaphone Use in 4gl
Demo of Sports2000 item search with Soundey: itemsearch.w
General Metaphone Query Tips
Try regular “contains” search first.
Convert search string to Metaphone code, and do “contains” search on MetaphoneCode field.
Try Split and Rejoin
Other alternatives:
Synonym and Related word searches
Neural Networks with User Feedback
Forced Ranking
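The first two tips amount to a tiered search: try the literal words, and fall back to sound codes only when nothing matches. The Python sketch below illustrates that flow over an in-memory list. Everything in it is hypothetical: `tiered_search` is a name invented here, and `crude_code` is a deliberately crude stand-in encoder, not Metaphone. In a real application the encoder would be the same function used to populate the stored code field.

```python
def crude_code(word):
    """Stand-in phonetic encoder (NOT Metaphone): keep the first letter,
    drop later vowels, and collapse repeated letters."""
    word = word.lower()
    kept = word[:1] + "".join(ch for ch in word[1:] if ch not in "aeiou")
    out = []
    for ch in kept:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def tiered_search(query, items, encode=crude_code):
    """Return exact word matches if any exist, otherwise sound-alike matches."""
    words = query.lower().split()
    # Tier 1: plain word match, like a regular "contains" search.
    exact = [it for it in items if all(w in it.lower().split() for w in words)]
    if exact:
        return exact
    # Tier 2: compare sound codes instead of spellings.
    codes = [encode(w) for w in words]
    return [it for it in items
            if all(c in {encode(w) for w in it.split()} for c in codes)]


items = ["Baseball Glove", "Tennis Ball", "Golf Bag"]
print(tiered_search("ball", items))     # found by the plain word match
print(tiered_search("basball", items))  # misspelled, found by sound code
```

Running the cheap exact search first keeps correctly spelled queries precise, and the fuzzier pass only widens the net when the first one comes back empty.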
Metaphone Extensibility
Can make it replace known words or fragments:
Anywhere
Beginning of words
Ending of words
GUI demonstration – FunctionTester.w
Other Search Issues
Unexpected Boolean operators. “and” is default
Users want to use the words “and” and “or”
Use booleanConvert() on the query string.
Hyphens / Compound Words
dehyphenize()
Word Synonyms
See thesaurus.i and itemsearch.w for examples
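The slides name booleanConvert() without showing it. One plausible reading is that it rewrites the words users type into operator characters before the string reaches a CONTAINS search. The Python sketch below illustrates that idea only; `boolean_convert` is a hypothetical name, and the choice of `&` for AND and `|` for OR is an assumption to check against the Progress documentation for your version. It does not handle quoted phrases or a query that starts or ends with an operator word.

```python
# Assumed operator characters; verify against your CONTAINS syntax.
OPERATORS = {"and": "&", "or": "|"}


def boolean_convert(query):
    """Replace the standalone words 'and'/'or' with operator characters."""
    tokens = query.split()
    return " ".join(OPERATORS.get(tok.lower(), tok) for tok in tokens)


print(boolean_convert("bat AND ball or glove"))
```

Only whole tokens are replaced, so words that merely contain "and" or "or" (such as "band" or "fork") pass through unchanged.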
Other Search Issues
Numbers and Ordinals
29 Palms / Twentynine Palms
5th Inning / Fifth Inning
Abbreviations / Slang
Ft. Worth, TX / Fort Worth, Texas
Technical Issues
How can Progress store what a word sounds like?
How do I search for sound-alikes or similar words?
How can I rank search results?
Ranking Search Results
Not an exact science
Can use many criteria:
Number of word matches
Similarity to key words
“Preferred” results – upsells, recent additions, etc.
Requires use of temp-table for results.
All results must be analyzed, so keep set small. (MAX-ROWS?)
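A rough sketch of how such criteria might combine: score each candidate, hold the scores in a scratch list (playing the role of the temp-table), sort, and cap the output. This Python illustration is not the logic in itemsearch.w; the function name `rank_results`, the 0.5 boost for preferred items, and the default limit are all assumptions chosen for the example.

```python
def rank_results(query, candidates, preferred=(), limit=10):
    """Order candidates by word matches, nudging preferred items upward."""
    words = set(query.lower().split())
    scored = []  # scratch list standing in for a temp-table of results
    for name in candidates:
        matches = len(words & set(name.lower().split()))
        boost = 0.5 if name in preferred else 0  # assumed weighting
        scored.append((matches + boost, name))
    # Highest score first; ties fall back to alphabetical order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


candidates = ["Tennis Ball", "Golf Ball Bag", "Golf Bag"]
print(rank_results("golf ball", candidates, preferred=["Golf Bag"]))
```

Every candidate is scored before sorting, which is why the slide advises keeping the result set small.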
Search Ranking Demonstration
Itemsearch.w
Technical Issues
How can Progress store what a word sounds like?
How do I search for sound-alikes or similar words?
How can I rank search results?
Source Code Availability
All source code used in this presentation can be found at the FreeFrameWork website: http://www.freeframework.org
Up-to-date copy of this presentation available with the source code at the FreeFrameWork site.