
2003 Exchange

PROGRESS

BP1110: Close Enough Indexed Record Retrieval In Progress Using Sound-alikes and Near Matches

Steve Southwell ([email protected])

Senior Consultant

BravePoint, Inc.

CopyLeft 2003

BP1110: Close Enough

Simplify your business


Steve Southwell

Employee of BravePoint

Consultant specializing in Progress web-enablement

Business systems analyst

Dallas, Texas based


Just my day job until I get my record contract. Yeah, Baby!!!


The Problem - User Perspective


Users expect intuitive text searches.

Google and other consumer-oriented web sites have raised the bar.

Find what I'm looking for – not what I typed.

It's not my problem if I'm a bad speller

Oh yeah... Put the most interesting results at the top of the list.


Users do not know “contains” syntax.

More users know about quotes and the use of “and” or “or”.


Scope of this Talk

Various tools for making searches work better

General Techniques

Examples

Specific code

Technical Analysis


Disclaimers!

There is no “one-size-fits-all”.

You may trade performance for results.

Some techniques are incompatible with each other.

It all depends on the nature of the data.


This talk is more about theory and methods.

Your mileage may vary.

Batteries not included.

Do not remove this tag under penalty of law.


Questions?

• Feel free to ask questions anytime.


Types of Searches Where Close Counts

Product Searches


Target Smart Searching Example

User Can't Spell!


Amazon Smart Searching Example

User Can't Spell!


Types of Searches Where Close Counts

Product Searches

Searches for Proper Names


Yellow Pages Smart Searching

User Can't Spell!


Google Smart Searching Example

User Can't Spell!


Types of Searches Where Close Counts

Product Searches

Searches for Proper Names

Full-text Searches


AltaVista Smart Searching Example

User Can't Spell!


The Problem – Developer Perspective

Internal users need quick results. Time is money.

If customers want to buy, I'll help them find it.

If they can't spell it, we still sell it.

A widget by any other name... It's still for sale.

List the good stuff first.


Technical Issues

How can Progress store what a word sounds like?

How do I search for sound-alikes or similar words?

How can I rank search results?


Determining What a Word Sounds Like

Soundex

Used to index US census records going back to the 1880 census

Intended to index surnames

Only codes starting letter and 3 sounds

Had to be simple enough to do by hand.

1 = B, P, F, V
2 = C, S, K, G, J, Q, X, Z
3 = D, T
4 = L
5 = M, N
6 = R


Soundex Examples

Last Name: Southwell

Soundex: S340

First letter = S

Next consonant = T = 3

H & W not represented.

Next consonant = L = 4

Next L is a double – skip

Pad with 0


Other S340 names: Seidl, Steele, Staley, Stahl, Stahley, Seidel, Settle, Shadle, Shotwell, Shuttle, Sidwell, Southall, Stall, Steel, Steely, Stell, Still, Stoll, Stowell, Stull, Sudlow, Suttle
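To experiment with the walk-through above outside of Progress, here is a minimal Python sketch of the same simplified Soundex rules (first letter kept, later letters mapped to digits, repeats and uncoded letters skipped, result padded to four characters). It is a port for illustration only, not the presenter's code; the names `CODES` and `soundex` are chosen here. Like the slide's walk-through, it does not apply the refinements found in stricter Soundex variants.

```python
# Digit for each letter A..Z; "0" marks letters that are not coded
# (vowels plus H, W and Y).
CODES = "01230120022455012623010202"


def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus up to three digits, zero-padded."""
    name = name.upper()
    code = name[:1]          # keep the first character as-is
    prev = ""                # digit of the previous letter
    for ch in name[1:]:
        if "A" <= ch <= "Z":
            digit = CODES[ord(ch) - ord("A")]
            # Skip uncoded letters and immediate repeats of the same digit.
            if digit != prev and digit != "0":
                code += digit
            if len(code) > 3:
                break
            prev = digit
    return (code + "000")[:4]


for surname in ["Southwell", "Steele", "Shotwell", "Sudlow"]:
    print(surname, soundex(surname))
```

Running it on names from the list above shows how many distinct spellings collapse onto one code, which is both the strength and the weakness of Soundex.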


src/samples/soundex.p

    /* Returns the 4-character Soundex code for a name. */
    DEFINE INPUT  PARAMETER name AS CHARACTER NO-UNDO.
    DEFINE OUTPUT PARAMETER code AS CHARACTER NO-UNDO.

    DEFINE VARIABLE e AS INTEGER   NO-UNDO. /* letter position, A = 1 */
    DEFINE VARIABLE i AS INTEGER   NO-UNDO. /* index into name */
    DEFINE VARIABLE k AS CHARACTER NO-UNDO. /* digit for current letter */
    DEFINE VARIABLE l AS CHARACTER NO-UNDO. /* digit for previous letter */

    ASSIGN
        l    = ""
        name = CAPS(name)
        code = SUBSTRING(name,1,1).

    DO i = 2 TO LENGTH(name):
        e = ASC(SUBSTRING(name,i,1)) - 64.
        IF e >= 1 AND e <= 26 THEN DO:
            /* Look up the digit; "0" means the letter is not coded. */
            k = SUBSTRING("01230120022455012623010202",e,1).
            IF k <> l AND k <> "0" THEN code = code + k.
            IF LENGTH(code) > 3 THEN LEAVE.
        END.
        l = k.
    END.

    /* Pad with zeros to exactly four characters. */
    code = SUBSTRING(code + "000",1,4).
    RETURN.


Soundey

More sound codes

Indexes vowel positions

Codes the entire word

Makes phonetic substitutions

0 = a e h i o u w y
1 = b p
2 = c k q x
3 = d t
4 = l
5 = m n
6 = r
7 = f v
8 = g j
9 = s z


Soundey – Continued

Soundeylib.i available free at www.FreeFrameWork.org

More sophisticated than Soundex


Steps in Soundey Conversion

Pre-token

Mark word boundaries

“Anywhere” translations

“Ends” translations

“Begins” translations

Eliminate silent E

Unmark word boundaries

Translate characters to digits

Eliminate double digits


Soundey Example

Word: Telephone

Soundey: 3040705

Replace 'ph' with 'f': telefone

Eliminate silent 'e' on the end: telefon

Translate characters to digits:

T = 3, E = 0, L=4, E=0, F=7, O=0, N=5:

3040705
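The worked example can be reproduced with a small Python sketch. This is a toy, not soundeylib.i: it hard-codes only the one 'ph' to 'f' translation and a crude "drop a trailing e" rule, both assumptions made here so that the Telephone example runs end to end. The digit table is the one from the Soundey slide; the function name `soundey_sketch` is hypothetical.

```python
# Digit table from the Soundey slide: digit -> letters that map to it.
GROUPS = {
    "0": "aehiouwy", "1": "bp", "2": "ckqx", "3": "dt", "4": "l",
    "5": "mn", "6": "r", "7": "fv", "8": "gj", "9": "sz",
}
DIGIT = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundey_sketch(word: str) -> str:
    """Toy Soundey: one substitution, silent-e removal, digits, de-duplication."""
    w = word.lower()
    w = w.replace("ph", "f")            # the only "anywhere" translation used here
    if len(w) > 2 and w.endswith("e"):
        w = w[:-1]                      # crude silent-e rule: drop a trailing 'e'
    out = []
    for ch in w:
        if ch not in DIGIT:
            continue                    # ignore anything that is not a-z
        d = DIGIT[ch]
        if not out or out[-1] != d:     # eliminate doubled digits
            out.append(d)
    return "".join(out)


print(soundey_sketch("Telephone"))
```

Because vowels are kept as 0 and the whole word is coded, two words must agree along their full length to match, which is why Soundey is less fuzzy than Soundex.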


Soundey – Disadvantages

Not as good as Metaphone

Presents problems when there are digits possible in the search target or search string.


Metaphone

Published in 1990 by Lawrence Philips

Reduces alphabet to 16 consonant sounds

0 = th sound
b = b
x = ch, sh sounds
s = s, some c
k = k, some c, g
j = j, some g
t = t, d
f = f, v
h = h*
l = l
m = m
n = n
p = p
r = r
w = w*
y = y*

* mostly silent


Metaphone – Continued

Less fuzzy than Soundex or Soundey

Uses many English spelling heuristics to convert odd spellings to correct sounds.

Progress version available at http://www.freeframework.org/downloads/new/wordnet/

Not a strict standard

Have a look at metaphonerules.d


Technical Issues

How can Progress determine what a word sounds like?

How do I search for sound-alikes or similar words?

How can I rank search results?


Storing the Sound-like Value

Add field(s) to your target table – one or two per target field

For example: If searching against Item.ItemName

Add Item.MetaphoneCode.

Add Item.MetaphoneFragments.

Both word-indexed.


Metaphone and Fragments

You can use triggers to keep your fragment list up-to-date.

WordChop() fragments single words.

SuperWordChop() does sentences.

Searching for “ball*” would now find both baseball and balloon.

Storing fragments in metaphone allows for fuzzy partial matches!


Fragments?

Standard Progress word-indexing only matches against the beginning of words.

Contains “*ball*” is a syntax error

How would you match “ball” with “baseball”?

Fragment field contains this:

Baseball aseball seball eball ball

Don't store fragments under 4 characters.
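The fragment idea can be sketched in a few lines of Python. This is not the library's WordChop() or SuperWordChop(); it only reproduces the behaviour described on the slide (every trailing fragment of at least four characters), and the names `word_chop`, `super_word_chop` and `min_len` are chosen here for illustration.

```python
def word_chop(word: str, min_len: int = 4) -> list[str]:
    """Return every trailing fragment of `word` with at least `min_len` characters."""
    word = word.lower()
    # range() is empty when the word is shorter than min_len, so short
    # words produce no fragments at all in this sketch.
    return [word[i:] for i in range(len(word) - min_len + 1)]


def super_word_chop(text: str, min_len: int = 4) -> str:
    """Chop each word of a sentence and join all fragments with spaces."""
    fragments = []
    for w in text.split():
        fragments.extend(word_chop(w, min_len))
    return " ".join(fragments)


print(word_chop("baseball"))
```

Since a word index matches on the beginning of each stored word, a begins-with search for "ball" against this fragment list now reaches "baseball" through its "ball" fragment. The cost is a larger index, which is one reason to skip very short fragments.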


Populating Metaphone Fields in DB

    {lib/metaphone.i}
    ...
    FOR EACH ITEM EXCLUSIVE-LOCK:
        ASSIGN ITEM.MetaphoneCode = toMetaphone(ITEM.ItemName + " " + ITEM.CatDescription).

        Item.MetaphoneFragments = superWordChop(Item.MetaphoneCode).
    END.
    ...


Using Metaphone in 4GL Queries

MySearch = toMetaphone(MySearch).

FOR EACH ITEM WHERE ITEM.MetaphoneCode CONTAINS mySearch NO-LOCK:

...

END.


Metaphone Use in 4GL

Demo of Sports2000 item search with Soundey: itemsearch.w


General Metaphone Query Tips

Try regular “contains” search first.

Convert search string to Metaphone code, and do “contains” search on MetaphoneCode field.

Try Split and Rejoin

Other alternatives:

Synonym and Related word searches

Neural Networks with User Feedback

Forced Ranking
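The first two tips amount to a tiered search: try the literal words, and only fall back to phonetic codes when nothing matches. A minimal Python sketch of that control flow follows. Everything in it is illustrative: `tiered_search`, `toy_code` and the in-memory `catalog` are invented here, and `toy_code` is a deliberately crude stand-in, not Metaphone.

```python
from typing import Callable


def tiered_search(query: str, items: list[dict],
                  phonetic: Callable[[str], str]) -> list[dict]:
    """Return literal word matches if any exist, otherwise phonetic matches."""
    words = query.lower().split()

    # Tier 1: every query word appears literally in the item name.
    exact = [it for it in items
             if all(w in it["name"].lower().split() for w in words)]
    if exact:
        return exact

    # Tier 2: compare stored phonetic codes instead of spellings.
    codes = [phonetic(w) for w in words]
    return [it for it in items
            if all(c in it["code"].split() for c in codes)]


def toy_code(word: str) -> str:
    """Stand-in phonetic key (NOT Metaphone): ph -> f, drop vowels after letter 1."""
    w = word.lower().replace("ph", "f")
    return w[:1] + "".join(c for c in w[1:] if c not in "aeiou")


# Each item stores its name plus a precomputed code per word,
# playing the role of the MetaphoneCode field.
catalog = [{"name": n, "code": " ".join(toy_code(w) for w in n.split())}
           for n in ["Telephone Cord", "Tennis Racket", "Golf Glove"]]

print([it["name"] for it in tiered_search("telefone", catalog, toy_code)])
```

Running the cheap literal search first keeps correctly spelled queries fast and precise; the fuzzier tier only runs when the user would otherwise see an empty result.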


Metaphone Extensibility

Can make it replace known words or fragments:

Anywhere

Beginning of words

Ending of words

GUI demonstration – FunctionTester.w


Other Search Issues

Unexpected Boolean operators. “and” is default

Users want to use the words “and” and “or”

Use booleanConvert() on the query string.

Hyphens / Compound Words: dehyphenize()

Word Synonyms: see thesaurus.i and itemsearch.w for examples
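As a rough idea of what a helper like booleanConvert() might do, the Python sketch below rewrites the plain words "and" and "or" as operator symbols. It is a guess at the behaviour, not the library's implementation, and it assumes that the word-index query string uses & for AND and | for OR; check the CONTAINS documentation for your Progress version before relying on that.

```python
OPERATORS = {"and": "&", "or": "|"}


def boolean_convert(query: str) -> str:
    """Rewrite the words 'and' / 'or' as the operator symbols '&' / '|'."""
    out = [OPERATORS.get(tok.lower(), tok) for tok in query.split()]

    # A dangling operator at either end would be a syntax error, so trim it.
    symbols = set(OPERATORS.values())
    while out and out[0] in symbols:
        out.pop(0)
    while out and out[-1] in symbols:
        out.pop()
    return " ".join(out)


print(boolean_convert("bat and ball or glove"))
```

Converting up front also keeps "and" and "or" from being searched as ordinary words, which is rarely what the user meant.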


Numbers and Ordinals

29 Palms / Twentynine Palms

5th Inning / Fifth Inning

Abbreviations / Slang

Ft. Worth, TX / Fort Worth, Texas
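One simple way to make such pairs match is to normalize both the stored text and the search string through the same lookup table before indexing or searching. The Python sketch below assumes a tiny hand-built table; `EXPANSIONS` and `normalize` are hypothetical names, and a real table would come from your own data.

```python
import re

# Hypothetical lookup table covering only the examples above.
EXPANSIONS = {
    "ft": "fort",
    "tx": "texas",
    "5th": "fifth",
    "29": "twentynine",
}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations and numbers."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(EXPANSIONS.get(t, t) for t in tokens)


print(normalize("Ft. Worth, TX"))
print(normalize("Fort Worth, Texas"))
```

A blind table like this can over-expand (for example "ft" meaning feet), so it suits narrow domains better than general text.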


Ranking Search Results

Not an exact science

Can use many criteria:

Number of word matches

Similarity to key words

“Preferred” results – upsells, recent additions, etc.

Requires use of temp-table for results.

All results must be analyzed, so keep set small. (MAX-ROWS?)
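The ranking step can be pictured as scoring every row of the result set and then sorting it, with a list of dictionaries standing in for the temp-table. The Python sketch below is illustrative only: the weights, the `preferred` flag and the name `rank_results` are assumptions made here, not values from itemsearch.w.

```python
def rank_results(query: str, rows: list[dict]) -> list[dict]:
    """Score each row, then return the rows sorted best-first."""
    words = set(query.lower().split())
    scored = []
    for row in rows:
        name_words = set(row["name"].lower().split())
        score = len(words & name_words)      # number of word matches
        if row.get("preferred"):
            score += 0.5                     # small boost for upsells, new items
        scored.append({**row, "score": score})
    # sorted() is stable, so ties keep their original order.
    return sorted(scored, key=lambda r: r["score"], reverse=True)


rows = [
    {"name": "Golf Ball"},
    {"name": "Base Ball Glove", "preferred": True},
    {"name": "Ball Glove Oil"},
]
for r in rank_results("ball glove", rows):
    print(r["score"], r["name"])
```

Every row has to be scored before the first one can be shown, which is why the slide recommends keeping the candidate set small.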


Search Ranking Demonstration

Itemsearch.w


Source Code Availability

All source code used in this presentation can be found at the FreeFrameWork website: http://www.freeframework.org

Up-to-date copy of this presentation available with the source code at the FreeFrameWork site.


All questions answered...

Stump the Chump