Efficient Fuzzy Search Enabled Hash Map
4th International Workshop On Soft Computing Applications SOFA2010 – Arad, ROMANIA
Vasile Topac, PhD Student
Department of Information Technology and Computer Science, “Politehnica” University of Timisoara
Email: [email protected]
How it all started
SOFA2010 – Arad, ROMANIA - Efficient Fuzzy Search Enabled Hash Map - Vasile Topac
Java HashMap
- widely used Java data structure
- stores (key, value) pairs
- search by key
- very fast
- a hash function generates a hash code used for indexing
- uses the equals method to compare keys
- only values for existing keys can be retrieved
Java HashMap
phone book example
Java HashMap
Collision
Java HashMap
Search for “Lisa Smith”
hashMap.get("Lisa Smith");
Result: “521-8976”
Problem
- only values for existing keys can be retrieved
Search for “Lissa Smith”
hashMap.get("Lissa Smith");
Result: null
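The hit and the miss can be reproduced with a plain `java.util.HashMap`. This is a minimal sketch: the class name `PhoneBookDemo` is a placeholder, and only the “Lisa Smith” entry comes from the slides.

```java
import java.util.HashMap;
import java.util.Map;

class PhoneBookDemo {
    // Builds a phone book holding the single entry shown on the slides.
    static Map<String, String> phoneBook() {
        Map<String, String> book = new HashMap<>();
        book.put("Lisa Smith", "521-8976");
        return book;
    }

    public static void main(String[] args) {
        Map<String, String> book = phoneBook();
        // Exact key: hashCode() and equals() both match, so the value is found.
        System.out.println(book.get("Lisa Smith"));
        // Misspelled key: no stored key is equal to it, so get returns null.
        System.out.println(book.get("Lissa Smith"));
    }
}
```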
Problem
- search for “Lissa Smith”
Brute force solution:
- iterate through the set of entries and search for approximate matches
- works, but is time-expensive
Fuzzy data structures: currently available for databases
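A minimal sketch of the brute-force approach, assuming Levenshtein distance as the approximate-match test (the slide does not name a metric at this point). The class `BruteForceSearch` and the method `findApproximate` are illustrative names, not part of any library.

```java
import java.util.Map;

class BruteForceSearch {
    // Dynamic-programming Levenshtein distance that keeps only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Compares the query with every key, so each lookup is linear in the map size.
    static <V> V findApproximate(Map<String, V> map, String query, int maxDistance) {
        V best = null;
        int bestDistance = maxDistance + 1;
        for (Map.Entry<String, V> e : map.entrySet()) {
            int d = levenshtein(query, e.getKey());
            if (d < bestDistance) { bestDistance = d; best = e.getValue(); }
        }
        return best; // null when no key lies within maxDistance
    }
}
```

With a phone book containing “Lisa Smith”, `findApproximate(book, "Lissa Smith", 2)` finds the entry, but only after a distance computation against every stored key.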
Fuzzy Hash Map
“ Soft computing (SC) is a collection of methodologies that are trying to cope with the main disadvantage of the conventional (hard) computing: the poor performances when working in uncertain conditions. ”
Fuzzy Hash Map
UML Class Diagram
Fuzzy Hash Map
FuzzyKey overridden methods
- hashCode() - prehashing - create collisions to cluster data
  - substring: substring(“Fuzzy Search”, 0, 4) = “Fuzz”
  - soundex: soundex(“Fuzzy Search”) = F226
- equals(Object o) - string metrics
  - Levenshtein Distance: LD(computing, computation) = 4
  - Hamming Distance: HD(computing, computers) = 3
How it works
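The two overridden methods might look roughly as follows. This is a sketch under stated assumptions, not the library's FuzzyKey: the class name `PrefixFuzzyKey`, the fixed prefix length of 4 and the fixed threshold of 2 are all chosen here for illustration.

```java
// Hypothetical key wrapper for illustration; not the library's actual FuzzyKey.
final class PrefixFuzzyKey {
    private static final int PREFIX_LENGTH = 4; // assumed: substring(0, 4) pre-hashing
    private static final int MAX_DISTANCE = 2;  // assumed: fixed distance threshold

    final String text;

    PrefixFuzzyKey(String text) { this.text = text; }

    // Pre-hashing: only the prefix feeds the hash code, so keys that share
    // their first four characters collide into the same bucket on purpose.
    @Override
    public int hashCode() {
        return text.substring(0, Math.min(PREFIX_LENGTH, text.length())).hashCode();
    }

    // Fuzzy equality by string metric. Caveat: this relation is not transitive
    // and can call two keys equal although their hash codes differ, so it
    // deliberately bends the general equals/hashCode contract.
    @Override
    public boolean equals(Object o) {
        return o instanceof PrefixFuzzyKey
                && levenshtein(text, ((PrefixFuzzyKey) o).text) <= MAX_DISTANCE;
    }

    // Levenshtein distance: insertions, deletions and substitutions.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Hamming distance: number of differing positions in equal-length strings.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal lengths");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }
}
```

Used as the key type of an ordinary `java.util.HashMap`, a lookup first lands in the bucket of the shared prefix and then accepts the first stored key within the threshold. Because the hash depends on the prefix alone, a typo inside the first four characters sends the query to a different bucket.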
Fuzzy Hash Map
Example (law terminology dictionary)
- hashCode() - prehashing
  - substring 4
- equals(Object o) - Levenshtein Distance
[Figure: keys pass through the pre-hashing function SUBSTRING (0, 4) and then the hash function into buckets]
- action → acti → “A civil judicial proceeding ...”
- adjudication → adju → “A decision or sentence imposed by a judge...”
- evidence → evid → “Testimony, documents or objects ...”
- violence → viol → “The expression of physical or verbal ...”
- violation → viol → “An offense for which the only sentence ...”
- bucket labels shown: 12, 13, 14, 215
Fuzzy Hash Map
“the judge has the option of either adjudicating you as guilty or..”
fuzzyHashMap.get("adjudicating") = null
fuzzyHashMap.getFuzzy("adjudicating", 2) = "a decision or sentence imposed by a judge…"
- hashCode(): substring 4 = “adju”
- equals(Object o): LD(adjudicating, adjudication) = 2
Fuzzy Hash Map
fuzzyHashMap.getFuzzy("violent") = "violence"
[Figure: the law terminology keys pre-hashed with SUBSTRING (0, 4); the query “violent” maps to viol, the bucket that holds violence and violation]
LD(violent, violence) = 2
LD(violent, violation) = 5
“violence” is returned
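The bucket lookup and the choice of the closest key can be sketched as below. `PrefixBucketMap` and `closestKey` are hypothetical names, and storing buckets as an explicit map from prefix to entries is an assumption made for readability. The real FuzzyHashMap may be organised differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of prefix bucketing; not the library's implementation.
class PrefixBucketMap {
    // pre-hashed prefix -> (original key -> value)
    private final Map<String, Map<String, String>> buckets = new HashMap<>();

    private static String prefix(String key) {
        return key.substring(0, Math.min(4, key.length()));
    }

    void put(String key, String value) {
        buckets.computeIfAbsent(prefix(key), p -> new HashMap<>()).put(key, value);
    }

    // Returns the stored key closest to the query within maxDistance, or null.
    // Only the keys in the query's own bucket are compared.
    String closestKey(String query, int maxDistance) {
        Map<String, String> bucket = buckets.get(prefix(query));
        if (bucket == null) return null;
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String candidate : bucket.keySet()) {
            int d = levenshtein(query, candidate);
            if (d < bestDistance) { bestDistance = d; best = candidate; }
        }
        return best;
    }

    // Returns the value stored under the closest key, or null if none qualifies.
    String getFuzzy(String query, int maxDistance) {
        String key = closestKey(query, maxDistance);
        return key == null ? null : buckets.get(prefix(key)).get(key);
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For the query “violent” only the two keys in the viol bucket are compared. “violence” (distance 2) is within a threshold of 2 and “violation” (distance 5) is not, so “violence” wins.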
Fuzzy Hash Map
Example (phone book)
- hashCode() - prehashing
  - soundex
- equals(Object o) - Levenshtein Distance
[Figure: names pass through the SOUNDEX pre-hashing function and then the hash function into buckets]
- Mary → M600
- Paul → P400
- Scott → S300
- Jhon, John → J500 (same bucket)
- bucket labels shown: 12, 13, 14, 215
- stored values shown: 312050505, 732124789, 025465892, 361475236, 712696969
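A sketch of Soundex used as the pre-hashing function, assuming the standard American Soundex rules (the slides do not say which variant the library uses). The class `SoundexPrehash` and the method `group` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SoundexPrehash {
    // Soundex digit for each letter a..z; '0' marks vowels and other uncoded letters.
    private static final String CODES = "01230120022455012623010202";

    // Standard American Soundex: first letter plus three digits, zero-padded.
    // Characters outside A-Z (spaces, digits) are skipped.
    static String soundex(String s) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : s.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') continue;
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                 // keep the first letter itself
            } else if (code != '0' && code != last) {
                out.append(code);              // new consonant sound
            }
            if (c != 'H' && c != 'W') last = code; // H and W do not separate equal codes
            if (out.length() == 4) break;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    // Groups names by Soundex code, mimicking the buckets after pre-hashing.
    static Map<String, List<String>> group(List<String> names) {
        Map<String, List<String>> buckets = new TreeMap<>();
        for (String name : names) {
            buckets.computeIfAbsent(soundex(name), k -> new ArrayList<>()).add(name);
        }
        return buckets;
    }
}
```

Grouping the five names from the slide puts “Jhon” and “John” together under J500. The misspelling therefore reaches the bucket of the intended name, and the string metric in equals only has to compare within that bucket.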
Results
Accuracy Test
Test conditions
- Substring(0,4) hashing function
- Levenshtein Distance fuzzy matching algorithm
- distance threshold value 2
- medical terminology dictionary populated with 1030 English medical terms
Test results
- Parse text from American Family Physicians Journal
  - text of 568 words
  - 43 words identified as medical terms
  - 9 were incorrect matches
  - 80% accuracy
- Parse text from eMedicine web site
  - text of 2730 words
  - 260 were recognized
  - 7 were incorrect matches
  - 97% accuracy
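The reported percentages appear consistent with accuracy taken as correct matches over identified terms. That definition is an assumption, since the slides do not state the formula. Under it, the two tests work out as:

```latex
\[
\text{accuracy} = \frac{\text{identified} - \text{incorrect}}{\text{identified}}
\]
\[
\frac{43 - 9}{43} = \frac{34}{43} \approx 0.79,
\qquad
\frac{260 - 7}{260} = \frac{253}{260} \approx 0.97
\]
```

The first value is about 79%, close to the 80% given on the slide. The second matches the reported 97%.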
Results
Speed Test
- Exact matches only
[Chart: HashMap vs. FuzzyHashMap; horizontal axis 0 to 1000, vertical axis 0 to 6000; data labels 4, 5, 5, 6 and 5419, 4013, 2300, 1; axis names and units not shown]
Results
Speed Test
- Fuzzy matches only
[Chart: HashMap vs. FuzzyHashMap; horizontal axis 0 to 1000, vertical axis 0 to 7000; data labels 4, 5, 6, 7 and 5419, 5739, 5711, 5401; axis names and units not shown]
Results
Speed Test
- Exact & fuzzy matches
[Chart: HashMap vs. FuzzyHashMap; horizontal axis 0 to 1000, vertical axis 0 to 6000; data labels 4, 5, 5, 6 and 5419, 4744, 4135, 3143; axis names and units not shown]
Conclusion
- The FuzzyHashMap data structure showed very good performance when working with uncertain data
- Flexible (can choose different pre-hashing functions and string metrics)
- available as open source http://fuzzyhashmap.sourceforge.net/
- community can extend the functionality
- Future work:
  - add more string metrics
  - improve performance
  - implement Fuzzy TreeMap
Thank you!
Sources at: http://fuzzyhashmap.sourceforge.net