EcoTerm IV
NBII/EIONET Demo of Federated KOS Search
Mike Frame
Vienna, Austria
April 2007
Discussion Topics
• Project Background
• NBII Thesaurus
• GEMET Thesaurus
• Prototype Client
• Sample Query Results
  • Including no, 1, or both thesauri
• Overall Findings
Biocomplexity Thesaurus
http://thesaurus.nbii.gov
EIONET GEMET Thesaurus
http://www.eionet.europa.eu/gemet/webservices?langcode=en
NBII/EIONET Thesaurus Web-service
• Background – collaboration through Ecoinformatics TWG
• Primary Goal – access distributed multi-lingual thesauri
• Results – SKOS web-service & client
Latest Client & Service capabilities
• Access to both NBII and GEMET
• Single language capability
• Results are provided by source
• All documentation is completed
http://thesaurus.nbii.gov
Demo Client
Initial Challenges Identified
Thesaurus scope, intent, purpose, and coverage differ
• NBII = sub-discipline of environment
  • Endangered species
  • Broader Terms: Species, Special status species, Taxa
• EIONET = broad environment
  • Broader Terms: environmental protection
Current State
Users
• Most aren’t aware of the underlying vocabulary
Vocabularies are often unique to an organization and used more for “categorization” than retrieval
Goal
• Include all vocabularies and let the search engine handle results
Demonstration Search Retrieval
Created demonstration datasets
• NBII Cataloged Resources
  • ~30,000 web sites, publications, images, maps, etc.
  • XML structured data – controlled subject
• NBII FGDC Metadata
  • ~22,000 resources on research studies
  • 150-200 elements
  • Semi-structured with no controlled vocabulary
NBII Catalog Records
• Based on the Dublin Core + 18 elements, of which 10 are mandatory
• In place since 2002
• Used by distributed content managers
NBII Metadata CH
Process
Added thesaurus capabilities to Development Search Engine for:
• NBII Thesaurus
• EIONET GEMET Thesaurus
• Used BT, RT, NT relationships & weighting
Performed sample queries within the test repositories for:
• No thesaurus
• GEMET only aided searching
• NBII only aided searching
• GEMET+NBII aided searching (X)
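The thesaurus-aided search described above can be pictured as weighted query expansion: the user's term is looked up in each enabled thesaurus, and its broader (BT), narrower (NT), and related (RT) terms are added to the query with a lower weight. The sketch below is only an illustration of that idea, not the search engine's actual implementation. The `Concept` class, the `expand_query` function, and the relationship weights are hypothetical, and the sample entries are illustrative data loosely based on terms shown elsewhere in these slides.

```python
from dataclasses import dataclass, field

# Hypothetical weights per relationship type; the slides do not state
# the values the development search engine actually used.
RELATION_WEIGHTS = {"BT": 0.2, "NT": 0.8, "RT": 0.5}


@dataclass
class Concept:
    label: str
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)


def expand_query(term, thesauri):
    """Return {search term: weight}; the original term keeps weight 1.0."""
    key = term.lower()
    expanded = {key: 1.0}
    for thesaurus in thesauri:
        concept = thesaurus.get(key)
        if concept is None:
            continue  # this thesaurus does not know the term
        groups = (("BT", concept.broader),
                  ("NT", concept.narrower),
                  ("RT", concept.related))
        for relation, terms in groups:
            weight = RELATION_WEIGHTS[relation]
            for other in terms:
                other_key = other.lower()
                # Keep the highest weight if a term is reached more than once.
                expanded[other_key] = max(expanded.get(other_key, 0.0), weight)
    return expanded


# Illustrative sample entries (assumed, not exported from either thesaurus).
nbii = {"endangered species": Concept(
    "endangered species",
    broader=["Species", "Special status species", "Taxa"])}
gemet = {"endangered species": Concept(
    "endangered species",
    broader=["environmental protection"],
    narrower=["endangered animal species", "endangered plant species"])}

no_thesaurus = expand_query("Endangered species", [])
both = expand_query("Endangered species", [nbii, gemet])
```

Passing an empty list corresponds to the "No thesaurus" case, a single thesaurus to the "only aided" cases, and both thesauri to the combined case.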
Test Repository 1
NBII Resource Catalog (Dublin Core)
No Thesauri – “invasive species”
NBII Thesaurus – “invasive species”
GEMET Thesaurus – “invasive species”
No Thesauri – “Endangered Species”
NBII Thesaurus – “endangered species”
GEMET Only – “endangered species”
No Thesaurus – “rare species”
NBII Thesaurus – “rare species”
GEMET Thesaurus – “rare species”
GEMET Thesaurus – “rare species” (expanded degrees of relevance)
No Thesauri – “protected species”
NBII Thesaurus – “protected species”
GEMET Thesaurus – “protected species”
Results – NBII Catalog Resources
term                          None     NBII     GEMET
“invasive species”            2487     10802    2487
“endangered species”          1612     3532     1619
“rare species”                249      7186     290
“rare species” (expanded)                       5847
“protected species”           203      2345     1664
Results – NBII Resource Catalog
[Bar chart: number of results per query term (“invasive species”, “endangered species”, “rare species”, “protected species”) for None, NBII, and GEMET; vertical axis 0 to 12,000]
Test Repository 2
NBII FGDC Metadata
Sample Queries – No vocabularies – Metadata CH – “invasive species”
Sample Queries – NBII only – Metadata CH – “invasive species”
Sample Queries – GEMET only – Metadata CH – “invasive species”
Sample Queries – No vocabularies – Metadata CH – “endangered species”
Sample Queries – NBII only – Metadata CH – “endangered species”
Sample Queries – GEMET only – Metadata CH – “endangered species”
No Thesauri – Metadata CH – “rare species”
NBII Thesaurus – Metadata CH – “rare species”
GEMET Thesaurus – Metadata CH – “rare species”
Sample Queries – No vocabularies – Metadata CH – “protected species”
Sample Queries – NBII only – Metadata CH – “protected species”
Sample Queries – GEMET only – Metadata CH – “protected species”
Results – FGDC Metadata
term                          None     NBII     GEMET
“invasive species”            302      7884     302
“endangered species”          1008     2690     1019
“rare species”                59       4259     64
“protected species”           11       2152     1011
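To make the effect of each thesaurus easier to compare across terms, the result counts from the two repositories can be expressed as a multiple of the "None" baseline. The short sketch below does that arithmetic with the counts reported in the results tables. The `gain` helper and the layout of `RESULTS` are illustrative choices, and the “rare species” (expanded) run is left out because only one count is reported for it.

```python
# Result counts as reported on the slides: (None, NBII, GEMET).
RESULTS = {
    "NBII Resource Catalog": {
        "invasive species": (2487, 10802, 2487),
        "endangered species": (1612, 3532, 1619),
        "rare species": (249, 7186, 290),
        "protected species": (203, 2345, 1664),
    },
    "FGDC Metadata": {
        "invasive species": (302, 7884, 302),
        "endangered species": (1008, 2690, 1019),
        "rare species": (59, 4259, 64),
        "protected species": (11, 2152, 1011),
    },
}


def gain(baseline, aided):
    """How many times more results the aided search returned."""
    return aided / baseline


for repository, rows in RESULTS.items():
    print(repository)
    for term, (none, nbii, gemet) in rows.items():
        print(f"  {term:<20} NBII x{gain(none, nbii):.1f}"
              f"  GEMET x{gain(none, gemet):.1f}")
```

A gain of 1.0 means the thesaurus added nothing for that term, which is the case for GEMET on “invasive species” in both repositories.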
Results – FGDC Metadata
[Bar chart: number of results per query term (“invasive species”, “endangered species”, “rare species”, “protected species”) for None, NBII, and GEMET; vertical axis 0 to 8,000]
Overall Results – General Findings
• The assumption that a thesaurus improves the “number” of results is valid
  • Degree does vary by the term and mappings
• Since users search from a number of perspectives, backgrounds, and levels of expertise, multiple thesauri do improve the number of results
Overall Results – Using only GEMET Terminology
• Terms not included in the NBII thesaurus that were in GEMET improved search results
• GEMET’s strength of broad coverage aided searches
• In general, for the Metadata repository
  • Results varied somewhat, but the top 10 results were often the same
Overall Results – General Findings
• “No thesaurus” tests produced poorer #1 results
• Thesaurus-aided searches changed the ordering of the results list more for the structured set than for the unstructured set (Metadata)
Issues
“Integrating” thesauri of differing scope and purpose presents challenges:
• Can’t turn the effort into a thesaurus project
• Degrees of relevance of terms are an issue
• Concept matching or different intent
• Differing classification (RT vs. NT) across thesauri
• Differing “weighting” algorithms
Further Study Options
1.) Take multiple thesauri “as is”
2.) Do some “attempted” concept matching, e.g. “endangered animal species” – “endangered animal”
3.) If no match is present, add term and relationship as is
4.) Obtain terms from XMDR
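Options 2 and 3 above could be prototyped along the following lines. This is a rough sketch under stated assumptions, not a method from the study. The matching rule (two labels match when one is a word-level prefix of the other, as with “endangered animal” and “endangered animal species”) is a hypothetical stand-in for whatever "attempted" matching is eventually chosen. The dictionary layout and the sample entries are also invented for illustration.

```python
def normalize(label):
    """Lower-case a label and split it into words."""
    return tuple(label.lower().split())


def is_match(a, b):
    """Hypothetical rule: equal labels, or one is a word prefix of the other."""
    shorter, longer = sorted((normalize(a), normalize(b)), key=len)
    return len(shorter) > 0 and longer[:len(shorter)] == shorter


def merge(base, other):
    """Merge `other` into a copy of `base`.

    Both arguments map a label to {"BT": [...], "NT": [...], "RT": [...]}.
    """
    merged = {label: {rel: list(terms) for rel, terms in rels.items()}
              for label, rels in base.items()}
    for label, rels in other.items():
        target = next((known for known in merged if is_match(known, label)), None)
        if target is None:
            # Option 3: no match, so add the term and its relationships as is.
            merged[label] = {rel: list(terms) for rel, terms in rels.items()}
            continue
        # Option 2: matched concept, so fold the relationships together.
        for rel, terms in rels.items():
            bucket = merged[target].setdefault(rel, [])
            for term in terms:
                if term not in bucket:
                    bucket.append(term)
    return merged


# Invented sample entries, for illustration only.
thesaurus_a = {"endangered animal": {"BT": ["endangered species"]}}
thesaurus_b = {
    "endangered animal species": {"BT": ["category of endangered species"]},
    "vanished species": {"RT": ["extinct species"]},
}
combined = merge(thesaurus_a, thesaurus_b)
```

A rule this crude will also pair terms whose intent differs, which is the concept-matching issue raised on the Issues slide, so any real run would need the matches reviewed.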
Further Study Options – cont.
• Follow up with additional repositories
• Repeat with other query terms
• Re-look at weighting algorithms
• Do queries with subset of terms
• Repeat with completely integrated thesaurus, as compared to repeating queries with machine integration
• Complete by June
Questions, Comments,
GEMET Control file
endangered species,category of endangered species[.2],endangered animal species[0.8],endangered plant species[0.8]
protected species,category of endangered species[0.2],endangered species [0.2]
rare species,category of endangered species[0.2],extinct species[0.2],vanished species[0.2]
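Each line of the control file above appears to hold a query term followed by comma-separated expansion terms, each with a weight in square brackets. A minimal parser for that layout might look like the sketch below. It assumes that terms never contain commas and that every expansion carries a bracketed weight. The function name `parse_control_line` is made up for this example.

```python
import re

# An expansion entry: a term, optional whitespace, then a weight such as
# [0.8] or [.2] in square brackets.
ENTRY = re.compile(r"^(?P<term>.*?)\s*\[(?P<weight>\d*\.?\d+)\]$")


def parse_control_line(line):
    """Split one control-file line into (query term, {expansion: weight})."""
    head, *entries = [part.strip() for part in line.split(",")]
    expansions = {}
    for entry in entries:
        match = ENTRY.match(entry)
        if match is None:
            raise ValueError(f"cannot parse expansion entry: {entry!r}")
        expansions[match.group("term")] = float(match.group("weight"))
    return head, expansions


CONTROL_FILE = """\
endangered species,category of endangered species[.2],endangered animal species[0.8],endangered plant species[0.8]
protected species,category of endangered species[0.2],endangered species [0.2]
rare species,category of endangered species[0.2],extinct species[0.2],vanished species[0.2]
"""

control = dict(parse_control_line(line) for line in CONTROL_FILE.splitlines())
```

The resulting `control` dictionary maps each query term to its weighted expansions, tolerating both the `[.2]` form and the stray space before `[0.2]` that occur in the file.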