objective: researchers need access to data, regardless of the language used in the metadata. our...

1
Objective: Researchers need access to data, regardless of the language used in the metadata. Our objective is to facilitate discovery of ILTER data regardless of the languages used duiring their creation Abstract: Data in ILTER data archives are only useful if it can be found by researchers. At the international-level the challenge of searching datasets is compounded by the need to deal with multiple languages. A series of workshops in China explored the challenges of creating an information management system for the International LTER. During a 2008 workshop at Lake Taihu, participants recommended that each ILTER region host a Metacat to house network metadata as Ecological Metadata Language documents. Since that time, Metacat-based metadata catalogs have been established in Taiwan, Japan, Spain, Brazil, and Malaysia. A second workshop in Shanghai, China in 2012 explored the options for using a multilingual controlled vocabulary that would allow researchers to discover ILTER data on an international scale. The latter workshop led to development of prototype search tools that incorporate both translation and search enrichment services. The search enrichment services allow automated search on more specific terms (i.e., “narrower terms”). Thus a search on “forest ecosystems” would include datasets whose metadata included "boreal forests", "clearcuts", "forests", "old-growth forests" and "old growth” as well. Adding a translation layer adds additional search terms ("Bosque", “ Foresta", "Forst", "Forêt", "Las", "Metsä", "Skog", "Wald", " 森森 ", and " 森森 "). The prototype tools use EnvThes (an existing multilingual thesaurus that already fully incorporates the U.S. LTER controlled vocabulary) as their thesaurus, but are web- service based, allowing them to be incorporated into a wide array of customized searching applications. Prototype tools can be seen at: http://vocab.lternet.edu/ILTER Scientists seeking data should be able to efficiently and reliably locate ILTER datasets through searching… Goals: Identify solutions that will permit data discovery by a linguistically diverse set of researchers Identify a list of preferred terms and create a controlled vocabulary that could be used by sites in creating metadata documents Focus on ILTER-wide searches Want to facilitate cross-site synthesis on an international basis Translate preferred terms to multiple languages to create a multilingual controlled vocabulary Enable locating datasets through searching Enhance existing data discovery tools to make them work better through the organization of terms in ways that facilitate linkages among them Solutions: US LTER adopted a 634 term controlled vocabulary in 2010. It has been incorporated in to EnvThes, the EnvEurope project’s multilingual thesaurus (Figure 2) Providing a rich context for terms by exploiting interrelationships with other terms helps overcome some of the ambiguity introduced through translation Products: The ILTER data discovery prototype interface is accessible at: http :// vocab.lternet.edu/ILTER. More about the EnvThes thesaurus is available here: http :// www.lter-europe.net/info_manage/EnvThes The US LTER Controlled Vocabulary Working Group has developed a number of applications and web services to support the use of the LTER Controlled Vocabulary. This page provides a listing of resources: http://im.lternet.edu/vocab_resources Copies of code etc. are available on the LTER SVN site ( http://svn.lternet.edu ) in the "vocab" tree. Focus of U.S. Controlled Vocabulary WG and EnvThes Multilingual and Thesaurus-Based Search Tools for International Long-Term Ecological Research Data Kristin Vanderbilt 1 , Nicolas Bertrand 2 , David Blankman 3 , Xuebing Guo 4 , Don Henshaw 5 , Honglin He 4 , Karpjoo Jeong 6 , Eun-Shik Kim 7 , Chau-Chin Lin 8 , Sheng-Shan Lu 8 , Margaret O’Brien 9 , Éamonn Ó Tuama 10 , Takeshi Osawa 11 , John Porter 12 , Wen Su 4 and other participants of the 2008 and 2012 ILTER workshops. 1 Sevilleta LTER, University of New Mexico, Albuquerque, NM, USA; 2 CEH Lancaster, Lancaster Environment Centre, Lancaster, United Kingdom; 3 Israel LTER, Jerusalem, Israel; 4 Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China; 5 H.J. Andrews LTER, U.S. Forest Service Pacific Northwest Research Station, Corvallis, OR, USA; 6 Department of Advanced Technology Fusion, Konkuk University, Seoul, Korea; 7 Department of Forestry, Environment, and Systems, College of Forest Science, Kookmin University, Seoul, Korea; 8 Taiwan Forestry Research Institute, Taipei, Taiwan; 9 Santa Barbara Coast LTER, Marine Science Institute, University of California at Santa Barbara, Santa Barbara, CA, USA; 10 Global Biodiversity Information Facility Secretariat, Copenhagen, Denmark; 11 National Institute for Agro-Environmental Sciences, Tsukuba, Japan; 12 Virginia Coast Reserve LTER, Department of Environmental Sciences, University of Virginia, Charlottesville, VA, USA Figure 1. Terms selected from more than one natural language vary in the extent to which they represent the same concepts. These variations can be seen as forming a continuum that ranges from exact equivalence in meaning through a) inexact equivalence, b) partial equivalence, c) single to many equivalence, to d) non-equivalence. Language and Search System Selection Search Form with Autocomplete Preferred Term Lists Translator and Search Enhancer Multilingual Thesaurus Metacat Data Catalog List of Datasets Figure 3. Prototype system for multilingual searching of ecological data. User selects the language they wish to use for search input and the Metacat to be used in the searching. Based on that selection an appropriate autocompletion list can be used to help guide the searcher towards terms in the thesaurus. The thesaurus can then be used to select additional terms such as synonyms or narrower terms, prior to preparing a search. The enhanced search “pathQuery” can then be sent to the Metacat, previously selected, after being translated into the language appropriate for a particular metacat. Select Languages for input and searching Groups of terms can be selected based on their relationship the input term and searched Select a Metacat to search Results for a search of the TFRI Metacat using Chinese Terms Figure 4. Sequence of web pages used to conduct a multilingual search for data. The search process can be simplified by adding “default” choices. Autocomplete helps guide choices to words in the Thesaurus Table 1. Inexact equivalence of US English and Japanese ‘wetlands’ concepts Prototype Multilingual Data Discovery System: A system for searching multilingual metacats has been developed A user selects a term from the controlled vocabulary in their language of choice, and the query will be translated via the multilingual thesaurus (Figure 3) to find datasets documented in other languages (Figure 4) Figure 2. Lexical resources such as taxonomys, thesauri and ontologies help provide linkages between concepts. These linkages can be exploited to interconnect concepts in different languages. The EnvThes thesaurus fully incorporates the U.S. LTER Thesaurus (http://vocab.lternet.edu), along with other resources. It also has translations of many of the terms into 13 languages, including French, German, Chinese , Japanese, Italian, Finnish, Polish, Portuguese, Swedish and English. Acknowledgements: The two workshops held to support development of the ILTER Multilingual Information Management System were held at Lake Taihu Field Station and Eastern China Normal University in Shanghai with the generous support of the Chinese Ecological Research Network (CERN) and the National Ecosystem Research Network of China (CNERN). The U.S. National Science Foundation supported travel of U.S. Challenges: • Often several terms are used for one concept in a single network, making it necessary to search on multiple equivalent terms One site uses CO2 another Carbon Dioxide, another Carbon-dioxide Carbon to Nitrogen Ratio, C:N, C:N Ratio, Carbon-to- nitrogen Ratio Searching on a broader term does not return narrower terms Searching on “Landscape Change” doesn’t find data sets related to “desertification” even though desertification is a kind of landscape change • The meaning of a term in one language may fall along the continuum of exact equivalence to non-equivalence in another language (Figure 1 and Table 1)

Upload: frederick-lucas

Post on 27-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Objective: Researchers need access to data, regardless of the language used in the metadata. Our objective is to facilitate discovery of ILTER data regardless

Objective: Researchers need access to data, regardless of the language used in the metadata. Our objective is to facilitate discovery of ILTER data regardless of the languages used duiring their creation

Abstract: Data in ILTER data archives are only useful if it can be found by researchers. At the international-level the challenge of searching datasets is compounded by the need to deal with multiple languages. A series of workshops in China explored the challenges of creating an information management system for the International LTER. During a 2008 workshop at Lake Taihu, participants recommended that each ILTER region host a Metacat to house network metadata as Ecological Metadata Language documents. Since that time, Metacat-based metadata catalogs have been established in Taiwan, Japan, Spain, Brazil, and Malaysia. A second workshop in Shanghai, China in 2012 explored the options for using a multilingual controlled vocabulary that would allow researchers to discover ILTER data on an international scale. The latter workshop led to development of prototype search tools that incorporate both translation and search enrichment services. The search enrichment services allow automated search on more specific terms (i.e., “narrower terms”). Thus a search on “forest ecosystems” would include datasets whose metadata included "boreal forests", "clearcuts", "forests", "old-growth forests" and "old growth” as well. Adding a translation layer adds additional search terms ("Bosque", “ Foresta", "Forst", "Forêt", "Las", "Metsä", "Skog", "Wald", "森林 ", and "皆伐 "). The prototype tools use EnvThes (an existing multilingual thesaurus that already fully incorporates the U.S. LTER controlled vocabulary) as their thesaurus, but are web-service based, allowing them to be incorporated into a wide array of customized searching applications. Prototype tools can be seen at: http://vocab.lternet.edu/ILTER

Scientists seeking data should be able to efficiently and reliably locate ILTER datasets through searching…

Goals:• Identify solutions that will permit data discovery by a linguistically diverse set of researchers• Identify a list of preferred terms and create a controlled vocabulary that could be used by sites in

creating metadata documents• Focus on ILTER-wide searches • Want to facilitate cross-site synthesis on an international basis

• Translate preferred terms to multiple languages to create a multilingual controlled vocabulary• Enable locating datasets through searching• Enhance existing data discovery tools to make them work better through the organization of terms in

ways that facilitate linkages among them

Solutions:• US LTER adopted a 634 term controlled vocabulary in 2010.• It has been incorporated in to EnvThes, the EnvEurope project’s multilingual thesaurus (Figure 2)• Providing a rich context for terms by exploiting interrelationships with other terms helps overcome some of the ambiguity

introduced through translation

Products:• The ILTER data discovery prototype interface is accessible at: http://vocab.lternet.edu/ILTER. • More about the EnvThes thesaurus is available here:

• http://www.lter-europe.net/info_manage/EnvThes• The US LTER Controlled Vocabulary Working Group has developed a number of applications and web

services to support the use of the LTER Controlled Vocabulary. This page provides a listing of resources: • http://im.lternet.edu/vocab_resources

• Copies of code etc. are available on the LTER SVN site (http://svn.lternet.edu) in the "vocab" tree.

Focus of U.S. Controlled Vocabulary WG and EnvThes

Multilingual and Thesaurus-Based Search Tools for International Long-Term Ecological Research Data

Kristin Vanderbilt1, Nicolas Bertrand2, David Blankman3, Xuebing Guo4, Don Henshaw5, Honglin He4, Karpjoo Jeong6, Eun-Shik Kim7, Chau-Chin Lin8, Sheng-Shan Lu8, Margaret O’Brien9, Éamonn Ó Tuama10, Takeshi Osawa11, John Porter12, Wen Su4 and other participants of the 2008 and 2012 ILTER workshops. 1Sevilleta LTER, University of New Mexico, Albuquerque, NM, USA; 2CEH Lancaster, Lancaster Environment Centre, Lancaster, United Kingdom; 3Israel LTER, Jerusalem, Israel; 4Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China; 5H.J. Andrews LTER, U.S. Forest Service Pacific Northwest Research Station, Corvallis, OR, USA; 6Department of Advanced Technology Fusion, Konkuk University, Seoul, Korea; 7Department of Forestry, Environment, and Systems, College of Forest Science, Kookmin University, Seoul, Korea; 8Taiwan Forestry Research Institute, Taipei, Taiwan; 9Santa Barbara Coast LTER, Marine Science Institute, University of California at Santa Barbara, Santa Barbara, CA, USA; 10Global Biodiversity Information Facility Secretariat, Copenhagen, Denmark; 11National Institute for Agro-Environmental Sciences, Tsukuba, Japan; 12Virginia Coast Reserve LTER, Department of Environmental Sciences, University of Virginia, Charlottesville, VA, USA

Figure 1. Terms selected from more than one natural language vary in the extent to which they represent the same concepts. These variations can be seen as forming a continuum that ranges from exact equivalence in meaning through a) inexact equivalence, b) partial equivalence, c) single to many equivalence, to d) non-equivalence.

Language and Search System Selection Search Form with Autocomplete

Preferred Term Lists Translator and Search Enhancer

Multilingual Thesaurus

Metacat Data Catalog

List of Datasets

Figure 3. Prototype system for multilingual searching of ecological data. User selects the language they wish to use for search input and the Metacat to be used in the searching. Based on that selection an appropriate autocompletion list can be used to help guide the searcher towards terms in the thesaurus. The thesaurus can then be used to select additional terms such as synonyms or narrower terms, prior to preparing a search. The enhanced search “pathQuery” can then be sent to the Metacat, previously selected, after being translated into the language appropriate for a particular metacat.

Select Languages for input and searching

Groups of terms can be selected based on their relationship the input term and searched

Select a Metacat to search

Results for a search of the TFRI Metacat using Chinese Terms

Figure 4. Sequence of web pages used to conduct a multilingual search for data. The search process can be simplified by adding “default” choices.

Autocomplete helps guide choices to words in the Thesaurus

Table 1. Inexact equivalence of US English and Japanese ‘wetlands’ concepts

Prototype Multilingual Data Discovery System:• A system for searching multilingual metacats has been developed• A user selects a term from the controlled vocabulary in their language of choice, and the query will

be translated via the multilingual thesaurus (Figure 3) to find datasets documented in other languages (Figure 4)

Figure 2. Lexical resources such as taxonomys, thesauri and ontologies help provide linkages between concepts. These linkages can be exploited to interconnect concepts in different languages. The EnvThes thesaurus fully incorporates the U.S. LTER Thesaurus (http://vocab.lternet.edu), along with other resources. It also has translations of many of the terms into 13 languages, including French, German, Chinese , Japanese, Italian, Finnish, Polish, Portuguese, Swedish and English.

Acknowledgements: The two workshops held to support development of the ILTER Multilingual Information Management System were held at Lake Taihu Field Station and Eastern China Normal University in Shanghai with the generous support of the Chinese Ecological Research Network (CERN) and the National Ecosystem Research Network of China (CNERN). The U.S. National Science Foundation supported travel of U.S. participants.

Challenges: • Often several terms are used for one concept in a single

network, making it necessary to search on multiple equivalent terms

• One site uses CO2 another Carbon Dioxide, another Carbon-dioxide

• Carbon to Nitrogen Ratio, C:N, C:N Ratio, Carbon-to-nitrogen Ratio

• Searching on a broader term does not return narrower terms

• Searching on “Landscape Change” doesn’t find data sets related to “desertification” even though desertification is a kind of landscape change

• The meaning of a term in one language may fall along the continuum of exact equivalence to non-equivalence in another language (Figure 1 and Table 1)