
BABEL Final Report

June 30, 2020

Program for Cooperative Cataloging (PCC), Standing Committee on Standards (SCS), BIBFRAME and MARC Bibliographic Encoding for Languages (BABEL) Task Group

Contents

Introduction
    Vocabulary Recommendation
Types of Language Information in Cataloging Records
Current and Reviewed Standards
    Standards & Linked Data Framework
        MARC language codes
        ISO 639-2
        ISO 639-3
        ISO 639-5
        IETF BCP-47
        Glottolog
        VIVO language ontology
        Wikidata
    Advantages of ISO 639-3
Implementing ISO 639-3
    Mapping MARC Language Codes to ISO 639-3
        Matching codes
        Complexities with matching codes
        Unmatched codes
        Codes only in ISO 639-3
    General Issues for Implementing ISO 639-3
        Lack of cross-references
        Number of language codes
    Implementing ISO 639-3 for Languages Associated with the Resource in MARC Records
    Implementing ISO 639-3 in BIBFRAME Records
    Implementing BCP47 to tag literals in BIBFRAME
Steps toward Implementation
Appendix
    MARC Documentation Related to Non-MARC Language Schemes
        Field 041 examples
        Language Code and Term Source Codes
    PCC policies and documentation

Introduction

The BIBFRAME and MARC Bibliographic Encoding for Languages (BABEL) Task Group members include Charlene Morrison (OCLC, chair), Kelley McGrath (SCS), TJ Kao (SCA), Elaine Kim (LC), and Robert Rendall. The BABEL Task Group was charged [1] on March 1, 2020 to make recommendations on language vocabularies for use in both BIBFRAME and MARC standards while taking into consideration the various language communities and their needs.

Per the charge laid out, this final report will make a recommendation on a vocabulary to represent languages, scripts, and transliteration schemes in PCC cataloging, whether performed in the BIBFRAME or MARC environments.

To accomplish its charge, the Task Group identified the following as stakeholders in the MARC, language, and linked data communities:

● Charles Riley, African language expert
● Larisa Walsh, Cyrillic language expert
● ALA ALCTS CaMMS Committee on Cataloging: Asian and African Materials [2]
● LD4P2 Non-Latin Script Materials Affinity Group [3]
● CORMOSEA (Committee on Research Materials on Southeast Asia), Southeast Asian language experts [4]
● CEAL (Council of East Asian Libraries), East Asian language expert [5]
● MELA Committee on Cataloging - Middle Eastern languages [6]
● VIVO Ontology Group / Violeta Ilik [7]
● PCC Linked Data Advisory Committee (LDAC)
● ILS Vendors

The Task Group would like to acknowledge the assistance of the following stakeholders in writing the report:

● Charles Riley, African language expert
● Andrew Cunningham, language coding systems expert
● Larisa Walsh, Cyrillic language expert
● PCC Linked Data Advisory Committee (LDAC)

Vocabulary Recommendation

After identifying and reviewing the various language coding standards, the Task Group recommends exploring and testing the use of ISO 639-3 codes in certain environments in MARC cataloging to support the cataloging community as it transitions from MARC to BIBFRAME and a linked data environment. If it is determined that ISO 639-3 can be successfully implemented, we recommend that PCC require the use of 639-3 codes in PCC cataloging; however, we do not think it will be reasonable to forbid the use of other language codes (and specifically MARC language codes) in catalog records. It will also be challenging to convert all MARC language codes in existing records accurately to the correct and sometimes more specific ISO 639-3 codes (see discussion below). We will need to ensure that any language codes are clearly identified by the vocabulary to which they belong, and that local and shared systems will be able to display information derived from codes in different vocabularies to the extent desired and appropriate in a given context.

[1] https://docs.google.com/document/d/1VHxLJI7yLwpqSDN-ypxyBfVI8i7MH8UStoCGV7n4knY
[2] http://www.ala.org/alcts/mgrps/camms/cmtes/ats-ccscataa
[3] https://wiki.lyrasis.org/display/LD4P2/Non-Latin+Script+Materials+Affinity+Group
[4] https://www.cormosea.org/
[5] https://www.eastasianlib.org
[6] https://sites.google.com/site/melacataloging/home
[7] https://wiki.lyrasis.org/display/VIVO

For BIBFRAME cataloging, the Task Group also recommends exploring the following possibilities:

● expanding the use of ISO 639-3 to replace MARC language codes in other areas (for example, for language of cataloging)

● use of ISO 15924 for script of the resource

● use of BCP 47 to tag literals with language and script, and potentially romanization scheme

Types of Language Information in Cataloging Records

The Task Group discussed the following types of language information in cataloging records:

1. Language associated with the resource (e.g., language of the resource or of some component of the resource or the original language of the resource)

2. Language of description (i.e., the language of the catalog record)

3. Language of particular strings (e.g., title, summary)

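The table below lists, for each row, the entries under its three columns: language associated with a resource, language of cataloging description, and language of statements (literals) in description.

Types of information likely to be of interest
    Language associated with a resource: Language, script
    Language of cataloging description: Language, script
    Language of statements (literals) in description: Language, script, transliteration

MARC 21 Bibliographic [8]
    Language associated with a resource:
        008/35-37 Fixed-Length Data Elements / Language
        041 Language Code
        377‡a Language code
        775‡e Other Edition Entry / Language code
    Language of cataloging description:
        040‡b Cataloging Source / Language of cataloging
    Language of statements (literals) in description:
        242‡y Translation of Title by Cataloging Agency / Language code of translated title
        336‡2 Content Type / Source
        337‡2 Media Type / Source
        338‡2 Carrier Type / Source

MARC 21 Authority [9]
    Language associated with a resource:
        377‡a Associated Language / Language Code

MARC 21 Holdings [10]
    Language associated with a resource:
        008/22-24 Fixed-Length Data Elements / Language
    Language of cataloging description:
        040‡b Record Source / Language of cataloging
    Language of statements (literals) in description:
        337‡2 Media Type / Source
        338‡2 Carrier Type / Source

MARC 21 Classification [11]
    Language associated with a resource:
        084‡e Classification Scheme and Edition / Language code

MARC 21 Community Information [12]
    Language associated with a resource:
        008/12-14 Fixed-Length Data Elements / Language
        041 Language Code

BIBFRAME [13]
    Language associated with a resource: language, script
    Language of cataloging description: descriptionLanguage
    Language of statements (literals) in description: Any literal with language text

MARC 21 vocabulary recommendation
    Language associated with a resource:
        ISO 639-3 in 041 and 377
        MARC language codes for other fields listed
        No implementation of script codes
    Language of cataloging description: MARC language codes
    Language of statements (literals) in description: MARC language codes where currently applicable

BIBFRAME vocabulary recommendation
    Language associated with a resource:
        ISO 639-3 (languages)
        ISO 15924 (scripts)
    Language of cataloging description: ISO 639-3
    Language of statements (literals) in description: Subset of BCP 47

[8] https://www.loc.gov/marc/bibliographic/ and https://www.loc.gov/standards/sourcelist/index.html
[9] https://www.loc.gov/marc/authority/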

In MARC 21 records, the most practical and promising path to improving language-related metadata is by moving from using MARC language codes to ISO 639-3 codes for describing the language of the resource. This would greatly expand our ability to identify specific languages, to make distinctions important for aural language and to deal with certain types of language groupings. This recommendation is discussed in greater detail below. The Task Group recommends limiting this change to fields 041 and 377, which already include subfield $2 where a non-MARC language scheme can be identified as the source of the code. In BIBFRAME, we also recommend using ISO 639-3 for languages associated with the resource.
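As a minimal sketch of what coding from a non-MARC scheme in field 041 might look like, the Python function below assembles a display string with second indicator 7 and the source code in subfield $2. The function name, its parameters, and the display layout are illustrative assumptions rather than part of any system; the code yue (Cantonese) is an ISO 639-3 code not otherwise mentioned in this report.

```python
def build_041(codes, subfield="a", translation=False, source="iso639-3"):
    """Return a display string for a MARC 041 field coded from a non-MARC scheme."""
    if not codes:
        raise ValueError("at least one language code is required")
    # First indicator: 1 if the item is or includes a translation, otherwise 0.
    first_indicator = "1" if translation else "0"
    # Second indicator 7 signals that the code source is named in subfield $2.
    parts = [f"041 {first_indicator}7"]
    parts.extend(f"${subfield} {code}" for code in codes)
    parts.append(f"$2 {source}")
    return " ".join(parts)


print(build_041(["cmn", "yue"]))  # 041 07 $a cmn $a yue $2 iso639-3
```

Because subfield $2 applies to the whole field, a record that mixes MARC language codes and ISO 639-3 codes would presumably need separate 041 fields for each scheme.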

In MARC 21 records, the script of the resource is currently only recorded as free text in field 546 subfield $b. Although it would be beneficial to record the script of the resource using a standard vocabulary, it is not clear that there is a practical way to incorporate this information into MARC. Field 546 is defined as a note field. Field 041 has few subfields left and it is not clear how one would effectively link the language and the script. It could possibly be incorporated into field 377, but use of field 377 in bibliographic records seems to be designed for records describing expressions without manifestation information and lacks the granularity of field 041. In BIBFRAME, we recommend using ISO 15924 for scripts associated with the resource.

[10] https://www.loc.gov/marc/holdings/
[11] https://www.loc.gov/marc/classification/
[12] https://www.loc.gov/marc/community/
[13] http://id.loc.gov/ontologies/bibframe.html

In MARC 21 records, the Task Group does not recommend making any changes to coding the language of description (language of cataloging). The field 040 does not currently contain subfield $2, which would be necessary to identify language codes from a non-MARC scheme. It would be complicated to add subfield $2 to field 040 for this purpose. Since PCC libraries are unlikely to catalog in languages that would benefit from an expanded list of languages, there is not a compelling use case for making any change. In BIBFRAME, however, we recommend that ISO 639-3 be used when recording the language of the description as BIBFRAME does not suffer from these constraints. It is not clear to us whether there is a strong use case for recording the script of the description in BIBFRAME, but if it is wanted, there does not currently seem to be a place for it.

MARC 21 currently does not support identifying information about strings, such as language or script, with one exception. In bibliographic records, field 242 subfield $y can contain the MARC code for the language of a title that has been translated by the cataloging agency. This field is not widely used in PCC cataloging and is not likely to be used to provide translations into less common languages. It is difficult to envision a practical, cost-effective way to retrofit this capacity into MARC. Because BIBFRAME does support identifying information about strings, such as language or script, the Task Group recommends that PCC explore options for coding this information more fully in BIBFRAME using BCP 47.

Because we expect to continue to work at least partly in MARC for some time to come, the Task Group feels that implementation of ISO 639-3 should be explored in our current MARC environment, to the extent possible. Although postponing all change in practice related to language codes until we have fully transitioned to a BIBFRAME environment might seem easier, the need for more granular and precise encoding of languages has been felt in some communities for a long time, and the benefits that would result from a successful implementation in MARC would be immediate and greatly appreciated.

Current and Reviewed Standards

The Task Group reviewed the standards specified in the charge as well as several others. The charge identified three vocabularies: ISO 639-2, ISO 639-3, and IETF BCP 47. In addition to these three, the Task Group looked into ISO 639-5 and Glottolog, as well as two possible linked data options, aside from the LC Linked Data Service, for supporting ISO 639-3: the VIVO language ontology and Wikidata.

Standards & Linked Data Framework

MARC language codes

MARC language codes [14] are currently used by libraries in the MARC and BIBFRAME cataloging environments. There are 516 codes, which include 31 discontinued codes. The codes correspond to ISO 639-2. Where there are both bibliographic and terminology codes for a single language in ISO 639-2, the MARC language codes use the bibliographic codes. With labels only in English (which in some cases differ slightly from ISO 639-2’s English labels), the focus of the MARC code set is on languages most commonly found in library collections, and the vocabulary groups languages without their own code under collective codes. One example is that many Baltic languages are grouped under the code “bat” instead of having their own code. This set of codes does not include a "macrolanguage" concept, and the treatment of languages considered to be macrolanguages in ISO 639-3 varies; sometimes MARC has only a code for the macrolanguage (Chinese), sometimes only for the individual languages (Serbian, Croatian, etc.) and sometimes both (Norwegian and its variants). This vocabulary provides coding for what it considers the major languages of the world and groups other languages together to cover everything else.

[14] https://www.loc.gov/marc/languages/langhome.html

ISO 639-2

ISO 639-2 [15] includes 547 codes, some of which are duplicate codes that allow the use of different codes for the same language for bibliographic and terminology purposes. This vocabulary was based on the MARC code list and published in 1998. With labels in English, French, and German, this set also focuses on languages most commonly found in library collections, with collective codes used to cover less common languages. As with MARC language codes, one example of a collective code is “bat”, the code representing Baltic languages, although only the German label clarifies that this code is for other Baltic languages not covered by individual codes. Like the MARC codes, this set of codes does not include a "macrolanguage" concept, and the treatment of languages considered to be macrolanguages in ISO 639-3 varies; sometimes ISO 639-2 has only a code for the macrolanguage (Chinese), sometimes only for the individual languages (Serbian, Croatian, etc.) and sometimes both (Norwegian and its variants). Also like the MARC language codes, this vocabulary provides coding for what are considered major world languages and groups other languages together to cover everything else.

ISO 639-3

ISO 639-3 [16] contains 7,868 codes, including all of the individual language codes already accounted for in ISO 639-2. It also provides codes for macrolanguages (Serbo-Croatian) or individual languages within macrolanguages (Mandarin Chinese) not included in ISO 639-2. In addition, it includes codes derived from Ethnologue and LinguistList covering thousands of other living languages as well as extinct, ancient, historic, and constructed languages. With labels for these codes in English, it maps macrolanguage codes to individual language codes. However, it does not include any codes for collections of related languages, so any new language not already included would need to be added. While related to the MARC language codes and the ISO 639-2 codes, this vocabulary does not have a 1 to 1 mapping for some languages that changed over time, for example Old Spanish and Old Tamil, which have separate 639-3 codes but are included within Spanish and Tamil in the smaller vocabularies. Unlike the other two vocabularies, this one has a greater breadth and coverage of past and present languages.

ISO 639-5

ISO 639-5 [17] contains 115 codes and supplements the coding of language groups and families in ISO 639-2. Out of the 115 codes, 65 match ISO 639-2/MARC language codes and are either language group codes or remainder group codes. The former groups two or more individual languages as a unit, while the latter groups languages with the exclusion of specific languages that have separate identifiers. The intent is to support the current ISO 639 standards instead of providing a scientific classification of the languages of the world. With labels for these codes in English, the codes are hierarchical in nature and are intended to identify membership in language families.

[15] https://www.loc.gov/standards/iso639-2/php/code_list.php
[16] https://iso639-3.sil.org/code_tables/639/data
[17] https://www.loc.gov/standards/iso639-5/

IETF BCP-47

IETF BCP 47 [18] language tags are made up of a sequence of subtags separated by hyphens. The structure includes primary language subtags, extended language subtags, script subtags, region subtags, variant subtags, extension subtags, and private use subtags. Subtags are based on various other standards such as ISO 639 (i.e., 639-1, 639-2, 639-3, and 639-5), ISO 15924, ISO 3166-1, and UN M.49. Language, extended language, script, region, and variant subtags are listed in the IANA Language Subtag Registry. While capitalization is not significant, capitalization conventions are used. Language, extended language, and variant subtags are lower case, while script subtags capitalize the first letter followed by lower case letters, and two-letter region subtags are in upper case. The intent is to utilize current language codes to create more meaningful tags that can be either simple or complex.
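The capitalization conventions can be illustrated with a short Python sketch. It follows the formatting rule stated in the BCP 47 specification: all subtags are lower case except two-letter and four-letter subtags that neither start the tag nor follow a single-character (singleton) subtag. The function name is an assumption made for illustration, and the sketch does not validate subtags against the IANA registry.

```python
def format_language_tag(tag: str) -> str:
    """Apply the conventional BCP 47 capitalization to a hyphen-separated tag."""
    subtags = tag.split("-")
    formatted = []
    for index, subtag in enumerate(subtags):
        # Subtags after a singleton (such as "x" for private use) stay lower case.
        after_singleton = index > 0 and len(subtags[index - 1]) == 1
        if index == 0 or after_singleton:
            formatted.append(subtag.lower())
        elif len(subtag) == 2:
            formatted.append(subtag.upper())       # region, e.g. TW
        elif len(subtag) == 4:
            formatted.append(subtag.capitalize())  # script, e.g. Hant
        else:
            formatted.append(subtag.lower())
    return "-".join(formatted)


print(format_language_tag("ZH-hant-tw"))  # zh-Hant-TW
print(format_language_tag("sr-LATN-rs"))  # sr-Latn-RS
```

Since capitalization is not significant, systems comparing tags would still need to treat "zh-hant-tw" and "zh-Hant-TW" as the same tag; the formatting only aids human readers.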

Glottolog

Glottolog [19] provides codes for all languoids, which include language families, languages, and dialects. While it does match ISO 639-3 codes to Glottocodes, there is not always a 1 to 1 correlation. Glottocodes are composed of four letters or digits followed by four digits. Its focus seems to be on lesser known languages, including modern languages that are endangered and assumed languages. It does not provide coding for historical languages.

VIVO language ontology

The VIVO language ontology [20] is an open source ontology [21] that uses the ISO 639 language codes as its source. While it currently incorporates ISO 639-1 and ISO 639-2, it doesn’t appear that ISO 639-3 has been fully integrated into the ontology at this time. The intent of this ontology is to allow for identification of both written and spoken languages. Labels are already in English, French, and German, and the ontology supports the addition of labels in native languages and provides a model to convert ISO 639 language codes into a linked data framework.

Example [22]

[18] https://tools.ietf.org/html/bcp47
[19] https://glottolog.org/
[20] https://github.com/vivo-community/language-ontology
[21] https://duraspace.org/vivo/about/
[22] https://raw.githubusercontent.com/vivo-community/language-ontology/master/lang.owl

There are several issues with using the VIVO Language Ontology. At this time the ontology is still in beta and so may not provide the stability needed to support ISO 639-3, the URIs are not dereferenceable, and the ontology only contains a list of codes.

Wikidata

Wikidata, a “free and open knowledge base,” [23] provides another option for creating IRIs to support use of ISO 639-3 in a linked data environment. A property, P220 [24], already exists for ISO 639-3 codes, and Wikidata makes use of this property in its data. One example is the entry for Tunisian Arabic [25] (Q56240), which is linked to the ISO 639-3 code aeb.

Running a SPARQL query [26] to pull all of the codes used in Wikidata using the Wikidata Query Service brings up 8243 results, which is more than the 7868 codes defined in ISO 639-3. Much of the discrepancy is due to the inclusion in Wikidata of codes that have been deprecated in ISO 639-3. For example, Aramanik language [27] and Asa language [28] are two separate Wikidata entities, even though in ISO 639-3, Aramanik has been deprecated in favor of Asa language [29]. Some ISO 639-3 codes are used by more than one Wikidata entity and some Wikidata entities include more than one ISO 639-3 code. One explanation might be that there are duplicate entries for the same language in some cases. For example, Q7452602 [30] and Q30732002 [31] both represent the language Sera. These duplicate entries presumably should be merged in Wikidata. Another explanation might be that languages are broken down into more specific subsets for which no more specific code exists in ISO 639-3. For example, the Egyptian [32] and Late Egyptian [33] entities have the same ISO 639-3 language code, egy. In addition, at the time of this writing, 28 ISO 639-3 codes do not occur in Wikidata at all.

[23] https://www.wikidata.org/wiki/Wikidata:Main_Page
[24] https://www.wikidata.org/wiki/Property:P220
[25] https://www.wikidata.org/wiki/Q56240
[26] https://w.wiki/V2a
[27] https://www.wikidata.org/wiki/Q56541
[28] https://www.wikidata.org/wiki/Q56620
[29] https://iso639-3.sil.org/code/aam
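The check for codes shared by more than one Wikidata entity can be sketched in a few lines of Python. This is only an illustration of the comparison described above, not the query the Task Group ran: the function name is hypothetical, and the input is assumed to be a list of (entity ID, ISO 639-3 code) pairs already exported from the Wikidata Query Service. The sample pairs are taken from the examples in this section.

```python
from collections import defaultdict


def find_shared_codes(pairs):
    """Return the ISO 639-3 codes that are attached to more than one entity."""
    entities_by_code = defaultdict(set)
    for entity_id, code in pairs:
        entities_by_code[code].add(entity_id)
    return {
        code: sorted(entity_ids)
        for code, entity_ids in entities_by_code.items()
        if len(entity_ids) > 1
    }


# Tunisian Arabic, Egyptian, and Late Egyptian, as cited in the text.
sample = [("Q56240", "aeb"), ("Q50868", "egy"), ("Q1852329", "egy")]
print(find_shared_codes(sample))  # {'egy': ['Q1852329', 'Q50868']}
```

Each code reported this way would still need human review to decide whether it reflects a duplicate entry or a legitimate subdivision of a language.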

Advantages of ISO 639-3

Because ISO 639-3 contains only three-letter codes, it will be easier for some existing MARC-based systems to integrate. Prior to the introduction of a method for encoding non-MARC language codes in MARC in 2001, [34] only three-letter codes from the MARC list of languages were permitted and multiple languages were coded in a single subfield.

041 0# $d engfregerrus $e engfregerrus $h engfregerrus $g engfreger $h eng

Systems parsing these codes for display relied on the standard length of the codes to split out multiple codes in a single subfield. Many library databases may contain instances of this older method of language coding. Some systems still incorporate an expectation of three-letter codes into their parsing algorithm.
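A minimal Python sketch of this fixed-length parsing follows. The function name and the error handling are assumptions made for illustration; actual systems will differ in how they treat malformed subfields.

```python
def split_language_codes(subfield_value: str, code_length: int = 3) -> list:
    """Split a run of concatenated fixed-length language codes into a list."""
    if len(subfield_value) % code_length != 0:
        raise ValueError(
            f"{subfield_value!r} is not a whole number of {code_length}-letter codes"
        )
    return [
        subfield_value[start:start + code_length]
        for start in range(0, len(subfield_value), code_length)
    ]


print(split_language_codes("engfregerrus"))  # ['eng', 'fre', 'ger', 'rus']
```

A parser of this kind keeps working with ISO 639-3 codes because they are also three letters long, whereas a two-letter code or a longer BCP 47 tag in the same position would be split incorrectly or rejected.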

Because ISO 639-3 has more comprehensive coverage of spoken as well as written languages, it will allow specific coding of many languages that currently can only be recorded with group codes in the MARC vocabulary. With the availability of codes for languages within macrolanguages, it will also allow catalogers to make important distinctions (such as between Mandarin and Cantonese Chinese in film soundtracks) which cannot be encoded in a machine-readable way in MARC records that only use MARC language codes.

Implementing ISO 639-3

Mapping MARC Language Codes to ISO 639-3

There are 485 currently valid MARC language codes and 7867 currently valid ISO 639-3 codes. This section describes the process of mapping MARC language codes to ISO 639-3 and the potential alignment issues that we have identified.

[30] https://www.wikidata.org/wiki/Q7452602
[31] https://www.wikidata.org/wiki/Q30732002
[32] https://www.wikidata.org/wiki/Q50868
[33] https://www.wikidata.org/wiki/Q1852329
[34] https://www.loc.gov/marc/marbi/2001/2001-06.html

Matching codes

Exact matches

419 of the 485 MARC language codes (86%) directly or indirectly match ISO 639-3 codes. In 306 cases, both the code and the label match. In 93 cases, the code matches, but the label does not. Most of these variations involve order or parenthetical qualifications and are obviously not substantive (e.g., "Ainu (Japan)" vs. "Ainu" or "Old English (ca. 450-1100)" vs. "English, Old (ca. 450-1100)"). The rest are intended to have equivalent meanings, [35] although this possibly should be reviewed.

ISO 639-2B vs. ISO 639-2T terms

ISO 639-2 has two flavors: ISO 639-2B (bibliographic) and ISO 639-2T (terminology). The MARC language code list is the same as ISO 639-2B. The two ISO 639-2 lists are intended to be equivalent, but there are twenty cases where they use different codes for the same language. For example, the ISO 639-2B/MARC code for Chinese is “chi” while ISO 639-2T uses “zho.” In the cases where the language codes vary, ISO 639-3 uses the ISO 639-2T value. Therefore, although there is a 1:1 relationship between the MARC language code and the ISO 639-3 code, the values are not the same. This does not present a problem for display, but may be a challenge for search and indexing in some systems if they have to deal with databases that include values from both schemes. Systems would need a mechanism to collapse the equivalent ISO 639-2B/MARC and ISO 639-3 codes into a single value. There are a couple of machine-readable sources that could help with these mappings. SIL International, the organization that maintains ISO 639-3, provides tab-delimited, UTF-8 text files for download that include mappings of ISO 639-3 to the other ISO 639 standards. [36] Wikidata also appears to provide links between the two vocabularies (e.g., https://www.wikidata.org/wiki/Q7850).
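One possible collapsing mechanism is a simple lookup applied at indexing time, sketched below in Python. Only the chi/zho pair comes from this report; the other three pairs are further examples from the ISO 639-2 code list, and a full table would hold all twenty. The names `B_TO_ISO_639_3` and `normalize_code` are hypothetical.

```python
# Partial table of ISO 639-2B/MARC codes whose ISO 639-3 equivalent differs.
# A production table would contain all twenty pairs.
B_TO_ISO_639_3 = {
    "chi": "zho",  # Chinese
    "fre": "fra",  # French
    "ger": "deu",  # German
    "dut": "nld",  # Dutch
}


def normalize_code(code: str) -> str:
    """Map an ISO 639-2B/MARC code to its ISO 639-3 form; pass other codes through."""
    code = code.lower()
    return B_TO_ISO_639_3.get(code, code)


print(normalize_code("chi"))  # zho
print(normalize_code("zho"))  # zho
print(normalize_code("eng"))  # eng
```

Indexing the normalized value alongside the original would let a search on either "chi" or "zho" retrieve the same records without altering the stored data.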

Complexities with matching codes

There are a couple of situations where naive matching of equivalent ISO 639-3 and MARC language codes is potentially sub-optimal.

Macrolanguages

ISO 639-3 includes 62 codes for “macrolanguages.” 58 of these have an equivalent MARC language code. Macrolanguage codes are used to identify “clusters of closely-related language varieties that ... can be considered distinct individual languages, yet in certain usage contexts a single language identity for all is needed.” [37] Examples of macrolanguage codes include ara (Arabic), zho (Chinese), nor (Norwegian) and hbs (Serbo-Croatian).

There are three possible types of matches between MARC language codes and ISO 639-3 macrolanguages:

[35] “The alpha-3 identifiers for these two non-intersecting sets of code elements are guaranteed to be distinct; that is, every alpha-3 language identifier has a single denotation across the union of code elements from all parts of ISO 639.” (https://iso639-3.sil.org/about/relationships)
[36] https://iso639-3.sil.org/code_tables/download_tables
[37] https://iso639-3.sil.org/about/scope#Macrolanguages

● MARC includes only the equivalent for the macrolanguage (e.g., Arabic, Chinese, Latvian)

● MARC includes both the equivalent for the macrolanguage and some or all of the individual languages contained in the macrolanguage (e.g., Norwegian)

● MARC includes only the equivalent for the individual languages contained in the macrolanguage (e.g., Serbo-Croatian)

There are three issues related to macrolanguages that need to be considered.

1. Impact on systems

Individual languages that are part of macrolanguages need to search and facet both as individual languages and as part of the macrolanguage for optimal usability. Users should be able to find things with the ISO 639-3 code “cmn” both when limiting to Mandarin (e.g., if they are searching for videos with Mandarin soundtracks) and when limiting to Chinese (e.g., if they want to see all the Chinese language resources recently acquired by the library or that address a certain topic). This collocation problem is similar to that discussed in the previous section on ISO 639-2B vs. ISO 639-2T except that it requires a single value to be mapped to multiple values rather than multiple values mapping to a single value.

2. Policies for cataloging practice

Policies would need to be developed for when to use macrolanguage codes and when to use individual language codes and, possibly, when to use both. For example, the ISO 639-3 macrolanguage code lav (Latvian) corresponds to the MARC language code lav (Latvian), but ISO 639-3 also includes the more specific code lvs (Latvian, Standard) which is considered to be a language within that macrolanguage. Either ISO 639-3 code could reasonably be applied to a resource in Latvian. Catalogers would need guidance on how to apply these codes and others like them going forward.

3. Conversion

Because there are no MARC language code equivalents for most of the individual languages that are part of ISO 639-3 macrolanguages, any attempts to automate incorporating the more specific codes into existing MARC records will be more complex.
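The collocation requirement described under "Impact on systems" amounts to mapping one stored code to several index values. A minimal Python sketch, under the assumption that a system holds a table from individual language codes to their macrolanguage, is shown below. The table is a small illustrative subset: cmn/zho and lvs/lav appear in this report, while yue (Cantonese) is added as a further ISO 639-3 example. The names are hypothetical.

```python
# Illustrative subset of individual-language to macrolanguage relationships.
MACROLANGUAGE_OF = {
    "cmn": "zho",  # Mandarin Chinese -> Chinese
    "yue": "zho",  # Cantonese -> Chinese
    "lvs": "lav",  # Standard Latvian -> Latvian
}


def facet_codes(code: str) -> list:
    """Return every code a record should be indexed under for search and faceting."""
    macrolanguage = MACROLANGUAGE_OF.get(code)
    if macrolanguage is None:
        return [code]
    return [code, macrolanguage]


print(facet_codes("cmn"))  # ['cmn', 'zho']
print(facet_codes("eng"))  # ['eng']
```

With this kind of expansion, a record coded "cmn" would be found both by a limit to Mandarin and by a limit to Chinese, while the record itself stores only the more specific code.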

Historic languages

ISO 639-3 includes 83 languages that it classifies as “historic.” These languages are “considered to be distinct from any modern languages that are descended from [them]: for instance, Old English and Middle English.” [38] Only sixteen of these codes are also present in the MARC language list. Many of the remaining 67 ISO 639-3 historic languages are currently coded in MARC using the code that ISO 639-3 defines only for the modern, living language. For example, some things that are coded “spa” for Spanish in MARC should be coded “osp” (Old Spanish) in ISO 639-3. This is another area where conversion is not straightforward. However, this is not a new problem. Many current MARC bibliographic records, especially those retrospectively converted from cards, fail to use the existing MARC codes for historic languages when they should. It would also be useful to have a machine-actionable way to group the temporal variants for retrieval (e.g., Korean, Middle Korean, Old Korean). A potential issue for consistently applying these codes is that only the ones that match the MARC codes include date ranges in their labels.

[38] https://iso639-3.sil.org/about/types#Historic

Unmatched codes

There are 66 MARC codes that do not match ISO 639-3 codes. 65 of these match codes in ISO 639-5, [39] which contains language families and groups. However, despite the fact that the same codes are used in both lists, some of them are defined differently and thus cannot be used interchangeably. Each of the ISO 639-5 codes describes a language grouping as a whole, including all of its members. Some of the matching MARC codes are also defined this way and thus are presumably semantically equivalent (e.g., “aus” for Australian languages). However, some of the MARC language codes are defined to include only those languages in a group that don’t have their own individual code. The MARC language code “fiu” (Finno-Ugrian (Other)) is a collective code for Ingrian, Khanty, Livonian, Ludic, Mansi, Mordvin and Veps. The ISO 639-5 code “fiu” is labeled “Finno-Ugrian languages” and covers all the languages in this family, including not just the obscure ones, but also more well-known examples like Finnish and Hungarian.

Ideally, the MARC codes for language groupings would be converted to ISO 639-3 codes for the appropriate individual languages. However, this is challenging to do in an automated, scalable fashion. In some cases, specific languages could be identified in notes or subject headings,40 but some records will require manual review. Databases that include mixed practices where some records are coded for the specific language and some only for language groups will be frustrating and confusing for users. This could be mitigated by the method suggested above for macrolanguages where a search for the group code is designed to include the component languages.
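The difference between a residual MARC group code and an ISO 639-5 family code, together with the search expansion suggested above, can be sketched as follows. This is a minimal Python illustration under stated assumptions: `GROUP_MEMBERS` and `expand_group` are hypothetical names, the scheme labels "marc" and "iso639-5" are used only as dictionary keys, and the membership sets are a small excerpt rather than complete lists.

```python
# Hypothetical excerpt: languages covered only by the residual MARC code.
MARC_OTHER = {"liv", "vep"}  # Livonian, Veps

# The same three-letter code means different things in each scheme.
GROUP_MEMBERS = {
    ("marc", "fiu"): MARC_OTHER,                        # Finno-Ugrian (Other)
    ("iso639-5", "fiu"): MARC_OTHER | {"fin", "hun"},   # the whole family
}


def expand_group(scheme, code):
    """Return the set of codes a search for (scheme, code) should match."""
    members = GROUP_MEMBERS.get((scheme, code), set())
    return {code} | members
```

Keying the table by scheme as well as code keeps the two definitions of "fiu" from being treated as interchangeable.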

One MARC language code (“him” for Western Pahari languages) lacks a corresponding code in 639-3 or 639-5, but this seems likely to be an oversight and it should have been mapped to 639-5.

Codes only in ISO 639-3

There are 7448 codes in ISO 639-3 that do not have an exact match in the MARC language code list. We have not identified any potential problems with adding these to MARC records other than the issues discussed above.

General Issues for Implementing ISO 639-3

Lack of cross-references

Like most language code vocabularies other than MARC, ISO 639-3 lacks a set of cross-references designed to help catalogers identify the appropriate code(s) for the resource they are describing. For languages that they're not familiar with, catalogers would probably need to do research directly in Ethnologue, which is not free, or in Wikipedia or other sources that reproduce some of its content. This would include determining what the correct 639-3 code would be for some of the language names that are listed as references in MARC but are apparently not in 639-3 (for example, "Čakavian" is listed in MARC as a reference under the collective code for Slavic (Other), but turns out to be listed in 639-3 under a different spelling as a separate language "Chakavian"; "Surzhyk" (mixed Russian-Ukrainian) is also listed in MARC under Slavic (Other), but is not considered to be a valid language in ISO 639-3 and is not included).

39 https://www.loc.gov/standards/iso639-5
40 https://scholars.sil.org/sites/scholars/files/gary_f_simons/poster/marc-to-olac.pdf

Number of language codes

Language codes are generally mapped to labels for use in discovery interfaces or dropdown lists in staff interfaces. A switch to ISO 639-3 or RFC 5646 would increase the size of this mapping table significantly. There are 485 currently valid MARC language codes, but over 7800 in ISO 639-3. This may be an issue for maintenance or for processing in some systems (e.g., Ex Libris’ Primo has a limit on the number of unique values that can be defined for a “static” facet).41 The Task Group believes that linked data-based systems should be better equipped to handle maintenance of these mappings, but the number of codes may still present a challenge for retrieval in some situations.

Implementing ISO 639-3 for Languages Associated with the Resource in MARC Records

The language of the resource is recorded in 008/35-37,42 field 041,43 and field 37744 in the MARC bibliographic record.

Non-MARC language codes can be used in addition to or instead of MARC language codes in the 041 field in the bibliographic format and the 377 field in the bibliographic and authority formats. The source of the language code is identified in subfield $2, and a source code for ISO 639-3 (iso639-3) is already defined. Since the infrastructure is in place, the remaining difficulties are related to the ways that existing systems handle language data in MARC records.

The one implementation issue that is unique to the MARC environment is the inability to use codes from non-MARC language schemes in the language fixed field (008/35-37). Non-MARC language codes can only be used in field 041. If only non-MARC language codes are used in a record, the MARC format says to use fill characters in field 008. Although most systems likely use a combination of language codes from fields 008 and 041, it is possible that there are systems that only make use of the language code in field 008.
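The interaction between the fixed field and field 041 can be illustrated with a short Python sketch. It is not a MARC library call: `language_fields` is a hypothetical helper, the first indicator is fixed at 0 as in the Appendix examples, and a `source` of `None` is an assumed convention meaning that the codes come from the MARC language list.

```python
def language_fields(codes, source=None):
    """Return (008/35-37 value, 041 field string) for codes from one scheme."""
    subfields = " ".join(f"$a {code}" for code in codes)
    if source is None:
        # MARC codes: the first code goes in 008/35-37 and the
        # second indicator of field 041 is blank (shown as #).
        return codes[0], f"041 0# {subfields}"
    # Non-MARC codes: fill characters in 008/35-37, second indicator 7,
    # and the scheme named in subfield $2.
    return "|||", f"041 07 {subfields} $2 {source}"


print(language_fields(["eng", "fre"]))
print(language_fields(["ase"], source="iso639-3"))
```

A system that reads only 008/35-37 would see nothing but fill characters for the second record, which is the risk described above.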

Implementing ISO 639-3 in BIBFRAME Records

The Task Group’s preliminary work looked at the ISO codes converted to URIs by the LC Linked Data Service,45 the VIVO Language Ontology,46 and Wikidata.

The LC Linked Data Service includes URIs for MARC language codes, ISO 639-1, ISO 639-2, and ISO 639-5. While it does not include ISO 639-3, a URI was created to represent the ISO 639-3 code source. Using only this service for BIBFRAME and other linked data applications would currently exclude the ISO 639-3 standard. Because SIL International, the organization that maintains ISO 639-3, has not published that vocabulary as linked data, many providers of linked data have used the URIs published by Lexvo.47 However, this website no longer seems to be functional.

41 https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Primo/Back_Office_Guide/100Facets
42 https://www.loc.gov/marc/bibliographic/bd008a.html
43 https://www.loc.gov/marc/bibliographic/bd041.html
44 https://www.loc.gov/marc/bibliographic/bd377.html
45 https://id.loc.gov/
46 https://github.com/vivo-community/language-ontology

If the MARC language codes are replaced, the alternative should have RDF IRIs ready for consumption or easily generated. One option, the VIVO Language Ontology, was created to represent the languages in the ISO standards 639-1, 639-2, and 639-3. This ontology models language under two major classes, “continuants” and “occurrents”. Continuants are “recorded text or media,” while occurrents are performed works.48 The ontology acknowledges that an occurrent could result in a continuant. Another option would be to use Wikidata to create IRIs. IRIs from either of these options might work for incorporating ISO 639-3 codes into bibliographic data, but further research is needed.

The LC Linked Data Service, ID.LOC.GOV,49 provides both interactive and machine access to commonly used ontologies, controlled vocabularies, and other lists for bibliographic description. It may be desirable for ID.LOC.GOV to host a version of ISO 639-3 as linked data for use by the library community. This would avoid the challenge of trying to maintain a clean mapping with Wikidata or another source. It would also allow the incorporation of cross-references, and possibly usage guidelines, for the library community.

Implementing BCP47 to tag literals in BIBFRAME

It is not currently possible to identify in a machine-readable way the language, script, etc. of individual elements of a MARC record, and we do not recommend any changes to practice in MARC. BIBFRAME shows more promise in this respect, however, and we do recommend that as part of BIBFRAME implementation the possibility of encoding this sort of information, to the extent desired in different contexts, should be explored.

Unlike in MARC 21, which allows the identification of the language in only a small number of fields and subfields listed in the previous section, BIBFRAME makes it possible to identify the language, script and transliteration of data in nearly all elements. The RDF standard specifies that the language of literals can be identified with a language tag as defined by BCP 47.50 As of June 15, 2020, the Library of Congress’ internal version of BIBFRAME Editor51 provides two separate drop down menus for identification of language and script of numerous elements, including title information, statement of responsibility, edition statement, transcribed provider statement, series statement, and various notes. While both sets of values are from the IANA Language Subtag Registry52, the first value consists of two letters representing the language and the second value consists of four letters representing the script. Following BCP 47’s convention, the combination of the two values connected by a hyphen expresses explicitly the language and script of the information, e.g. zh-Hant for Chinese written in traditional Chinese script, ja-Hani for Japanese written in Kanji, ko-Hang for Korean written in Hangul.

47 http://www.lexvo.org/ (Website down as of 2020-04-30. See Internet Archive for archived version: https://web.archive.org/web/20191118130121/http://www.lexvo.org/)
48 https://docs.google.com/document/d/12WUgUiAWV3nS3nHJPKtHdUqp6nQsvozoHpO2FQXqTYQ
49 http://id.loc.gov/
50 https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
51 http://mlvlp04.loc.gov:3000/bfe/index.html
52 https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Below are examples of how BCP 47 is used to express languages and scripts of a literal that might look identical but actually represents different languages and scripts:

日本人

Language    IANA code for script    Romanization    IANA code for romanization
Chinese     zh-Hant                 Riben ren       zh-Latn
Japanese    ja-Hani                 Nihonjin        ja-Latn
Korean      ko-Hani                 Ilbonin         ko-Latn
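Joining a language subtag and a script subtag with a hyphen, as in the tags shown for 日本人, can be sketched in a few lines of Python. The function name `language_script_tag` is hypothetical, and the sketch checks only the shape of each subtag (two or three letters for the language, four for the script); it does not consult the IANA Language Subtag Registry.

```python
import re


def language_script_tag(language, script):
    """Join a language and a script subtag, e.g. ("zh", "Hant") -> "zh-Hant"."""
    if not re.fullmatch(r"[A-Za-z]{2,3}", language):
        raise ValueError(f"not a 2- or 3-letter language subtag: {language!r}")
    if not re.fullmatch(r"[A-Za-z]{4}", script):
        raise ValueError(f"not a 4-letter script subtag: {script!r}")
    # Conventional casing: lowercase language, title-case script.
    return f"{language.lower()}-{script.capitalize()}"


for lang in ("zh", "ja", "ko"):
    print(lang, language_script_tag(lang, "Latn"))
```

BCP 47 tags are compared case-insensitively, so the casing applied here is a readability convention rather than a requirement.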

If desired, use of BCP 47 could also be expanded to indicate also the romanization scheme used in strings containing romanized text, for example distinguishing between Korean text romanized according to the ALA-LC table and text romanized according to the Korean government's romanization system.

Below is an example of how BCP 47 language subtags are contained in the RDF of BIBFRAME:

_:b0_b1 a bf:Title;
    rdfs:label "한국사";
    bf:mainTitle "한국사"@ko-kore.
_:b0_b2 a bf:VariantTitle;
    bf:mainTitle "韓國史"@ko-hani.

In the BCP 47 registry, some languages normally written in just one script are listed with a 'Suppress-Script' field indicating that a script subtag will not add distinguishing information for that language and should not be used. For example the registry indicates that the subtag 'Latn' should not be used with the primary language 'en' because nearly all English documents are written in the Latin script. The implications of this restriction (if it needs to be followed) for the coding and retrieval of library data will require further investigation.
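If the restriction were followed, an application could drop a redundant script subtag before storing or comparing tags. The Python sketch below assumes a locally maintained excerpt of the registry's Suppress-Script values; `SUPPRESS_SCRIPT` and `apply_suppress_script` are hypothetical names, only the English example from the text is included, and tags with an extended language subtag between the language and the script are not handled.

```python
# Hypothetical excerpt of Suppress-Script values from the IANA registry.
SUPPRESS_SCRIPT = {"en": "Latn"}


def apply_suppress_script(tag, table=SUPPRESS_SCRIPT):
    """Drop the script subtag when the registry marks it as redundant."""
    parts = tag.split("-")
    # A script subtag, if present, is the four-letter subtag after the language.
    if len(parts) >= 2 and len(parts[1]) == 4 and parts[1].isalpha():
        suppressed = table.get(parts[0].lower(), "")
        if suppressed.lower() == parts[1].lower():
            del parts[1]
    return "-".join(parts)
```

Whether library data should normalize tags this way, or keep the explicit script for retrieval purposes, is exactly the open question noted above.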

BIBFRAME will need to be able to support mixed practices, including non-Latin script, romanization, or both, to meet the needs of different library communities and to accommodate both legacy and newly created data.53 However, current BIBFRAME cataloging practice is geared toward using less transliteration. In the current LC BIBFRAME54 phase 2 pilot program, access points are romanized, but other parts of the bibliographic description are recorded in the script of the resource.

53 CC:AAM Statement in Support of the Internationalization of BIBFRAME
54 https://www.loc.gov/bibframe/

While it should be possible to tag language, script and transliteration for any BIBFRAME element, we recommend tagging the following BIBFRAME elements in order to maximize resource discovery while keeping the cataloging effort economical:

For text in its original script:

● Title Information
● Statement of Responsibility
● Edition Statement
● Transcribed provider Statement
● Series Statement
● Notes (contents, summary etc.)

For transliterated text:

● Creator
● Subject
● Form/Genre
● Contributor

We recognize that BCP 47 is a very complex standard. It will be challenging for cataloging interfaces to integrate and validate. However, we see considerable promise in continuing and expanding the use of this type of coding to tag literals in library data.

Steps toward Implementation

The Task Group recommends the following steps be taken for technical implementation of ISO 639-3, BCP 47, and ISO 15924.

Bidirectional text. Further investigation into coding bidirectional text strings is recommended for BIBFRAME and other linked data services. The Task Group has concerns about how the language subtags are put together. While MARC 21 supports bidirectional text through the use of markers (left to right, right to left, and Arabic), RDF does not currently have a way to indicate bidirectional text.55 More research and exploration of the best way to encode bidirectional text in BIBFRAME is needed.

Canonical URIs. We recommend exploring and testing use of ID.LOC.GOV to support the creation of URIs to represent the individual language codes within ISO 639-3 and possibly BCP 47 and ISO 15924 to support their use in BIBFRAME. Support and expertise could be provided by the PCC LDAC, LD4 community, and PCC Wikidata pilot.

55 https://w3c.github.io/rdf-dir-literal/

Community feedback. We recommend soliciting community feedback including vendor and ILS provider feedback. As mentioned above, this change includes using two language codes in tandem in MARC during the transition to BIBFRAME along with supporting a larger set of language codes. In particular, we should ascertain the extent to which vendors and ILS providers will be able to effectively support the incorporation of ISO 639-3 codes into discovery interfaces.

Conversion of codes. The BIBFRAME system will need to be able to convert ISO 639-3 codes into the appropriate MARC language codes when MARC records are created from BIBFRAME records. We also recommend investigating strategies for converting MARC language codes into ISO 639-3 codes in MARC records, either as a large-scale project in a database like WorldCat, or on a record-by-record basis, as with the OCLC Music Toolkit.56

Policies and documentation. PCC policies and documentation listed in the Appendix should be updated as needed to account for the use of the ISO 639-3 and BCP 47 codes. We also recommend the creation of new policies and training documentation, especially in applying BCP 47 in BIBFRAME.

Scripts. Further investigation is needed into how ISO 15924 could be implemented in a linked data environment, since canonical URIs do not seem to be currently available for this vocabulary.
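The conversion step recommended above (ISO 639-3 to MARC when MARC records are generated from BIBFRAME) could start from two lookup tables, one for exact equivalents and one for broader fallbacks. The following Python sketch is illustrative only: `MARC_EQUIVALENT`, `BROADER_MARC` and `to_marc` are hypothetical names, and both tables are small excerpts built from examples in this report plus the French code pair, not an authoritative mapping.

```python
# Hypothetical excerpt: ISO 639-3 codes with an exact MARC equivalent.
MARC_EQUIVALENT = {"eng": "eng", "spa": "spa", "fra": "fre"}

# Hypothetical excerpt: ISO 639-3 codes that fall back to a broader MARC code.
BROADER_MARC = {
    "osp": "spa",  # Old Spanish -> Spanish
    "ase": "sgn",  # American Sign Language -> Sign languages
    "liv": "fiu",  # Livonian -> Finno-Ugrian (Other)
}


def to_marc(iso_code):
    """Return (marc_code, exact); marc_code is None if no mapping is known."""
    if iso_code in MARC_EQUIVALENT:
        return MARC_EQUIVALENT[iso_code], True
    if iso_code in BROADER_MARC:
        return BROADER_MARC[iso_code], False
    return None, False
```

Returning an `exact` flag lets a conversion routine record when information was lost, which matters because the reverse conversion from a broader MARC code cannot recover the specific language automatically.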

56 http://cmc.blog.musiclibraryassoc.org/2018/04/20/new-oclc-music-toolkit-for-generating-faceted-music-data/

Appendix

MARC Documentation Related to Non-MARC Language Schemes

Field 041 examples 57

$2 - Source of code: Source of the language code scheme used in the field. Code from: Language Code and Term Source Codes.

If a non-MARC code is used to express the predominant language in an item, field 008/35-37 is coded with three fill characters (| | |).

If more than one code scheme is used in a record, repeat the field.

008/35-37 |||
041 07 $a en $a fr $a it $2 iso639-1

008/35-37 eng
041 0# $a eng $a fre
041 07 $a en $a fr $2 iso639-1
[Two language code schemes are used and field 041 is repeated.]

$r - Language code of accessible visual language (non-textual)

Language codes for visual language (non-textual) used to provide alternative access to the audio content of a resource. For example, signed languages.

041 0# $a eng $r sgn
041 07 $r ase $2 iso639-3
[An English language resource where audio is the primary mode of access, but alternate access is provided with picture-in-picture American Sign Language.]

For resources where signed language is the primary mode of access, subfield $a should be used to record the language code for signed language.

041 0# $a sgn
041 07 $a ase $2 iso639-3
[A resource where American Sign Language is the primary mode of access.]
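A system that consumes records like the examples above has to pair each code with its scheme. The Python sketch below parses the display form used in these examples; `parse_041` is a hypothetical helper rather than part of any MARC library, and the label "marc" for codes without a subfield $2 is an assumed convention.

```python
def parse_041(line):
    """Parse a display-form 041 field into (subfield, code, source) triples."""
    tag, indicators, *tokens = line.split()
    if tag != "041":
        raise ValueError("not an 041 field")
    # Tokens alternate between a subfield marker ($a, $r, $2) and its value.
    pairs = [
        (tokens[i].lstrip("$"), tokens[i + 1])
        for i in range(0, len(tokens) - 1, 2)
    ]
    # Second indicator 7 means the scheme is named in $2; otherwise MARC codes.
    if indicators[1] == "7":
        source = dict(pairs).get("2", "marc")
    else:
        source = "marc"
    return [(sf, code, source) for sf, code in pairs if sf != "2"]


print(parse_041("041 07 $r ase $2 iso639-3"))
```

Because field 041 is repeated once per scheme, each parsed field carries a single source, and a record's full language list is the concatenation of the results for all of its 041 fields.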

57 https://www.loc.gov/marc/bibliographic/bd041.html

Language Code and Term Source Codes58

Language code

Language Term

austlang AUSTLANG (Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS))

din2335 Sprachenzeichen: DIN 2335 (Berlin: Beuth)

glotto Glottolog

iso639-1 Codes for the representation of names of languages--Part 1: Alpha-2 code (ISO 639-1:2002) (Geneva: International Organization for Standardization)

iso639-2b Codes for the representation of names of languages--Part 2: Alpha-3 code (ISO 639-2B:2002) (Geneva: International Organization for Standardization). [The bibliographic language codes are identical to both NISO Z39.53 and the MARC Code List for Languages.]

iso639-3 Codes for the representation of names of languages--Part 3: Alpha-3 code for comprehensive coverage of languages (Geneva: International Organization for Standardization)

knia Kody naimenovanii iazykov: GOST 7.75-97 (Minsk: Mezhgosudarstvennyi sovet po standartizatsii, metrologii i sertifikatsii)

rfc3066 Tags for the identification of languages (January 2001) (The Internet Society) [replaced by RFC 4646 and RFC 4647]

rfc4646 Tags for identifying languages (September 2006) (The Internet Society) [In combination with RFC 4647, replaces RFC 3066. A language identifier as specified by the Internet Best Current Practice specification RFC4646 . This document gives guidance on the use of ISO 639-1, ISO 639-2, and ISO 639-3 language identifiers with optional secondary subtags and extensions. Replaced by RFC 5646.]

rfc5646 Tags for Identifying Languages (September 2009) (The Internet Society) [In combination with RFC 5645, replaces RFC 4646. A language identifier as specified by the Internet Best Current Practice specification RFC5646 . This document gives guidance on the use of ISO 639-1, ISO 639-2, ISO 639-3, and ISO-639-5 language identifiers.]

walso The World atlas of language structures online

58 https://www.loc.gov/standards/sourcelist/language.html

PCC policies and documentation

The Task Group identified the following PCC policies and documentation that would need to be revised:

● CONSER Editing Guide (CEG) 59

● CONSER Cataloging Manual (CCM) 60

● CONSER Standard Record (CSR) RDA Metadata Application Profile 61

● Descriptive Cataloging Manual (DCM), Section Z1 62

● Integrating Resources Cataloging Manual 63

● LC-PCC PS 6.11.1.3 64

● LC-PCC PS 7.13.2.3 65

● MARC 21 Encoding to Accommodate RDA Elements in 046, 3XX, 672, 673, and 678 Fields in NARs and SARs 66

● PCC Guidelines for Creating Bibliographic Records in Multiple Character Sets 67

● PCC Provider-Neutral E-Resource MARC Record Guidelines 68

● PCC RDA BIBCO Standard Record (BSR) Metadata Application Profile 69

● Training Materials for the Basic Serials Cataloging Workshop 70

● Training Materials for Serials Holdings Workshop 71

● Training Materials for the Integrating Resources Cataloging Workshop 72

59 https://www.loc.gov/aba/pcc/conser/more-documentation.html#CEG
60 https://www.loc.gov/aba/pcc/conser/more-documentation.html
61 https://www.loc.gov/aba/pcc/conser/documents/CONSER-RDA-CSR.pdf
62 https://www.loc.gov/catdir/cpso/dcmz1.pdf
63 https://www.loc.gov/aba/pcc/conser/word/Module35.doc
64 http://access.rdatoolkit.org/lcpschp6_lcps6-348.html
65 http://access.rdatoolkit.org/document.php?id=lcpschp7&target=lcps7-186#lcps7-186
66 https://www.loc.gov/aba/pcc/rda/PCC%20RDA%20guidelines/RDA%20in%20NARs-SARs_PCC.doc
67 https://www.loc.gov/aba/pcc/bibco/documents/PCCNonLatinGuidelines.pdf (Does not discuss language codes, but would need to be updated if we implemented the use of new codes for scripts.)
68 https://www.loc.gov/aba/pcc/scs/documents/PCC-PN-guidelines.html
69 https://www.loc.gov/aba/pcc/bibco/documents/PCC-RDA-BSR.pdf
70 https://www.loc.gov/aba/pcc/conser/scctp/basicppt.html
71 https://www.loc.gov/aba/pcc/conser/scctp/HoldingsSlides.html
72 https://www.loc.gov/aba/pcc/conser/scctp/ir-trainmaterials.html
