Vol. 5 (2011), pp. 208-233
http://nflrc.hawaii.edu/ldc/

http://hdl.handle.net/10125/4499

Licensed under Creative Commons Attribution Non-Commercial Share Alike License
E-ISSN 1934-5275

To BOLDly Go Where No One Has Gone Before

Brenda H. Boerger

SIL Pacific Area
Graduate Institute of Applied Linguistics

In this article, I report on a survey designed as the first step in testing claims made regarding the potential of Basic Oral Language Documentation (BOLD) for addressing the urgency of the documentation task. BOLD was developed in response to a number of language documentation challenges, its aim being to design a time-effective way to obtain a core data corpus, thereby allowing for more endangered languages to be documented faster. After providing background about BOLD and its claims, I report on its use in six field projects which had varying durations and goals. These preliminary results confirm BOLD’s overall soundness, while suggesting minor adjustments in design and protocols. I invite the language documentation community to participate in BOLD in three ways: (1) make BOLD corpora of undocumented languages a funding priority, (2) use it and require it of students, and (3) help refine BOLD best practices. Since the current rate of new documentations is not keeping pace with language loss, it is only by adopting this or a similar strategy that speech practices of communities around the world can be documented before it is too late.

1. NEED WHICH BOLD ADDRESSES.1 Ever since the wake-up call sounded by Hale and others in the early 1990s (Hale et al. 1992), linguists have had a growing concern about the speed with which languages and cultures are being absorbed into an increasingly interconnected, globalized world.

Just over ten years later, Woodbury (2003) reminds us:

Many speakers of endangered languages...where the heritage language has already been lost, have described the loss as a loss of identity, and as a cultural, literary, intellectual, or spiritual severance from ancestors, community and territory; and as an example or symbol of the domination of the more powerful over the less powerful (2003:28).

In that same article he describes plans for a community-based project among the Cup’ik speakers of Central Alaskan Yupik [esu], to create documentation for a “huge collection of tapes.” They concluded that creating written transcriptions, translations, and interlinear annotation was not a realistic strategy. So, rather than transcribing everything, they planned to make a greater quantity of material accessible by “starting with hard-to-hear tapes and asking elders to respeak them to a second tape slowly so that anyone with training in hearing the language can make the transcription if they wish” (Woodbury 2003:45). And rather than crafting written translations, they planned to record “running UN style translations of many more materials.”

1 I would like to thank the respondents to my survey: Jeremiah Aviel, D. Will Reiman, Eldwin Truong, Carla Unseth, and Angela Williams-Ngumbu, as well as Daniel Boerger, Paul Kroeger, Susan Schmerling, Gary Simons, and the anonymous reviewers of this article, whose constructive input led to tightening and clarification. As always, any errors or misinterpretations remain my responsibility.

Language Documentation & Conservation Vol. 5, 2011

Even as recently as 2006, Mark Liberman said,

We need effective applied research of several kinds: to determine how much primary data is really needed for adequate documentation; to radically increase the amount of primary data that is collected and analyzed for a given amount of investment; and perhaps to increase the documentary value of collected data, so as to decrease the amount of primary data that is needed to reach an adequate level of documentation.

This is a very difficult set of problems, but unless they are solved, the goal of language documentation will remain out of reach. Any solution will require significant improvements in project design, organization, and productivity, with close collaboration among computational linguists, field linguists, and the affected speech communities. The scope of the problem also motivates more extensive involvement of more of the world’s linguists, in a process that would bring benefits to all parties (Liberman 2006).

Basic Oral Language Documentation (BOLD) was developed in response to the methodological challenges of Liberman, and the promising techniques and the depth of loss discussed by Woodbury, as well as the urgency expressed by Hale et al., along with others who have been saying many of the same things. Its aim has been to design an efficient way to obtain a core language data corpus.2 In this article, I provide a short history of BOLD for the record in section 2, define its parameters and claims in section 3, and distinguish it from BOLD:PNG in section 4. Then in sections 5 and 6 I report on results from a survey of six teams who used BOLD protocols in the field, and discuss how their experiences indicate what profitable revisions might be made to BOLD with regard to assumptions, procedures, time requirements, personnel, and the amount of data targeted. I find that BOLD has potential for speeding the task of language documentation, and conclude in section 7 by inviting the language documentation community to support its use and development.

2 A documentary corpus (its plural being corpora) of a language can have two possible readings. First, it refers to one submission to an archive comprising the data collected with respect to a specific language or speech community (its texts and wordlists, along with their glossing at all levels, including all relevant metadata), the body of which is compiled by one individual or group at one point in time. By this definition, an individual might submit two documentation corpora for the same language several years apart, but each would constitute a separate corpus for archiving. A second reading of “documentary corpus of a language” refers to the totality of materials archived for that language, whether held at one archive or distributed over several; that is, what is known about language X. As outlined here, a BOLD corpus necessarily contains texts, but this definition of documentary corpus does not preclude a (non-BOLD) corpus composed exclusively of a wordlist or dictionary and its annotations. That is, a documentary corpus can be composed of whatever the documenter elects to include. The documentary use of corpus contrasts with two further uses of corpus in linguistics: (1) its use in text linguistics, where a text corpus refers to a set of texts used for testing grammatical hypotheses and doing statistical counts of structures; and (2) its use in speech analysis, where a speech corpus refers to a body of digital audio files and their written transcriptions, which are used to do phonetic, acoustic, conversational, and dialectal analyses. These latter two uses are logged by Wikipedia at http://en.wikipedia.org/wiki/Corpus (accessed 13 August 2011) without being distinguished from documentary corpora, which have different primary motivations, but which often include both digital audio and textual recordings.

2. HISTORY AND DEVELOPMENT OF BOLD. The design parameters of BOLD (Simons 2008) were to have speakers of the language comment orally on language data, to estimate how much data forms a primary corpus, to assure the value of what is collected, and to make the recordings and annotations in two months or less. The ultimate goal was to make it possible to document more languages before they and their cultural context are lost. If such faster strategies and methodologies are not identified and used, we are likely to fall short of meeting our goals, and many more languages will die with no documentary record at all.

BOLD protocols have been under development since 2007, when Simons formed a team at SIL International to investigate the feasibility of oral language documentation. Later that year, D. Will Reiman started field-testing the methodology in the Kasanga [ccj] language community during a short trip to Guinea-Bissau, and followed up in 2008 with further testing in Kenya. Simons built on those field tests to develop a BOLD language documentation course, which has been offered annually since the January 2009 term of the Graduate Institute of Applied Linguistics (GIAL).

Simons informally shared his conceptualization of BOLD with Steven Bird in 2009, who soon afterward began independent parallel development with paralinguists in Papua New Guinea as described on the BOLD:PNG website (http://www.boldpng.info/; see section 4 for a comparison of the two).

3. BOLD DEFINED. In order to understand the survey results discussed later in this paper regarding how BOLD might be modified to make it more effective, one must first understand the original BOLD proposal. In this section I outline the original conception of BOLD and what a minimal BOLD corpus is proposed to be.

The three defining characteristics of BOLD (Simons 2008) are:

1) Basic: BOLD is intended to generate a basic corpus that can serve either as a foundation for further work or act as a time capsule in the event that no more extensive work is ever done. This is the minimum corpus which should be produced for every language needing one. Later processing of BOLD corpora can lead to the production of descriptive works like dictionaries and grammars, or language conservationist work like language revitalization materials.

2) Oral: As Himmelmann (1998) so succinctly defines it, “Language Documentation is concerned with compiling, commenting on, and archiving language documents.” There are three kinds of commenting involved in BOLD. In the first, a second speaker of the target language listens to an original recording of a speech event and produces a phrase-by-phrase, careful re-speaking in the vernacular. This is referred to as oral transcription. The second kind of oral commentary is an oral translation of the phrase-by-phrase careful-speech recording into a language of wider communication (LWC). These are described by Reiman (2010). A third kind of commentary involves an oral discussion, either in the vernacular or in the LWC, during which speakers discuss linguistic, cultural or contextual issues that arise. This, in turn, may require oral transcription or translation.

3) Breadth first: BOLD is a “breadth first” strategy, in contrast to the more customary “depth first” fieldwork approach. The oral recordings mean that written transcription and analysis are postponed, allowing for a greater quantity and variety of speech acts to be logged as part of the BOLD corpus, as recommended by Himmelmann (1998). In addition to a wide range of naturally occurring communicative event types, the corpus also includes elicited wordlists and annotated discussions. A BOLD corpus can be documented in two months or less, a fraction of the time that a “depth first” approach takes.

3.1 PARALINGUISTS. According to Simons (personal communication), one advantage of BOLD is that documentation can be done by paralinguists, whom he compares to paramedics, in that they serve as emergency linguistic technicians until more professional attention can be provided. The minimal qualifications of a paralinguist would be an understanding of the language documentation agenda, training in BOLD techniques, and sufficient awareness to collect a full range of linguistic genres. Participation by paralinguists is possible because commenting is done orally; they are therefore able to participate in collecting data, whether they are members of the language community or outside researchers. Paralinguists, then, could be anyone with these minimal qualifications, progressing in a rather fluid continuum in which other factors would play a role, up to individuals with some degree of training in linguistics. Obviously, the more linguistically sophisticated a documenter is, the more possible variations there are in augmenting the purely oral corpus, which is the heart of BOLD.

Photo 1: Teke-tyee oral transcription


BOLD’s three distinctive characteristics (basic, oral, and breadth first) have the advantage that they allow for flexibility in time, personnel, and goals. A documenter may be able to spend more or less time in the language area, depending on circumstances, funding, and other criteria, but can still gather a useful corpus. In addition, the documenter may be a paralinguist, a community member, a linguist, or a multi-faceted team. A linguist documenter or advanced paralinguist can augment the basic goals of a BOLD corpus to gather additional data with a focus on research questions and include a larger written component, while a community member may augment the corpus with additional documentation of personal histories of the elders or a culturally significant dance.

3.2 BASIC CORPUS. As pointed out above, since BOLD is only intended to provide a “basic” corpus, it logically follows that the corpus could later be used to produce descriptive products such as grammar and phonology sketches. In that sense, the BOLD corpus is a foundation on which to build in the near or more distant future, and therefore addresses Woodbury’s (2003) third agreed-on value for documentation: that it be ongoing. Simons (2008) lists the components of such a BOLD corpus as (a) an introduction to the language and the corpus as a whole, (b) a table of contents, (c) a set of fully commented, digitally recorded audio (and some video) items or events, (d) recordings of discussions of cultural or contextual issues, and (e) elicited lists. There are minimally five parts for each item in (c) and any of (d) spoken in the vernacular:

• the original recording: at normal speaking speeds;
• informed consent: normally given orally and recorded with audio and/or video; can be written;
• situational and technical metadata: about the people, equipment, and environment;
• the oral transcription: a second speaker listens and repeats the recorded event slowly, phrase by phrase;
• the oral translation: the oral transcription is put into an LWC using the slow, phrase-by-phrase procedure.

The wordlists in (e) record slow speech and an oral translation on the same recording, as described in section 3.2.2. Otherwise, procedures are the same as for the textual events.
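The five minimal parts of a recorded item can be pictured as a simple record with a completeness check. The following is only a sketch: the class name `BoldItem`, the field names, and the file names are illustrative assumptions, not part of the BOLD protocols as published.

```python
from dataclasses import dataclass
from typing import List, Optional

# The five minimal parts listed above for each recorded item or event.
# All names here are hypothetical labels chosen for this illustration.
REQUIRED_PARTS = (
    "original_recording",
    "informed_consent",
    "metadata",
    "oral_transcription",
    "oral_translation",
)


@dataclass
class BoldItem:
    item_id: str
    original_recording: Optional[str] = None  # recorded at normal speaking speed
    informed_consent: Optional[str] = None    # oral (audio/video) or written
    metadata: Optional[str] = None            # people, equipment, environment
    oral_transcription: Optional[str] = None  # careful phrase-by-phrase respeaking
    oral_translation: Optional[str] = None    # phrase-by-phrase rendering in the LWC

    def missing_parts(self) -> List[str]:
        """Return the names of the required parts not yet supplied."""
        return [name for name in REQUIRED_PARTS if not getattr(self, name)]


# Hypothetical item for which only two of the five parts exist so far.
item = BoldItem(
    item_id="story-001",
    original_recording="story-001.wav",
    informed_consent="story-001-consent.wav",
)
print(item.missing_parts())
```

A check of this kind could help a documenter see at a glance which items still need oral transcription or translation before leaving the field.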

3.2.1 TEXTS. One of the questions one might ask regarding BOLD is how much data should be collected for a “basic” corpus and at what rate one can expect to collect it. Simons (2008) proposes that at an average speaking speed of 167 words per minute3, the documenter would record 10,000 words per hour of running text. That means that a 100,000-word corpus can theoretically be obtained in about 10 hours of original recordings. Simons further estimates that for each hour of the original corpus, 12 hours of processing are needed: three hours to collect that one hour of running text, three hours for oral transcription of the one-hour original, five hours for oral translation of the oral transcription, and one hour for corpus management and metadata tasks. If we accept these figures, then it means that in these hypothetical, ideal circumstances one could collect and process a half hour of original data in a six-hour day. Thus it would take 20 days to build a 10-hour corpus of running text. Or put another way, in one month of working five days per week and six hours per day, one researcher working alone could collect and process a corpus of approximately 100,000 words of running text, the major component of what Simons sees as a basic corpus.4 These texts then form the basis for syntactic analysis.

3 This assumes a speech rate of 100 to 200 words per minute, 167 being within that range. It should be noted, though, that words per minute would vary widely depending on the length of words in a given language. These merely provide estimates for planning purposes.
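The planning estimates for the text component in section 3.2.1 can be retraced with a few lines of arithmetic. This is a minimal sketch using only the figures attributed to Simons (2008) above; the function and variable names are assumptions made for illustration, and the result is a planning estimate under ideal circumstances, not a prediction.

```python
# Processing hours needed per hour of original recording, as listed in the text.
PROCESSING_HOURS = {
    "collection": 3,
    "oral transcription": 3,
    "oral translation": 5,
    "corpus management and metadata": 1,
}


def text_corpus_plan(target_words=100_000, words_per_minute=167,
                     hours_per_day=6, days_per_week=5):
    """Estimate the field time for the running-text component of the corpus."""
    words_per_hour = words_per_minute * 60                  # 10,020, about 10,000
    recorded_hours = round(target_words / words_per_hour)   # about 10 hours
    hours_per_recorded_hour = sum(PROCESSING_HOURS.values())  # 12
    total_hours = recorded_hours * hours_per_recorded_hour
    work_days = total_hours / hours_per_day
    return {
        "recorded_hours": recorded_hours,
        "hours_per_recorded_hour": hours_per_recorded_hour,
        "total_hours": total_hours,
        "work_days": work_days,
        "work_weeks": work_days / days_per_week,
    }


plan = text_corpus_plan()
for key, value in plan.items():
    print(f"{key}: {value}")
```

With the default figures the sketch gives 10 recorded hours, 120 working hours, and 20 six-hour days, that is, four five-day weeks. Changing `words_per_minute` shows how sensitive the plan is to the speech rate of a given language.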

Even though the texts are elicited primarily for syntactic analysis, processing the texts using BOLD procedures can have advantages in phonological analysis, as well. The careful speech can shed light on the phonological representations. It also provides clearer word boundaries than the text at natural speech speeds. Therefore, a linguist interested in a particular phone or set of phones could theoretically locate them by listening to the normal speech recordings and then compare these to pronunciations in the careful speech “transcriptions,” even before any of the texts have received written transcription. Such a comparison of normal and careful speech recordings may then help the linguist form or test hypotheses regarding phonological or morphophonemic analysis. A further advantage to having careful speech recordings alongside those of normal speech is that it is far easier to make a written transcription of careful speech than it is to transcribe the speech stream at normal speeds, especially if it is a language that the linguist-transcriber does not know well. In fact, this is recommended by Bouquiaux & Thomas (1992).

3.2.2 WORDLISTS. Ten hours of running textual material meets the need for syntactic data, but does not provide the data most relevant for phonetic and phonological analysis. To make the orally annotated texts useful and to provide data for phonetic and phonological analysis, a wordlist component of 1,000–2,000 elicited items is also necessary, and it is the second data type in a BOLD core corpus. Such a wordlist would be made up of (a) standardized wordlists,5 (b) semantic sets like color terms, numbers, flora and fauna, and (c) grammatical paradigms (Simons 2008). The paradigms can serve as a source for identifying variation in allophones or allomorphs. Furthermore, assuming sufficient linguistic sophistication of the documenter, sentence elicitation in grammatical frames can provide additional paradigmatic data.

The elicitation procedures for wordlists differ, depending on whether they are elicited by a paralinguist or a linguist. If they are collected by a paralinguist, the audio and video recording can be done in one pass, at the same time as elicitation, since the methodology postpones written transcription until the material is put into the hands of a linguist. This is an application of BOLD as it was originally conceived, to address the goal of preserving endangered language data in the absence of adequately advanced linguist facilitators.

4 Simons (2008) also indicates that historical reconstruction can be based on much less than 100,000 words, closer to many hundreds of lexical items, while good lexicography requires millions of words of running text.

5 Bouquiaux & Thomas (1992) give such a 2000-word starter lexicon. Linguists and anthropologists in various geographic regions, such as Africa, have developed lists relevant to the region, based on semantic domains. In other places, Swadesh lists have been modified to eliminate unknown concepts and to include those common to all cultures of the area. The documenter should investigate what lists are available for the relevant area of the world and select a list or lists in advance.

However, another BOLD application is one in which a linguist or advanced paralinguist is the documenter and supplies written transcription in the field. Here, a double-pass method is recommended as best practice. On the first pass, the linguist elicits words and makes a written transcription in IPA and/or the local orthography, if one exists. During this first pass, audio recording should be done to capture any extended discussion, the third kind of commenting. Following that, audio and video recording is done in the second pass, during which the language consultant performs the items which have been discussed previously with the linguist. This yields more condensed files for archiving purposes. Furthermore, for the double-pass method, the wordlist should be divided into 100-word chunks to be processed at one time, so that words are still relatively fresh in the consultant’s mind during audio and video recording. BOLD protocols set a reasonable goal at 100 words in the morning and 100 words in the afternoon. Thus, at a rather slow rate of 200 words per day, working five days per week, a documenter working alone could collect 2,000 words in two weeks. Dividing the list into 100-word chunks also accommodates working with multiple wordlist consultants. Another advantage is that in the event of an emergency necessitating early departure from the field, one would have 100-word portions of the wordlist goal fully processed.
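The chunking and scheduling described above can be sketched as follows. The function names and the placeholder wordlist are assumptions made for illustration; only the chunk size of 100 words, the two sessions per day, and the five-day week come from the text.

```python
def chunk_wordlist(items, chunk_size=100):
    """Split a wordlist into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def schedule_chunks(chunks, chunks_per_day=2):
    """Group chunks into working days (one morning and one afternoon session)."""
    return [chunks[i:i + chunks_per_day]
            for i in range(0, len(chunks), chunks_per_day)]


# Placeholder list standing in for a 2,000-item elicitation list.
wordlist = [f"item-{n}" for n in range(1, 2001)]

chunks = chunk_wordlist(wordlist)
days = schedule_chunks(chunks)
weeks = len(days) / 5  # five working days per week

print(len(chunks), "chunks,", len(days), "working days,", weeks, "weeks")
```

Because each chunk is processed completely before the next is started, an interrupted trip still leaves whole 100-word portions ready for archiving, which is the advantage the text points to.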

The wordlist recording procedure differs from the procedure for processing texts. First, it eliminates vernacular normal speech-speed recording by asking the language consultant to pronounce the elicited word three times in careful speech. It also eliminates the separate digital file for the oral translation by having the documenter (or a co-worker) record each wordlist item’s number and its LWC gloss immediately prior to the vernacular recitation. This incorporates the number and gloss as metadata on the original recording, with the result being one audio recording and one simultaneous video recording to capture all the relevant data during wordlist recording.
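The order of events on a single wordlist recording can be laid out as a cue sheet. This is a sketch only: the function name, the speaker labels, and the sample glosses are hypothetical, while the sequence (item number and LWC gloss, then three careful repetitions in the vernacular) follows the description above.

```python
def recording_cues(glosses, start_number=1, repetitions=3):
    """Build the spoken sequence for one wordlist recording session."""
    cues = []
    for number, gloss in enumerate(glosses, start=start_number):
        # The documenter (or a co-worker) records the item number and LWC gloss.
        cues.append(("documenter", f"{number}: {gloss}"))
        # The consultant then pronounces the vernacular word in careful speech.
        for repetition in range(1, repetitions + 1):
            cues.append(("consultant", f"vernacular word, repetition {repetition}"))
    return cues


# Hypothetical glosses for two items of a wordlist chunk.
for speaker, cue in recording_cues(["hand", "water"]):
    print(f"{speaker:>10}  {cue}")
```

Keeping the number and gloss on the same recording means that each audio file documents itself, so no separate translation file is needed for the wordlist.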

Video recording of the wordlists is an integral part of BOLD, because the recommended close-in video recording of the speaker’s mouth and neck provides contextual and phonetic data. In support of such video recording, Reiman has developed a reclined, ground-level, folding seat that can be produced in the field. It has a place for a medium-sized mirror to be glued onto the seat at mouth level, at an angle of about 40 to 45 degrees in relation to the speaker’s mouth. This allows simultaneous video recording of two profiles, a full front view and a side view of the speaker’s mouth, as shown in the accompanying photos. The two views provide richer visual articulatory data,6 and help compensate, in part, for the absence of a linguist when carried out by a paralinguist.

6 Plans for this collapsible, plywood, folding seat with mirror are available from Reiman at [email protected].


Photo 2: Mirrored chair
Photo 3: Two views of face
Photos courtesy of D. Will Reiman

Some of these wordlist procedures are similar to normal fieldwork practices. Where they differ is in (a) the ability of a paralinguist to collect the data, (b) the two-pass and 100-word procedures for obtaining condensed digital files, and (c) the video recording using a mirror to obtain two angles.

Once the wordlist has received a written transcription by a linguist, whether during fieldwork or afterward, the transcribed wordlist serves not only as the input to phonetic and phonological analysis, but also as the initial lexicon of an undescribed language, and thereby provides a head start for understanding and analyzing texts. All the wordlist components (the written wordlist in the LWC, the oral vernacular wordlist, and eventually a written vernacular wordlist) can be used to augment the originally archived corpus.

3.3 A TWO-MONTH BOLD FIELD PROJECT. The BOLD corpus, with its time and personnel parameters as described above, is summarized in Table 1 below. Adjustments can be made in any one parameter by increasing or decreasing others. For example, to decrease the amount of field time needed, one must increase personnel, increase the amount of data collected daily, collect for more total days, or collect less data. Feedback from the survey respondents makes reference to the expectations in the table. Furthermore, to prepare the data for archiving after returning from the field, an additional month is required. Analysis would be subsequent to these three months required for fieldwork plus archiving, and need not be done by the original documenter.

personnel: one fieldworker
transitions: two weeks, one at each end of trip
10 hours text data: one month of six-hour days, five days per week
1,000–2,000-word list: two weeks at 200 words per day, five days per week

Table 1. Initial logistics for a core BOLD corpus project
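The trade-off among the parameters in Table 1 can be illustrated with a rough calculator. This is a sketch under explicit assumptions that are not in the source: one month is counted as four weeks, collection time is assumed to divide evenly among fieldworkers, and transition time is assumed not to shrink with a larger team. The function name is hypothetical.

```python
def field_weeks(transition_weeks=2, text_weeks=4, wordlist_weeks=2, fieldworkers=1):
    """Rough field time in weeks for a core BOLD corpus project.

    Assumption (not from the source): collection work is shared evenly among
    fieldworkers, while transition time stays the same for any team size.
    """
    collection_weeks = (text_weeks + wordlist_weeks) / fieldworkers
    return transition_weeks + collection_weeks


ARCHIVING_WEEKS = 4  # the additional month for archiving, counted as four weeks

solo = field_weeks()
pair = field_weeks(fieldworkers=2)
print("one fieldworker:", solo, "weeks in the field,", solo + ARCHIVING_WEEKS, "weeks in total")
print("two fieldworkers:", pair, "weeks in the field")
```

For one fieldworker the figures in Table 1 add up to eight weeks in the field, which matches the two-month project of this section, and about three months once archiving is included. In practice the savings from a larger team would rarely be strictly linear.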


4. DISTINCT FROM BOLD:PNG. In considering the viability of BOLD for addressing the urgency of language documentation, it is important to distinguish BOLD as discussed here from BOLD:PNG, with which documentation specialists may already be familiar. The differences are significant. While both BOLD and the BOLD:PNG work by Steven Bird and associates use oral annotation as a primary strategy, the goals, procedures, and practices of these two groups differ in at least six ways. The BOLD:PNG website7 says its purpose is:

• to preserve audio recordings of the indigenous languages of Papua New Guinea for use by current and future generations of speakers, scholars, and teachers;
• to identify effective techniques for recording, transcribing, and translating oral texts, using inexpensive equipment and voluntary labour;
• to encourage speech communities to value their linguistic heritage, and to pass on their ancestral language to future generations.

The BOLD protocols in focus in this paper concur with the first and third of these purposes as being generally applicable to language documentation. But BOLD would have an additional goal of making the data collected publicly available to scholars, as well as to the language community. It also differs with regard to the second point in the area of equipment selection. Rather than using the inexpensive BOLD:PNG digital audio recorders which retail for under $50 USD, best practice for BOLD is higher-quality digital audio recorders in the $130–170 USD price range, with the addition of the digital video equipment necessary for the video component of the documentary corpus.

The second and third differences are in personnel and training. The BOLD:PNG students are native speaker paralinguists with 10–12 hours of training, as opposed to linguistics graduate students who complete 40 hours of instruction in a three-credit-hour course in language documentation which includes practice in BOLD. This brings up the fourth distinction between the two BOLD training sites, which is motivation and commitment. The BOLD:PNG students donated their labor as part of a course requirement, but may have no long-term interest in language documentation. On the other hand, linguistics students will often have a higher degree of commitment to language documentation and cross-cultural work in general. For example, one survey respondent who completed the BOLD course has become a documentation facilitator in the field.

Regarding archiving, the two BOLD sites are essentially equivalent. At this time, neither BOLD:PNG nor any of the BOLD survey respondents have made a corpus publicly available through an international archive that is publicly searchable and accessible on the World Wide Web. Some of the projects have backed up their work locally in the country where the languages are spoken, using external hard drives which then serve as a resource only for interested persons who can physically travel to the storage site.8

7 http://www.boldpng.info/

8 One current GIAL faculty member used BOLD techniques for a project before he joined the faculty and archived that corpus at PARADISEC. Prof. Wayne Dye archived nearly 400 items in the Bahinemo [bjh] language spoken in Papua New Guinea.


A fifth difference between the two BOLD procedures is that the metadata will be richer and more comprehensive for projects completed by linguists than for BOLD:PNG, partially as a factor of the personnel and training involved, which have already been discussed. Examples of the BOLD:PNG metadata collected are available on the website, and include date, location, identifier, operator, participants, the language’s ISO code, the genre, and the topic of the recording. Further metadata that could be included concern equipment, settings, recording situation, and personnel involved in recording.
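The two tiers of metadata named above can be pictured as one record with required and optional fields. This is a sketch only: the class name, field names, and sample values are illustrative assumptions and do not reproduce the BOLD:PNG schema. The one check included relies on the fact that ISO 639-3 codes, such as [ccj] for Kasanga, consist of three lowercase letters.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecordingMetadata:
    # Fields the text lists as collected for BOLD:PNG recordings.
    date: str
    location: str
    identifier: str
    operator: str
    participants: List[str]
    iso_code: str
    genre: str
    topic: str
    # Further fields the text suggests could be included.
    equipment: Optional[str] = None
    settings: Optional[str] = None
    recording_situation: Optional[str] = None
    personnel: List[str] = field(default_factory=list)

    def __post_init__(self):
        # ISO 639-3 language codes are three lowercase letters.
        if not re.fullmatch(r"[a-z]{3}", self.iso_code):
            raise ValueError(f"not an ISO 639-3 code: {self.iso_code!r}")


# Hypothetical record; every value below is invented for illustration.
record = RecordingMetadata(
    date="2008-01-15",
    location="(village name)",
    identifier="ccj-0001",
    operator="(operator name)",
    participants=["(speaker 1)", "(speaker 2)"],
    iso_code="ccj",
    genre="narrative",
    topic="(topic)",
)
print(record.identifier, record.iso_code, record.equipment)
```

Making the richer fields optional lets a paralinguist record the basic set in the field while leaving room for a linguist to add equipment and situational details later.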

Finally, a sixth distinction between the two BOLD program sites is that the BOLD:PNG students were sent out individually, admittedly to their own villages, while in our course we encourage BOLD documenters to form and work in teams of expatriates and local personnel whenever possible, so that there is a greater possibility of having complementary skills represented. In fact, Williams-Ngumbu, one of the survey respondents, said in her reply to the survey:

I think it will be hard to find good language documentation specialists…. It seems like something basic, but you need to have both technical skills and language skills…. The Teke-tyee trip went really well. We had a really good team consisting of three expatriate linguists, one Congolese linguist, one language consultant, and the driver. We had a really good mix of skill sets to draw from. (Williams-Ngumbu, Teke-tyee survey response:3)

5. SURVEY OF SIX BOLD PROJECTS. The information in the sections above now makes it possible to discuss the results of the survey of six BOLD projects. It is premature and theoretically unsound to reject an untested methodology out of hand, and it remains to be determined whether claims made by BOLD relating to paralinguists, recording procedures, personnel, corpus size, time requirements, and the future usefulness of an oral-only or primarily oral corpus are valid. Therefore, I designed the survey I report on here to launch what I project to be an on-going evaluation of this promising strategy. Since none of the projects has yet archived their data, it is not yet possible to test whether oral-only corpora are sufficient (see section 5.4).

The survey collates responses from six individuals who gave or received BOLD training in a course or a workshop and have used it in the field during the past few years. Their responses are informative, and with their answers we can begin to address the feasibility of BOLD, using their negative experiences and positive innovations to refine it. Four projects were of a one-week exploratory duration, and the other two were of four and eight weeks’ duration, respectively. The locations of the projects are: one each in the Asia and Pacific regions, and four on the continent of Africa. For ease of comparison, details of the six projects are included in table form. In the discussion, I pinpoint only the details relevant to BOLD protocols.

Before moving to the survey discussion, it is important to be aware of the Extended Graded Intergenerational Disruption Scale (EGIDS), proposed by Lewis & Simons (2010). It integrates Fishman’s GIDS (1991) with the UNESCO six-level scale of endangerment (UNESCO 2003) and the Ethnologue’s (Lewis 2009) five categories, culminating in a 13-level scale. I asked the BOLD survey respondents to evaluate language vitality by assigning an EGIDS level to the language(s) they worked on. The numerical EGIDS level and its short prose description are given in each of the survey project summaries below.

5.1 LARIKE-WAKASIHU [alo], INDONESIA. Eldwin Truong, who holds an MA in Applied Linguistics, participated in a BOLD field workshop taught by Will Reiman, who was himself assisted by an individual who had recently completed the BOLD course. For the BOLD documentary effort and in the workshop, Truong collaborated with Victorio Litaay, an Indonesian with a BA in literature. Their goal was to practice the BOLD procedures they had recently learned, and to start to create a documentary corpus for the language community.

language    Larike-Wakasihu [alo], pop. 12,600 (1987)
contact     Indonesian [ind] (national), Ambonese Malay [abs] (regional)
EGIDS       Level 7 “shifting”; younger generation can’t speak the language well, prefer regional or national language
goals       practice BOLD, create record of [alo] for language community, including future generations
duration    1 week
personnel   Eldwin Lai Truong, MA Applied Linguistics, and Victorio Litaay, BA Literature
wordlist    358 words
texts       1.5 hours texts, including first-person narratives, procedural, discussions, greetings

Table 2. Larike-Wakasihu [alo], Indonesia

The language they worked in was Larike-Wakasihu [alo] of Indonesia, to which they assigned an EGIDS level of 7 “shifting,” because the younger generation does not speak the language well. Their effort involved eight days of video and audio recording, oral annotation, and project administration. During that time they collected 358 words and standard greetings, as well as two hours of first-person narrative, procedural, and discussion texts. Their oral informed consents were audio recorded. They plan to archive in Indonesia and also with SIL International.

Truong reports that working with Litaay made him aware that those presumably doing the oral translations into Indonesian as the LWC actually might not be making a clear distinction between standard Indonesian [ind] and the regional variety Ambonese Malay [abs]. Truong also indicated that working six-hour days, five days per week (Simons 2008) gave the team too little time for interactions with the community during this short trip.

5.2 KOLUWAWA [klx], PAPUA NEW GUINEA. Jeremiah Aviel completed the BOLD course at GIAL and used independent study credit to pursue BOLD fieldwork in Papua New Guinea. The language he documented is Koluwawa [klx], with a population of 900–1,000 speakers. Aviel assigns this language an EGIDS level of 6a “vigorous” due to the strong oral traditions which are maintained by the community. Even though the language is written (EGIDS level 5) and is used through grade two for educational purposes (EGIDS level 4), there is little interest in or use of the written language in the community.

language    Koluwawa [klx], pop. 900–1,000
contact     Iamalele [yml], Bwaidoka [bwd], Tok Pisin [tpi], English [eng]
EGIDS       Level 6a “vigorous”; used through grade 2, then English; language is written, but lack of interest; rich oral traditions
goals       test BOLD in the field; fulfill project portion of language documentation independent study; obtain recordings to preserve heritage
duration    8 weeks: Jan.–Mar. 2010
personnel   Jeremiah Aviel, MA student
wordlist    560-word list with example sentences
texts       12.5 hours texts, including folk tale, first-person narrative, procedural, indigenous song forms

Table 3. Koluwawa [klx], Papua New Guinea

Aviel spent eight weeks in the area, divided into three weeks of making primary recordings, four weeks of oral transcription and translation, and one week of transition time. He also obtained individual oral audio or video recordings of informed consent at the beginning of the recording session, as well as written consent from community elders.

The primary recordings he collected consist of wordlist elicitations, plus a total of eight and one-half hours of textual recordings composed of narrative, procedural, and songs—including indigenous forms. He plans to archive when the metadata is complete.

When asked about recommendations regarding what to do next time, Aviel said, “Bring more batteries, plan for more computer use, live in the community,... develop a better plan with a more specific outcome and then elicit for that specifically if that was the goal.”

He also expressed unease about not marking word breaks or having word-for-word glosses in the texts, but has not yet explored the adequacy of the oral phrase-by-phrase “transcription” and “translation” in slow speech to determine to what degree his unease might be justified.

Photo 4: Koluwawa drumming. Photo courtesy of Kendra Stauffer.

5.3 TEKE-TYEE [tyx], REPUBLIC OF CONGO. Angela Williams-Ngumbu completed her MA at GIAL and took the language documentation course there. She made digital recordings of Teke-tyee [tyx], a language spoken by 14,400 people in the Republic of Congo. Many Teke-tyee speakers are multilingual, also speaking French and Munukutuba [mkw], although there are some villages which are mostly monolingual. Williams-Ngumbu assigns the EGIDS level 6a “vigorous” to Teke-tyee.

language    Teke-tyee [tyx], pop. 14,400
contact     multilingual; Munukutuba [mkw]
EGIDS       Level 6a “vigorous”; there is some development work currently being done in the language
goals       preliminary documentation required by government and data for orthography design
duration    1 week during August 2010
personnel   Angela Williams-Ngumbu, MA, plus two further expatriate linguists; Congolese linguist; language speaker; driver
wordlist    150-word list
texts       9 hours texts, including narrative, history, procedural/descriptive, interview/conversation, religious/religious music

Table 4. Teke-tyee [tyx], Republic of Congo

The documentation effort in Teke-tyee involved a total of two and one-half weeks. Williams-Ngumbu worked with a Congolese linguist and two other expatriate linguists, a language facilitator, and their driver. Their work was endorsed by a local teacher and pastor. The informed consent was normally an audio-recorded oral consent.

During the fieldwork period, the team collected a wordlist, plus approximately nine hours of textual data, including narrative, history, procedural, descriptive, interviews, conversation, and religious texts. The data is backed up on an external hard drive, and is not yet ready for archiving.

An innovation by the Teke-tyee team was the use of two separate elicitation and processing sites, which was possible because they had adequate equipment, personnel, and recording locations. The school inspector made both his house and his office available for this work, which allowed the team to finish more of the oral commenting during their village visit than would have otherwise been possible.

Williams-Ngumbu also commented on the advantages of having national and local co-workers as part of the team. She felt that this eased their acceptance into the community and enhanced the quality of their elicitations.

5.4 LAARI [ldi], REPUBLIC OF CONGO AND PARTS OF ANGOLA. Carla Unseth investigated Laari [ldi] for her MA thesis, with the research question of whether it is possible to write a phonology sketch based solely on a BOLD corpus (1,000–2,000 wordlist elicitations, plus 10 hours of recordings of running text), since this is the core proposed by BOLD. If successful, her phonology sketch will be used to revise the Laari orthography, which is currently found unreadable. Laari has over 90,000 speakers in the Republic of Congo and in parts of Angola, most of whom are multilingual in French and other languages of the region, such as Kituba [ktu] and Lingala [lin]. Unseth assigns Laari an EGIDS level 6a “vigorous,” because not only is it the first language of children in the villages around Brazzaville, but it is also learned by children whose parents are not Laari speakers. In Brazzaville, children of Laari-speaking parents learn both French and Laari.

language    Laari [ldi], pop. 90,600+
contact     multilingual; French [fra]
EGIDS       Level 6a “vigorous”; first language of village children; learned with French in Brazzaville; children whose parents speak other first languages learn it, too
goals       test whether a BOLD oral-only corpus is sufficient for writing a phonology and then revising an orthography
duration    4 weeks during September 2010; 2 trips of 1 day to villages, plus working with speakers at a regional center
personnel   Carla Unseth, MA student; A. Williams-Ngumbu, MA, facilitator, documented [tyx] above
wordlist    1,700 words
texts       9.3 hours, including 27 narrative, 4 proverbs, 3 descriptive, 3 procedure, 4 conversation, 1 riddle, 2 other

Table 5. Laari [ldi], Republic of Congo & parts of Angola

Unseth’s research was facilitated by Williams-Ngumbu (section 5.3), who acted as her cultural and technical consultant. Unseth was required by GIAL’s Human Subjects Research Committee to obtain written consent. This was done for literate speakers, with recorded oral consent from illiterate speakers.

Based on Williams-Ngumbu’s previous experience, Unseth and Williams-Ngumbu set up two data elicitation and processing sites. This allowed them to obtain a wordlist of 1,700 words, composed of wordlist items, paradigms, and minimal pairs as described in section 3.2.2; seven hours of textual recordings, including narrative, descriptive, procedural, proverbs, conversation, and a riddle; and a three-hour recording of a church service during the time Unseth spent in the field. The wordlist numbers are within the target of 1,000–2,000 words for a BOLD core corpus. It is unclear, though, whether the three-hour church service recording went through BOLD processing or not. If not, then the project fell short of ten hours of running text and was also shorter than the suggested eight-week project. For Unseth’s research question, though, the wordlists should prove to be adequate since they are the main data set for phonological analysis; her work to determine that is on-going.
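The comparison against the BOLD core targets in the paragraph above can be made explicit with a small check. This is a minimal sketch: the function and variable names are assumptions for illustration, and the figures are those reported in the paragraph (1,700 wordlist items, seven hours of elicited text, and a three-hour church service whose processing status is unclear).

```python
# BOLD core corpus targets as stated in the text.
CORE_MIN_WORDS = 1000
CORE_TEXT_HOURS = 10.0


def meets_core_targets(wordlist_items: int, processed_text_hours: float) -> dict:
    """Report whether each core target is met (hypothetical helper)."""
    return {
        "wordlist": wordlist_items >= CORE_MIN_WORDS,
        "text": processed_text_hours >= CORE_TEXT_HOURS,
    }


laari_words = 1700
elicited_text_hours = 7.0
church_service_hours = 3.0  # unclear whether this went through BOLD processing

# Scenario 1: the church service recording was orally annotated.
with_service = meets_core_targets(
    laari_words, elicited_text_hours + church_service_hours)

# Scenario 2: only the elicited texts were orally annotated.
without_service = meets_core_targets(laari_words, elicited_text_hours)

print(with_service)
print(without_service)
```

The wordlist target is met in both scenarios; only the running-text target depends on the status of the church service recording.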

One of the more difficult aspects of BOLD is training target-language speakers to listen and break up the original recording into phrases at the same time as they are manipulating recorder buttons for listening to the original and recording the oral transcription or translation. Reiman (2010:218–219 and personal communication) has found that he often has to let several people attempt the training before identifying someone who can do this well. Then, even when such an individual is identified, they often need multiple practice runs before they are comfortable and competent to carry out the task.

Therefore, it is not surprising that Unseth found that those doing the oral transcription and translation for Laari did not successfully carry out the instructions they were given. Based on their previous experience, they paraphrased and summarized entire paragraph-length passages like interpreters might do, rather than giving phrase-by-phrase repetitions of what the original speaker said in slow, careful speech. There were multiple sentences of the form, “He said that...,” which changes the person category of verbs, if nothing else.

So there were two errors in methodology: first, a failure to monitor the training process to ensure that a trained transcriptionist had correctly learned the methodology, and second, a failure to review results as they were being achieved in order to ensure that all was going well. This means that training is a critical component of BOLD. Because the work being produced was not followed up and checked, a portion of Unseth’s textual data is of questionable worth for syntactic analysis. To address this, she suggests that having a native speaker of the target language on the documentation team would help identify such problems so that they can be corrected. Such a team member could also help facilitate variety in the texts which are collected, which addresses another problem Unseth encountered when multiple people in a group told the same story because each thought he could improve on the previous person’s version.

Laari is spoken in the area around Brazzaville, so Unseth and Williams-Ngumbu were based in Brazzaville, but made trips to two area villages. Unseth found that the oral translators she worked with in Brazzaville were no longer operating in that language on a daily basis, and therefore had difficulty recalling some vocabulary. This is similar to the report by Bird et al. (2011) of students who participated in the BOLD:PNG projects, who brought data back to their universities, thinking that they would do the LWC translations there, and only then realized that they did not know all the words. This means it is important to do the oral transcription and oral translation of texts in the villages where the language is in common use, or to be sure that a capable speaker is available if the work is done outside the language center.

In the syntactic elicitations, Unseth also recommends eliciting not just paradigms, but also a set of paradigmatic sentences, which the language consultant glosses word for word, thereby achieving the word-for-word data deemed essential by Himmelmann (2009). Just a few such sentences can give a significant boost to morpho-syntactic and textual analysis.

5.5 LOMWE [lon], MALAWI. The last two BOLD projects were facilitated by Reiman, who has co-taught the GIAL language documentation course three times. During the summer of 2010, he led a team of undergraduate students who learned and executed the BOLD recording techniques. The first of these was for the Lomwe [lon] language of Malawi, which has an alternate name, Emihavane. There are 250,000 speakers of Lomwe. Reiman assigns it an EGIDS level 6a “vigorous,” because children are still routinely learning the language. However, this level is precarious since it is a smaller language which has extensive contact with other languages, including Chichewa and English. Furthermore, the orthography is not settled and there is no vernacular education.

language    Lomwe [lon], pop. 250,000; Emihavane is alternate name
contact     English [eng] and Nyanja [nya], more commonly called Chichewa, are official languages
EGIDS       Level 6a “vigorous”; extensive contact with other languages; orthography not settled; not taught in school
goals       word list intensive; evaluation for orthography; expose students to BOLD; hands-on experience
duration    1 week during June 2010
personnel   D. Will Reiman, MA plus 2–4 US undergrads, under 22
wordlist    1,700–2,000 words
texts       9 hours texts, including narrative, history, procedural/descriptive, interview/conversation, religious/religious music

Table 6. Lomwe [lon], Malawi

The goals of the project were to obtain the data needed for evaluation of orthography decisions, and at the same time to expose the students to BOLD and give them on-site training and experience using it. Given the first part of the goal statement and their time limit of one week, the corpus they collected was necessarily wordlist-intensive. The two to four students involved were not linguistics majors, and they were all under the age of 22.

The language consultants crossed the border from Malawi to Mozambique with an expatriate advisor in order to work with Reiman and his team. Oral consents were recorded with both audio and video. The team collected between 1,700 and 2,000 words. There are plans to archive the data with SIL.

5.6 MANYAWA [mny], MOZAMBIQUE. Reiman’s second project reported on here was in the Manyawa [mny] language of Mozambique. It has 173,000 speakers and is in contact with Portuguese, which is the official language of the country. The EGIDS level for Manyawa is 5 “written,” because it is still learned as a first language by children, and there is an orthography in use. The language is not used in the educational domain.

language    Manyawa [mny], pop. 173,000
contact     Portuguese [por] is the official language
EGIDS       Level 5 “written”; children learn it as first language; has orthography; not taught in school
goals       text collection; compare to Takwane [tke]; expose students to BOLD; hands-on experience
duration    1 week during June 2010
personnel   D. Will Reiman, MA plus 2–4 US undergrads, under 22
wordlist    600-word list
texts       14 narratives, 10 riddles/proverbs, 4 fables and genre terms: chaluate ‘riddle’; musibe ‘proverbs’; itale ‘fables’

Table 7. Manyawa [mny], Mozambique

The goals for the two students working with Reiman for one week on this project carried over from the previous project. The linguistic goal, however, was different: to obtain data for syntactic comparison with Takwane [tke]. This led them to focus on text collection. Informed consents were handled the same as in the other project.

The team ended up with a documentation corpus containing a wordlist of 600 words, 14 narratives, three histories, four fables, six riddles, and four proverbs. Of these, the three histories, one fable, and one proverb received oral annotation. The team had a co-worker who spoke Lolo [llb], a related language, and his knowledge contributed to the richness of genres elicited. He was able to describe several genres from his own language, which then elicited examples of parallel genres from Manyawa speakers, including the vernacular vocabulary for the genres—riddle, proverb, and fable.

6. IMPLICATIONS OF SURVEY FEEDBACK FOR BEST PRACTICES IN BOLD. In addition to a description of each project, I also asked each researcher for comments on pros and cons of their BOLD experiences, sharing of innovations they made, and suggestions for improvement. The result was 26 unique comments, which I divided into five categories: (a) pros inherent in BOLD, (b) pros shared by any documentation effort, (c) cons shared by any documentation effort, (d) cons inherent in BOLD, and (e) innovations to BOLD during fieldwork. These are tabulated in Table 8, with a column for comment reference numbers, another for the categorized comments, and a third in which I suggest ways to address the negatives and to incorporate the innovations.

| # | SURVEY COMMENTS | MODIFICATIONS TO MAKE |
|---|---|---|
| | Pros inherent in BOLD | |
| 1 | Less time needed | Increase number of BOLD projects to increase documentation speed |
| 2 | Less training needed | |
| 3 | Less funding needed | |
| 4 | Team approach | |
| | Pros reported shared by any documentation effort | |
| 5 | Greater community participation means community pride is bolstered | Students could lead one doc project and later facilitate another |
| 6 | Documenter skills improve with each field documentation project | Require a (BOLD) documentation project for all MA or PhD linguistics degrees, documenting a language with no or minimal documentation. Linguists could facilitate several doc projects in region of expertise |
| 7 | Possible to get more accurate data in village | More batteries and generator or solar for more community time |
| 8 | Make written transcription of wordlist as it is recorded | Written transcription of wordlists, already best practice in BOLD by linguists |
| | Cons reported shared by any documentation effort | |
| 9 | Documenter needs many kinds of skills | Requires training in weak areas |
| 10 | [Same as #9] | People on team with complementary abilities |
| 11 | Documenter got same story multiple times | Target language or related language speaker on the doc team |
| 12 | Wordlist consultant did not have target language as first or primary language plus Bird (2011) | Confirm there are native speakers to do careful speech and LWC |
| 13 | Headset microphone for consultant was uncomfortable and consultant moved sound-proofing mattress away from window to get a breeze | Be mindful of language consultant physical comfort |
| 14 | Audio-only recording may be insufficient for future uses of corpus | Increase use of video, as adjunct to audio recordings |
| 15 | Quantity and scope of data may be inadequate for future use | Determine purpose prior to fieldwork and plan elicitation accordingly |
| 16 | Spent too much time changing batteries when grid power was available | Use grid power if available |
| 17 | Confusion of two lingua francas | Have a national linguist on the team, if available |
| 18 | Inadequate preparation for how culture affects fieldwork relations | Increase cultural awareness component in pre-field training |
| 19 | Lack of archiving follow-through | Make archiving a part of grant, course, graduation requirements |
| 20 | Used different equipment in the field than in training | Know your equipment before starting to document with it |
| | Cons possibly inherent in BOLD methodology | |
| 21 | Too intense; time pressure to acquire 10-hour corpus in four weeks | Allow for a minimum of six weeks’ fieldwork for 100,000-word corpus, eight to ten weeks preferred |
| 22 | Phrasal translation may be inadequate for future analysis | Add word- or morpheme-layer oral gloss of a portion of data: how much? Get literate people to write out elicited material. Already best practice to elicit sentences and paradigms |
| 23 | People providing careful speech did not follow procedures as taught | Follow up on training of consultants to be sure tasks are done well |
| | Innovations to BOLD made during fieldwork | |
| 24 | Set up two BOLD collection and annotation sites | Consider equipment modifications for two data collection sites |
| 25 | Use Audacity or other software to pre-segment phrases | Use Audacity to pre-segment originals prior to careful speech recordings |
| 26 | Cultural interpreter facilitated project logistics | Go accompanied by community insiders |

Table 8. Best practices for BOLD

From the comments, four significant generalizations could be made. The first of these is that having someone on the team with cultural awareness or experience significantly increases the effectiveness of the research, whether that person is from the target language group, a related language group, a national linguist, or an expatriate fieldworker with previous experience in the area. For example, as noted above, the Manyawa team was able to elicit the names of several genres through the assistance of a speaker of a related language. While including an insider may seem axiomatic, in the context of the possibility of having paralinguistic technicians collect endangered language data, the added benefit of having an on-site cultural facilitator should not be underestimated. This also means that those who have field experience should be mentoring others in the field.

A related generalization is that the advantages of doing as much data gathering and processing on site as possible outweigh any advantages of going elsewhere for that stage. The survey respondents found that working in the village as much as possible, as opposed to working in a regional urban center, not only improved the quality of the data, but also had beneficial effects on the community and the researchers’ relations to the community. However, doing this can also mean compromising the quality of recordings for the oral transcription and translation phases.

A third generalization concerns a discomfort expressed by two of the six documenters prior to any attempt at analysis of their data. They were concerned with what they see as the low level of detail in the glossing available in a BOLD corpus. To address this for syntactic analysis, another layer could be added for a selected number of texts9 through oral word-by-word glossing. Similarly, the written component of BOLD corpora produced by linguists could be increased from only transcribing elicited lists to include written glossing of selected texts in the vernacular or an LWC, especially where the consultants are literate in some language.

Finally, three respondents also felt that the proposed BOLD data collection rate was too intense and/or unrealistic. They indicated that data collection five days per week and six hours per day left little time for general interactions with the community, and that there was a tension between meeting data collection goals and relating well to the community: namely, being goal-oriented versus event- and people-oriented.

Williams-Ngumbu expressed a similar concern. She has facilitated four BOLD projects to date: the two reported on here (Teke-tyee [tyx] and Laari [ldi]), as well as two others. Based on her experience, she questioned the likelihood of completing oral transcription and translation on 10 hours of running text within a four-week period. Therefore, the hypothetical amount of time BOLD documentation takes needs to be revised upward in light of this actual field experience. In Table 9, I record what I have calculated to be a reasonable upper time limit for processing an hour of textual material based on metadata from several of the projects and extended e-mail discussions with Williams-Ngumbu. This correlates with her experience that the digital recordings of careful speech are three times as long as the original, and that translation recordings are twice as long as the careful speech recordings.

In the table, I assume four speakers each speaking for 15 minutes to reach one hour of recording. This means that some tasks related to the original recording are multiplied by four. But in the oral transcription phase, I assume there is one individual who does the entire hour composed of four 15-minute recordings. Similarly, in the translation into an LWC, I assume only one consultant, but not the same person who made the oral transcription, so there are at least two individuals who need to be trained in the BOLD procedures as described by Reiman (2010).

9 Part of our development of best practices for BOLD would be to determine how many clauses from what kinds of texts might serve as a minimum to receive the extra layer of glossing, and then to test whether a grammar sketch could be written with that as the foundation for analysis of the texts that have not received the morpheme-by-morpheme treatment.

| Task | Simons (2008) | current estimates |
|---|---|---|
| collect original recording | | |
| * set up equipment x 1 | | 40 min |
| * greetings and settle language consultant(s) x 4 | | 10 min x 4 = 40 min |
| * informed consent discussion and recording x 4 | | 10 min x 4 = 40 min |
| * make 15-minute recording x 4 speakers | | 15 min x 4 = 60 min |
| time spent in first phase | est. 3 hr | 3 hr |
| orally transcribe | | |
| * set up equipment | | 10 min |
| * greetings and settle consultant | | 5 min |
| * informed consent discussion and recording | | 5 min |
| * training in BOLD careful speech techniques | | 20 min |
| * practice and retakes for mistakes | | 20 min |
| * recording totaling | | 3 hr |
| time spent in second phase | est. 3 hr | 4 hr |
| orally translate | | |
| * set up equipment | | 10 min |
| * greetings and settle consultant | | 5 min |
| * informed consent discussion and recording | | 5 min |
| * training in BOLD translation techniques | | 20 min |
| * practice and retakes for mistakes | | 30 min |
| * total recording | | 6 hr |
| time spent in third phase | est. 5 hr | 7 hr |
| corpus management tasks | | |
| * metadata sheets for each phase of each recording: original, transcription, translation, plus informed consent notes | | 45 min |
| * keyboard metadata into computer | | 45 min |
| * copy and back up files daily and be sure they copied | | 30 min |
| time spent in metadata and management | est. 1 hr | 2 hr |
| TOTAL | 12 hr | 16 hr, upper est. |

Table 9. Estimated and actual times for recording and processing one hour of speech
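The phase totals in Table 9 can be re-derived from the itemized entries. The sketch below is only a cross-check under the table's stated assumptions (four 15-minute speakers, careful speech three times the length of the original, translation twice the length of the careful speech); the variable names are illustrative. Summed exactly, the third phase comes to 7 hr 10 min and the whole to about 16.2 hours, which the table rounds to 7 hr and 16 hr.

```python
ORIGINAL_MIN = 15 * 4                  # four speakers, 15 minutes each
CAREFUL_SPEECH_MIN = 3 * ORIGINAL_MIN  # careful speech is 3x the original
TRANSLATION_MIN = 2 * CAREFUL_SPEECH_MIN  # translation is 2x the careful speech

# Itemized minutes per phase, in the order the tasks appear in Table 9.
PHASES = {
    "collect original recording": [40, 10 * 4, 10 * 4, ORIGINAL_MIN],
    "orally transcribe": [10, 5, 5, 20, 20, CAREFUL_SPEECH_MIN],
    "orally translate": [10, 5, 5, 20, 30, TRANSLATION_MIN],
    "corpus management tasks": [45, 45, 30],
}

phase_minutes = {name: sum(items) for name, items in PHASES.items()}
total_hours = sum(phase_minutes.values()) / 60

for name, minutes in phase_minutes.items():
    print(f"{name}: {minutes // 60} hr {minutes % 60} min")
print(f"total: {total_hours:.1f} hr")
```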

This estimate of 16 hours for recording, orally annotating, and responsibly administering a one-hour recording may be on the high end in light of the following. Obviously, once the team has trained one or two oral transcribers and the same number of oral translators, then the training phase and informed consents can be eliminated or shortened in subsequent annotation sessions. Similarly, as an inexperienced documenter gains experience in handling the day’s metadata and project administration, this, too, may be done more efficiently. Therefore, a more reasonable compromise estimate might be an average of 14 hours of processing time. This is higher than Simons’ 12 hours, but less than our outside estimate of 16 hours.
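Scaled to the core target of 10 hours of running text, the three per-hour rates discussed above give the following totals. This is a simple worked calculation; the dictionary labels and the function name are illustrative assumptions.

```python
# Hours of work needed per hour of original recording, from the discussion above.
HOURS_PER_RECORDED_HOUR = {
    "Simons (2008)": 12,
    "compromise estimate": 14,
    "upper estimate": 16,
}

CORE_TEXT_HOURS = 10  # core BOLD corpus target for running text


def corpus_processing_hours(text_hours: float, rate: float) -> float:
    """Total hours to record, annotate, and administer a text corpus."""
    return text_hours * rate


estimates = {
    label: corpus_processing_hours(CORE_TEXT_HOURS, rate)
    for label, rate in HOURS_PER_RECORDED_HOUR.items()
}

for label, hours in estimates.items():
    print(f"{label}: {hours} hours for a {CORE_TEXT_HOURS}-hour corpus")
```

The compromise rate of 14 hours per recorded hour therefore corresponds to 140 hours for the core corpus, against 120 hours at the Simons rate.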

6.1 BOLD REVISIONS ADDRESSING SURVEY RESULTS. These results have implications for the design of BOLD fieldwork projects. Assuming the same original goal of collecting a core BOLD corpus of 10 hours of running text and a wordlist of 1,000–2,000 items, other adjustments are necessary in order to address the four main findings discussed above. One combination of elements which encompasses these four results is provided in Table 10, which is a revision of the original proposal in Table 1.

| personnel | transitions | 10 hr text data (140 hr) | 1,000–2,000 word list | word-level glossing |
|---|---|---|---|---|
| 1 documenter | 1.5 weeks, 1 at start, 3–4 days at the end | 1 month at 6 hours/day, 4 days/week = 96 hrs | 2 weeks at 250 words/day, 4 or 5 days/week | 3–4 days |
| 1 paid assistant trained in the field | none; bilingual target language speaker | 6 hours/day, 4 days/week for 6 weeks = 144 hours | | |
| 1 TSC commenter | none | 6 hours/day, 2 days/week for 6 weeks = 72 hours | | |
| 1 TSL commenter | none | 6 hours/day, 2.5 days/week for 6 weeks = 90 hours | | |

Table 10. Revised logistics for a core 2-month BOLD corpus project

One way to include a cultural insider on the team would be to identify and recruit a target language speaker who is also bilingual in the LWC, but who lives most of the time in the area where the target language is spoken. Other ways of meeting this need are also possible, as illustrated by the variety found in the six projects comprising this study.

The second concern was related to the amount of time spent doing BOLD processing on site. This is not reflected in the table, but allowance for it needs to be made in other logistical planning. The third concern was the lack of word-level data, whether oral or written. An extra column was added to Table 10 to allow for an additional three or four days for obtaining word-level data, with these days subtracted from the two weeks of transition time in the original proposal. Other adjustments are also possible, such as working seven- or eight-hour days. The suggested target material for word-level processing was texts and example sentences illustrating paradigmatic material.

Since one of the advantages claimed for BOLD is the decreased duration of fieldwork as compared to more traditional elicitation, the proposal in Table 10 maintains the two-month limit on field time and makes adjustments elsewhere, both to increase processing times (as in Table 9) and to decrease the number of days worked per week in an effort to decrease the perceived intensity expressed by some of the respondents. To allow the documenter to process data only four days per week, and at the same time to process for 140 hours as opposed to 120, the same individual who serves as the cultural insider can also be trained as an equipment manager and handler, who also discusses informed consent. Additionally, a minimum of two further team members should be trained: one for oral transcription and one for oral translation.
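The hour figures in this revision can be reproduced with a small Python sketch. The schedule values come from Table 10 and the per-hour processing rates from the discussion of Table 9; the names are illustrative, and treating the documenter's "1 month" as four weeks is an assumption made so that the product matches the 96 hours stated in the table.

```python
TARGET_TEXT_HOURS = 10  # core corpus goal: 10 hours of running text


def processing_budget(hours_per_recorded_hour):
    """Total processing hours needed for the whole core text corpus."""
    return TARGET_TEXT_HOURS * hours_per_recorded_hour


def scheduled_hours(hours_per_day, days_per_week, weeks):
    """Hours a team member is scheduled to work over the given weeks."""
    return hours_per_day * days_per_week * weeks


# (hours/day, days/week, weeks) as listed in Table 10.
# Assumption: the documenter's "1 month" of text work is treated as 4 weeks.
SCHEDULE = {
    "documenter (text data)": (6, 4, 4),
    "paid assistant": (6, 4, 6),
    "TSC commenter": (6, 2, 6),
    "TSL commenter": (6, 2.5, 6),
}

team_hours = {role: scheduled_hours(*slots) for role, slots in SCHEDULE.items()}

print(f"budget at 12 hr per recorded hour: {processing_budget(12)} hr")
print(f"budget at 14 hr per recorded hour: {processing_budget(14)} hr")
for role, hours in team_hours.items():
    print(f"{role}: {hours:g} hr")
```

Under these figures, moving from 12 to 14 processing hours per recorded hour raises the budget for ten hours of text from 120 to 140 hours, which is the increase the revised proposal sets out to accommodate.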

6.2 FEASIBILITY OF CONTINUING TO USE BOLD. In the discussion above, all four findings regarding changes to best practices for BOLD (cultural insider assistance, on-site processing, increased gloss detail, and decreased intensity) have been addressed. Unseth is investigating whether she can write a phonological analysis of Laari from the BOLD corpus which she collected, using her IPA transcriptions of the wordlist. The outcome is uncertain, especially since some of her textual data was not properly annotated, as described above. That was a less than optimal application of a BOLD procedure, rather than an inherent shortcoming in how BOLD is conceptualized or designed. It is possible that the wordlist data will prove sufficient for her to determine whether a BOLD corpus is an adequate data source for writing a phonology sketch, but allophonic variation data is sparse and the final result remains to be seen.

It also remains to be proven whether or not a properly executed, purely oral documentation undertaken by a paralinguist can serve as a stand-alone documentation effort for future linguistic analysis. No one is denying that making changes in BOLD to address the four survey generalizations would improve the archived corpora and make their processing easier, both now and in the future. Surely written transcription of wordlists, morpheme-level oral glossing, written text glossing in an orthography, and decreased elicitation intensity should be incorporated to the degree that documenter and consultant skills and time allow.

Considering the survey results as a whole, then, there is nothing in the cons which indicates the BOLD strategy should be abandoned. In fact, given the pros cited for BOLD, a more focused study is warranted to specifically apply BOLD in controlled projects to rigorously test its claims. Such a process could lead to a basic oral documentation method whose corpus can function as a core and whose flexibility allows for modifications of parameters, depending on community and researcher goals and constraints.

7. CONCLUSION. But why is having such a tool important? My concern here is with undocumented languages and unbalanced allocation of resources. That is, there are languages dying with no documentary record, while others, which have seen concerted linguistic fieldwork, receive a disproportionate amount of funding. BOLD gives us the means to document an undocumented language effectively with minimal investment of human, temporal, and financial resources.

We should avoid the tendency to continue adding more and more data parameters to an increasingly unwieldy documentary wish list. Waiting until we can do it all and do it “right,” by whatever definition we use, will not result in corpora which capture the speech practices of communities around the world before they disappear. Neither is it effective to pour significant resources into languages which have essentially been documented. For the yet-to-be-documented languages, then, it would be productive to agree that something is better than nothing. Much more than eventual linguistic description is at stake. For example, a BOLD corpus could also later be applied to linguistic training, language learning, literacy and literature development, heritage preservation, and language revitalization (Simons 2008), the same concerns we all share.

In fact, BOLD could be among our most useful strategies for moving forward. But it is unlikely that the handful of individuals currently learning BOLD each year and then applying it are going to generate sufficient corpora for every language group still lacking a primary data corpus. Instead, two things need to happen in order for primary documentation rates to increase. First, others also need to begin producing oral-based corpora as proposed by BOLD. And second, funding resources need to be directed toward those languages in which little or no documentary fieldwork has been done.

Given the smaller time commitments involved in preparing a BOLD corpus as compared to a fully written corpus, one way to significantly increase world-wide BOLD documentation rates would be for linguistics departments to make gathering and archiving a mostly oral documentation corpus mandatory for graduate students, much like the study of a non-Indo-European language has been. This would assure that a minimum, but representative, amount of material is available for each language. The use of a BOLD approach, as defined and then refined above, means that a basic documentation corpus can be collected in two months. Furthermore, the results of the six projects in this study show that, to enhance the effectiveness of documentary fieldwork, an insider of some kind is necessary. One application of this would be for those with field experience to provide on-site facilitation of such strategic documentations.

Regarding focused funding of projects aimed at archiving primary data, we need to remind ourselves of Himmelmann’s (1998) discussion of the distinction between documentation and description, and then make a further distinction. He said:

My interest here pertains to the first activity, i.e. the collection, transcription and translation of primary data. This activity is called the documentary activity, its product is a language documentation, and the affiliated field is documentary linguistics.

If we accept this widely-quoted definition of documentary linguistics, then language development activities like “revitalization” or “digitization of legacy data” are, strictly speaking, outside the scope of documentary linguistics. Such activities could be termed “conservationist linguistics.” To put it another way, the primary data of the documentary corpus, as defined by Himmelmann, is the basis for both descriptive and conservationist linguistics. Language description and conservation activities do not adequately address the dire predictions of language death in Crystal (2000) and Krauss (1992). So, given the triage situation in which we find ourselves, it is critical to recognize this distinction between documentary and conservationist linguistics in order that resources may be prioritized for documentary linguistics, and thereby expedite the collection and archiving of new, primary data in undocumented and underdocumented endangered languages while it is still possible. Not doing this puts us in danger of having no language to describe and no source for language revitalization.

To conclude, documentation protocols need to be modified, lest documenters end up producing many digital recordings while failing to address the high rate of language death. Therefore, our strategy for the next decades should do three things: focus resources on primary data collection, increase documentation rates by requiring oral documentation corpora for advanced linguistics degrees, and continue refining best practices for oral documentation. Such efforts would combine to make it ever more possible for documenters to continue to BOLDly go where no one has gone before.

REFERENCES

Bird, Stephen, Anastasia Sai, Philip Tama & Sakarape Kamene. 2011. Equipping university students to document their ancestral languages. Paper presented at the 2nd International Conference on Language Documentation & Conservation. University of Hawai‘i at Mānoa. http://hdl.handle.net/10125/5188.

BOLD:PNG. 2010. http://www.boldpng.info/. (August 2011.)

Bouquiaux, Luc & Jacqueline M. C. Thomas. 1992. Studying and Describing Unwritten Languages. Translated by James Roberts. Dallas: Summer Institute of Linguistics.

Crystal, David. 2000. Language death. Cambridge: Cambridge University Press.

Fishman, Joshua. 1991. Reversing language shift. Clevedon, UK: Multilingual Matters.

Hale, Kenneth L., Colette Craig, Nora England, Laverne Masayesva Jeanne, Michael Krauss, Lucille Watahomigie & Akira Yamamoto. 1992. Endangered Languages. Language 68(1). 1–42.

Himmelmann, Nikolaus. 1998. Documentary and descriptive linguistics. Linguistics 36. 165–191. http://www.hrelp.org/events/workshops/eldp2005/reading/himmelmann.pdf.

Himmelmann, Nikolaus. 2009. Linguistic data types and documentary linguistics. Plenary address at the 1st International Conference on Language Documentation & Conservation. University of Hawai‘i at Mānoa. http://hdl.handle.net/10125/5162. (25 July 2010.)

Krauss, Michael. 1992. The world’s languages in crisis. Language 68(1). 4–10.

Lewis, M. Paul & Gary F. Simons. 2010. Assessing endangerment: Expanding Fishman’s GIDS. Revue Roumaine de Linguistique 55(2). 103–120. http://www.lingv.ro/resources/scm_images/RRL-02-2010-Lewis.pdf.

Liberman, Mark. 2006. The problems of scale in language documentation. Talk presented at the Texas Linguistics Society X Conference: Computational Linguistics for Less-Studied Languages. University of Texas at Austin. http://uts.cc.utexas.edu/~tls/2006tls/abstracts/pdfs/liberman.pdf [abstract].

Reiman, D. Will. 2010. Basic oral language documentation. Language Documentation & Conservation 4.

Simons, Gary F. 2008. The rise of documentary linguistics and a new kind of corpus. [Powerpoint presentation.] http://www.sil.org/~simonsg/presentation/doc%20ling.pdf.

Woodbury, Anthony C. 2003. Defining language documentation. In Peter K. Austin (ed.), Language Documentation and Description 1. 35–51. London: SOAS.

Woodbury, Anthony C. 2007. On thick translation in language documentation. In Peter K. Austin (ed.), Language Documentation and Description 4. 120–135. London: SOAS.

Photo Credits: All photos used with permission and with appropriate audio informed consent of those depicted.

Brenda H. Boerger
