

JLTA Journal, vol. 22: pp. 23-43, 2019 Copyright© 2019 Japan Language Testing Association DOI: 10.20622/jltajournal.22.0_23 Print ISSN 2189-5341 Online ISSN 2189-9746

Constructing Measurement Models of L2 Linguistic Complexity: A Structural Equation Modeling Approach

Takeshi KATO

Graduate School, University of Tsukuba

Abstract

Over the last four decades, the constructs of complexity, accuracy, and fluency have been in focus in the analysis of language learners’ performance. However, due to the polysemous nature of complexity, more and more sub-constructs have been assumed, making holistic measurement difficult. This study aims to construct a more appropriate measurement model of L2 complexity by implementing finer-grained and relatively novel linguistic indices to capture subordinate constructs that could not be measured by conventional indices. Using five natural language processing tools, conventional and fine-grained indices of complexity were computed from 503 argumentative essays written by Japanese English learners. First, exploratory factor analysis was performed on the linguistic index values to extract the factor structures behind them. Second, confirmatory factor analysis was conducted to confirm whether those structures fit the data. Finally, a structural equation model of complexity constructs predicting essay scores was tested to evaluate its applicability to writing evaluation. The results of the series of factor analyses showed that the extracted factor structures fit the data reasonably well for syntactic complexity (CFI = .901 and RMSEA = .071) and for lexical complexity (CFI = .978 and RMSEA = .051). Furthermore, the Structural Equation Modeling (SEM) analysis, which was proposed as a predictive model, accounted for 32.3% of the variance in essay scores (CFI = .916 and RMSEA = .077). Overall, the findings showed the effectiveness of the proposed approach, which combined conventional linguistic features with fine-grained and relatively novel indices.

Keywords: syntactic complexity, lexical complexity, construct, writing assessment, automated essay scoring

Textual features have been considered and utilized as a link between language learners’

proficiency and their performance in the target language in language assessment and developmental language studies (Wolfe-Quintero, Inagaki, & Kim, 1998). Over the past 45 years, the complexity, accuracy, and fluency (CAF) triad has been the most frequently used framework in this research area (Housen, Kuiken, & Vedder, 2012). However, the polysemous nature of complexity has caused several methodological issues regarding the


measurement of the construct (Bulté & Housen, 2012; Pallotti, 2009). In both written and spoken language analyses, complexity has generally been divided into two aspects: syntactic complexity and lexical complexity (Sakuragi, 2011; Wolfe-Quintero et al., 1998). Problems in measuring these two constructs are outlined below.

As for syntactic complexity, large-grained linguistic indices represented by mean length of T-unit have been most frequently used in its measurement (Ortega, 2003). However, these indices are difficult to interpret, since it is not clear what kind of complexification is operationalized in language performance (Ortega, 2012). In addition, Biber, Gray, and Poonpon (2011), who comprehensively investigated the linguistic features of large-scale English speaking and writing corpora using the Biber Tagger, reported that the linguistic features that discriminate proficiency in writing are phrase-level elaborations (e.g., attributive adjectives and prepositional phrases as modifiers for nouns). According to the conceptual model of syntactic complexity proposed by Bulté and Housen (2012, p. 27), this construct reflects syntactic sophistication and diversity. Wolfe-Quintero et al. (1998) also define this construct as “a wide and variety of both basic and sophisticated structures [that] are available to the learner” (p. 69). However, such norms (i.e., diversity and sophistication) have rarely been investigated (Yoon, 2017).

Regarding lexical complexity, learners’ lexical diversity and sophistication have been operationalized into linguistic indices for a single word, but this construct does not only consist of lexemic complexity; it also involves collocational complexity at the theoretical level (Bulté & Housen, 2012). The number of studies that focus on the frequency and range of multiword units has recently been increasing (e.g., Biber et al., 2004). In accordance with these research trends, newer indices that consider n-gram frequency and range have been developed and employed; the relationship between n-gram frequency and analytic scores of lexical proficiency (Kyle & Crossley, 2015) and correlations between n-gram range and human judgments of L1 independent writing tasks (Crossley, Cai, & McNamara, 2012) have been reported. In this way, lexemic complexity and collocational complexity have been recognized and measured individually, but they have not been empirically modeled as subordinate constructs of lexical complexity as is implied in Bulté and Housen’s (2012) definition.

To meet the above academic needs, the Tool for Automated Analysis of Syntactic Sophistication and Complexity (TAASSC; Kyle, 2016) and the Tool for Automated Analysis of Lexical Sophistication (TAALES; Kyle & Crossley, 2015) were developed. These analyzers have recently been utilized in L2 writing studies, and it is reported that with their predictive models, which include fine-grained and novel indices, it is possible to predict proficiency with higher accuracy than models using only conventional indices (e.g., Kim, Crossley, & Kyle, 2018; Kyle, Crossley, & Berger, 2018).

This study aims to construct more appropriate measurement models of complexity by combining conventional subordinate constructs of syntactic and lexical complexity, and sub-constructs that have not been measured (i.e., syntactic diversity and sophistication) or


modeled (i.e., range norms and n-gram sophistication).

Literature Review

L2 Complexity as a Multidimensional Construct

Several previous studies have considered complexity as a conceptual construct. Ellis and Barkhuizen (2005) defined it as “[the] use of more challenging and difficult language… Complexity is the extent to which learners produce elaborated language” (p. 139). Wolfe-Quintero et al. (1998) stated that “grammatical and lexical complexity mean that a wide and variety of both basic and sophisticated structures and words are available to the learner” (p. 69, p. 101), and Skehan (2003) defined complexity as “the complexity of the underlying interlanguage system developed” (p. 8). More recently, Housen et al. (2012) summarized the definitions offered by previous studies and proposed the following working definition: “the ability to use a wide and varied range of sophisticated structures and vocabulary in the L2” (p. 2). Researchers have attempted to capture these conceptual notions with linguistic features extracted from learners’ spoken and written performances.

According to Bulté and Housen (2012), complexity can take two forms: absolute complexity and relative complexity. The former refers to formal linguistic complexifications (e.g., the amount of subordination), while the latter indicates the relative difficulty of learning and processing particular linguistic structures related to input frequency and contingency.

Among the several subordinate constructs of complexity, linguistic complexity is the most-studied area of complexity in research exploring the developmental state of learners’ interlanguage by utilizing linguistic features in performance. This study focuses on syntactic and lexical complexity, both of which have been keenly adopted in numerous studies (e.g., Kato & Ono, 2019; Lu, 2010, 2012; Norris & Ortega, 2009; Wolfe-Quintero et al., 1998).

Figure 1 summarizes arguments made in previous research and presents a structure of the subordinate constructs of linguistic complexity. The following sections describe syntactic complexity and lexical complexity in detail based on this figure.

Figure 1. A summarized taxonomic model of linguistic complexity.


Syntactic Complexity

Syntactic complexity, defined as “the range of forms that surface in language production and the degree of sophistication of such forms” (Ortega, 2003, p. 492), has been an important construct in studies on the development of L2 proficiency. Both systematic corpus studies (e.g., Biber et al., 2011; Lu, 2011) and comprehensive research syntheses (e.g., Norris & Ortega, 2009; Ortega, 2003; Wolfe-Quintero et al., 1998) have validated and verified the applicability of syntactic complexity measures to L2 writing measurement and development studies. Conventionally, it has been assumed that complex linguistic structures with varied types of embedded elements reflect learners’ proficiency levels based on language system development. Therefore, several global or clause-level indices of linguistic complexity have been widely employed in L2 writing research (Ortega, 2003). Clausal subordination measures have predicted L2 proficiency levels satisfactorily (Wolfe-Quintero et al., 1998), and clausal coordination has been used to predict proficiency of language development at beginning levels (Bardovi-Harlig, 1992).

Lexical Complexity

Many studies have investigated the vocabulary knowledge of L2 learners (e.g., Meara, 1996; Read, 2000). Most of those studies have assumed that the words in learners’ language usage reflect vocabulary knowledge and can be used to verify learners’ general L2 proficiencies (e.g., Daller & Xue, 2007; Jarvis, 2013; Lu, 2012; Saito, Webb, Trofimovich, & Isaacs, 2016).

Multiple technical terms have been devoted to productive vocabulary knowledge, the most common of which is lexical richness. The concept captures the overall quality of the vocabulary in a linguistic sample (Daller, Van Hout, & Treffers-Daller, 2007; Milton, 2009; Read, 2000). Many researchers have followed Read (2000) in using the concept of lexical richness as a superordinate construct to cover all lexical constructs and their associated measures.

In CAF-based research, lexical richness has been treated generally as lexical complexity since the target subordinate constructs and linguistic phenomena are shared. Bulté and Housen’s (2012, p. 28) model of lexical complexity includes lexical density, sophistication, and diversity, all of which were listed in Read’s (2000) sub-categories of lexical richness.

Expanding the Coverage of Complexity Measurement

As shown in Figure 1, complexity is a construct that implies syntactic and lexical diversity and sophistication. Previous studies have focused on structural complexity for syntactic complexity and on word-level complexity for lexical complexity.

However, regarding syntactic complexity, the diversity and sophistication of syntactic structures have rarely been measured (Yoon, 2017). Moreover, large-grained indices such as length-based indices have been criticized regarding the difficulty of interpreting them (Norris


& Ortega, 2009; Ortega, 2012). For instance, Biber et al. (2011) reported that phrasal complexification is a more appropriate feature of academic writing than clausal complexification through comprehensive corpus analysis. Regarding lexical complexity, collocational complexity has rarely been assumed, but recent studies have used relatively novel indices that capture n-gram diversity and sophistication (e.g., Biber et al., 2004; Crossley et al., 2012; Kyle & Crossley, 2015).

Considering that those concepts were originally supposed to be part of the target construct, it is worthwhile to construct and validate expanded measurement models of syntactic and lexical complexity. For these purposes, this study implements two state-of-the-art analyzers and the extended sub-constructs explained below.

Tool for Automated Analysis of Syntactic Sophistication and Complexity

Kyle (2016) created TAASSC to analyze fine-grained syntactic features of learners’ performance, such as fine-grained noun phrase complexity, fine-grained clausal complexity, and syntactic sophistication and variation.

Fine-grained noun phrase complexity. TAASSC provides 66 types of noun phrasal indices and they can be divided into three types. The first type calculates the average number of dependents per each phrase type (e.g., nominal subjects) and for all phrase types. The second one quantifies the occurrence of particular dependent types (e.g., adjective modifiers) regardless of the type of noun phrase in which they occur. The final type computes the average occurrence of particular dependent types in particular types of noun phrases (e.g., adjective modifiers occurring in nominal subjects). Additionally, for the first type of indices, a standard deviation is calculated as a variation of phrasal structures.
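The logic of the first index type above can be sketched with a toy dependency parse. This is a simplified illustration, not TAASSC itself: tokens are hypothetical `(index, word, dep_label, head_index)` tuples rather than output from a real parser.

```python
# Illustrative sketch (not TAASSC): average number of direct dependents
# per nominal subject, computed from a toy dependency parse. Each token
# is (index, word, dep_label, head_index); head_index -1 marks the root.
from collections import defaultdict

def avg_dependents(parse, target_dep="nsubj"):
    """Mean number of direct dependents per token labeled target_dep."""
    children = defaultdict(list)
    for idx, word, dep, head in parse:
        if head != -1:
            children[head].append(idx)
    targets = [idx for idx, _, dep, _ in parse if dep == target_dep]
    if not targets:
        return 0.0
    return sum(len(children[t]) for t in targets) / len(targets)

# "The tall student reads": the nominal subject "student" has two dependents.
parse = [
    (0, "The", "det", 2),
    (1, "tall", "amod", 2),
    (2, "student", "nsubj", 3),
    (3, "reads", "root", -1),
]
print(avg_dependents(parse))  # 2.0
```

Averaging this value over all noun phrases in an essay, and taking its standard deviation across phrases, corresponds to the first index type and its variation measure described above.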

Fine-grained clausal complexity indices. TAASSC has 31 fine-grained clausal complexity indices. Both finite and non-finite clauses are considered clauses by TAASSC. To prevent structures that inherently include more words (e.g., prepositional phrases) from receiving more weight than those that do not (e.g., adjectives), TAASSC counts the length of clauses as the number of direct dependents per clause. Additionally, instead of grouping structures such as dependent clauses or complex nominals, TAASSC counts each type separately. Of the 31 indices, 29 calculate the average number of particular structures per clause, while two calculate the total number of dependents per clause.

Syntactic sophistication and variation. In terms of cognitive linguistics, TAASSC operationalizes syntactic sophistication and variation with frequency scores and TTR of Verb Argument Construction (VAC) offered by the Corpus of Contemporary American English (COCA; Davies, 2009, 2010).

VAC in TAASSC is defined as a main verb and all the direct dependents it requires. In the sentence Taro eats fast, the main verb is eats, which takes two direct dependents: Taro and fast. Taro is the subject of the verb eats, and fast is an adverb that modifies eats. The clause Taro eats fast can be represented by the VAC nominal subject – verb – adverbial modifier


(Kyle, 2016). VAC frequency indices are calculated for main verb lemmas, VACs, and their combinations. Frequency indices comprise the average frequency score of the target structure (e.g., a VAC) in a particular text. If a structure that occurs in a text does not exist in the reference corpus, then it does not count toward the index score. These indices indicate the frequency of the linguistic structures in a text in reference to their use in COCA.

TTR is also computed for main verb lemmas, VACs, and their combinations. A token count includes the total number of occurrences of a particular structure in a text. A type count consists of the total number of unique instances of a particular structure. These indices measure the diversity of syntactic structures used in a text.
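The two index families above can be sketched in a few lines. This is a minimal illustration, not TAASSC's implementation: the reference frequency table is hypothetical (a stand-in for COCA-derived counts), and VACs are represented as plain strings.

```python
import math

# Hypothetical reference frequencies (a stand-in for COCA-derived counts).
ref_freq = {"nsubj-verb-dobj": 120000, "nsubj-verb-advmod": 45000}

def mean_log_vac_frequency(text_vacs):
    """Average logged reference frequency; VACs absent from the
    reference table are skipped, as described for TAASSC."""
    scores = [math.log(ref_freq[v]) for v in text_vacs if v in ref_freq]
    return sum(scores) / len(scores) if scores else 0.0

def vac_ttr(text_vacs):
    """Type-token ratio: unique VAC types divided by total VAC tokens."""
    return len(set(text_vacs)) / len(text_vacs) if text_vacs else 0.0

vacs = ["nsubj-verb-dobj", "nsubj-verb-dobj",
        "nsubj-verb-advmod", "nsubj-verb-xcomp"]
print(round(vac_ttr(vacs), 2))  # 0.75  (3 types / 4 tokens)
```

A text full of frequent, repeated constructions scores high on the frequency index and low on the TTR index, which is exactly the pattern the VAC Sophistication factor later captures.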

These three categories of fine-grained and novel linguistic features have recently been employed in studies predicting human-judged L2 writing quality. Such research has reported that the resulting models outperformed traditional predictive models in terms of the proportion of variance explained (e.g., Kyle & Crossley, 2017, 2018).

Tool for Automated Analysis of Lexical Sophistication

Kyle and Crossley (2015) developed TAALES, which provides indices for range (a compensatory notion to simple frequency) and for multiword units (i.e., bigrams and trigrams). In this section, those categories, which are related to conventional lexical complexity constructs, are discussed.

Range as a compensatory norm for frequency. Because frequency measures are based on the number of times a word occurs in a corpus as a whole, they are not sensitive to how widely a particular word is used. Range norm indices operationalize how widely the words of an input text are used, usually by counting the number of documents in which each word occurs in a reference corpus. Range indices have been implemented in predictive models to estimate human-judged essay scores and have shown significant associations with the scores (e.g., Kim et al., 2018; Kyle et al., 2018).
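The idea can be made concrete with a toy reference corpus. This is a hedged sketch of the general range-norm logic, not TAALES's code; the four-document "corpus" is invented for illustration.

```python
# Minimal sketch of a range norm: the proportion of reference-corpus
# documents containing each word, averaged over the words of a text
# that appear in the reference corpus at all.
ref_docs = [                      # toy reference "corpus" of 4 documents
    {"the", "students", "write", "essays"},
    {"the", "teacher", "reads", "essays"},
    {"a", "quixotic", "gambit"},
    {"the", "essays", "improve"},
]

def mean_range_score(text_words):
    scores = []
    for w in text_words:
        doc_count = sum(w in doc for doc in ref_docs)
        if doc_count:             # words unseen in the corpus are skipped
            scores.append(doc_count / len(ref_docs))
    return sum(scores) / len(scores) if scores else 0.0

# "the" occurs in 3/4 documents, "essays" in 3/4, "quixotic" in 1/4.
print(round(mean_range_score(["the", "essays", "quixotic"]), 3))  # 0.583
```

A rare word like "quixotic" may have a non-trivial raw frequency if it is concentrated in a few documents, yet it receives a low range score, which is why range compensates for frequency.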

N-gram frequencies. Many researchers have begun to shift their focus from single lexical items to larger units (e.g., Biber et al., 2004). Recently, n-grams have proven their usefulness in detecting cross-linguistic difference (e.g., Kyle, Crossley, Dai, & McNamara, 2013) and predicting writing quality (Crossley et al., 2012). Crossley et al. (2012), for example, calculated a variety of frequency-based indices for bigrams and trigrams in written and spoken sub-corpora in the BNC. They found that indices that measured bigram frequency, bigram proportions, and bigram accuracy were predictive of human judgments of essay quality.

In TAALES, each index calculates the mean word or n-gram frequency and range scores. This is achieved by dividing the sum of the frequency and range scores for the words or n-grams in a text by the number of words or n-grams in that text that receive the scores.
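The computation described in the paragraph above can be sketched as follows. This is an illustrative simplification, not TAALES itself: the bigram frequency table is hypothetical, and real indices would draw scores from the BNC or COCA.

```python
# Sketch of a TAALES-style mean bigram frequency: sum the reference
# frequency scores of the bigrams in a text and divide by the number of
# bigrams that actually receive a score (bigrams absent from the
# reference table are excluded from numerator and denominator alike).
bigram_freq = {("in", "addition"): 900, ("addition", "to"): 700}  # hypothetical

def mean_bigram_frequency(words):
    bigrams = list(zip(words, words[1:]))
    scores = [bigram_freq[b] for b in bigrams if b in bigram_freq]
    return sum(scores) / len(scores) if scores else 0.0

# Three bigrams, two of which receive scores: (900 + 700) / 2
print(mean_bigram_frequency(["in", "addition", "to", "cost"]))  # 800.0
```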

Research Questions

This study aims to construct more appropriate measurement models of complexity by


combining conventional subordinate constructs of syntactic and lexical complexity and sub-constructs that have not been measured or modeled. The current study is guided by the following two research questions (RQs):

RQ1: What linguistic indices, including fine-grained and novel indices, form factor structures in relation to conventional complexity indices?

RQ2: To what extent do the subordinate constructs of syntactic and lexical complexity contribute to the prediction of a writing score?

Method

Essay Data

A total of 503 essays authored by Japanese university students were analyzed. The writers’ English proficiency was estimated to range from the Common European Framework of Reference for Languages (CEFR) level A2 to B1 based on their TOEFL ITP score distribution (M = 477.84, SD = 40.32; Tannenbaum & Baron, 2011). The essays were written under a time limit of 40 minutes, using the Criterion® editor without thesaurus or grammar-check functions; the target length was set at 200 words (M = 209.57, SD = 92.90). The essays were collected in three independent experiments, and a different topic was chosen from the Criterion topic list each time. The three tasks were consistent in that they were all argumentative tasks at the same “college level 1st year” level.

Essays were scored by the automated scoring engine e-rater® (Attali & Burstein, 2006) developed by the Educational Testing Service (M = 1.95, SD = 0.95). This system is trained on essay data evaluated by well-trained professional raters using rubrics, and it automatically assigns each essay a holistic rating from 1 to 6 points. The correlation between e-rater scores and human ratings is reported as .97 (Attali & Burstein, 2006, p. 22).

Tools and Indices

TAASSC and TAALES were used to extract fine-grained and novel indices related to syntactic and lexical complexity. The former offers fine-grained noun phrase complexity indices, clausal complexity indices, and VAC indices. The latter provides indices for range (a compensatory notion to simple frequency) and for multiword units (i.e., bigrams and trigrams). In this study, indices meeting the definitions of conventional linguistic complexity constructs were selected from these three categories and used for the analysis.

Fine-grained phrasal complexity indices. TAASSC provides 132 kinds of noun phrase complexity indices. Half of these count pronouns as noun phrases, while the other half do not. Noun phrases in English can consist of pronouns that do not take direct dependents. In order to avoid the confusion caused by skewness resulting from including indices involving pronouns, this study employs 66 indices that involve nouns as fine-grained phrasal complexity indices.


Fine-grained clausal complexity indices. This analytic tool computes 31 types of clausal complexity indices, all of which were included.

VAC frequency and variation. VAC frequency indices comprise the average frequency score of the target structure (e.g., a VAC) in a particular text. If a structure that occurs in a text does not exist in the reference corpus (i.e., COCA), it does not count toward the index score. Regarding syntactic variation, TTRs are also computed. In this research, indices for main verb lemmas were excluded because these measures overlap with lexical diversity indices.

Word frequency and range. Word frequency and range indices in TAALES are computed based on several corpora. Since this study deals with written texts, only sub-corpora consisting of written texts in the BNC, COCA, Lorge’s corpus of popular magazine articles (Thorndike & Lorge, 1944), and the Brown corpus (Kučera & Francis, 1967) were employed. Each index calculates the mean frequency and range scores by dividing the sum of the frequency and range scores for the words in a text by the number of words in that text that receive a frequency and range score.

N-gram frequency and range. Frequency and range indices for bigrams and trigrams in TAALES are computed with BNC and COCA. Each index calculates the mean n-gram frequency and range scores by dividing the sum of the frequency and range scores for the n-grams in a text by the number of n-grams in that text that receive the scores. The proportion of n-grams in the target text that also occur frequently in the reference corpus is also computed.

Procedure

First, from the essay data, 183 index values (19 conventional syntactic complexity indices, 66 noun phrase complexity indices, 31 fine-grained clausal complexity indices, 24 VAC indices, seven conventional lexical complexity indices, eight word frequency and range indices, and 28 n-gram frequency and range indices) were calculated using L2 Syntactic Complexity Analyzer (SCA; Lu, 2010), Lexical Complexity Analyzer (LCA; Lu, 2012), Coh-Metrix 3.0 (McNamara, Graesser, McCarthy, & Cai, 2014), TAASSC, and TAALES.

Then, in order to avoid problems of multicollinearity, for each pair of indices whose correlation coefficient exceeded .90, only the index that had been used more frequently in previous studies was retained. Thereafter, the feasibility of factor analysis was tested and exploratory factor analysis was conducted. For the analysis, the psych package (Revelle, 2018) in the statistical development environment R (R Core Team, 2018) was utilized.
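The multicollinearity screen can be sketched as follows. This is a minimal Python illustration of the pruning rule (the study itself worked in R); the three index series are invented, and dictionary order stands in for "used more frequently in previous studies."

```python
# Sketch of the multicollinearity screen: for every pair of indices with
# |r| > .90, keep only the earlier-listed index (a stand-in for the index
# used more frequently in previous studies) and drop the other.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def prune(indices, threshold=0.90):
    """indices: ordered mapping of index name -> values per essay."""
    kept = []
    for name, values in indices.items():
        if all(abs(pearson_r(values, indices[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

indices = {
    "MLT":  [10.0, 12.0, 14.0, 11.0],
    "MLC":  [10.1, 12.2, 13.9, 11.0],   # near-duplicate of MLT -> dropped
    "DC/C": [0.30, 0.10, 0.50, 0.20],
}
print(prune(indices))  # ['MLT', 'DC/C']
```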

Confirmatory factor analyses (CFAs) were then conducted to determine whether the measurement models fit the data. As requisites for CFA, univariate and multivariate normality, parameter estimation method, model fit indices used, and sample size were confirmed. First, using the MVN package (Korkmaz, Goksuluk, & Zarasiz, 2014), univariate and multivariate normality were judged based on skewness and kurtosis of the variables and Mardia’s


normalized estimate, and both were found to be violated. Thus, the diagonally weighted least squares method was used for parameter estimation. One of the factor loadings from each factor was fixed to 1.00 for scale identification. Model fit was checked against a CFI of .90 or above (Arbuckle & Wothke, 1995), an RMSEA of .08 or below (Browne & Cudeck, 1993), and an SRMR of .08 or below (Hu & Bentler, 1999). The dataset did not have missing values. The sample size exceeded 200, which is considered large according to Kline’s (2005) guidelines.
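The three fit cutoffs combine into a single acceptance rule, sketched here for clarity (an illustration of the criteria, not part of the study's R workflow):

```python
# Fit-evaluation rule used in this study: a model is judged acceptable
# when CFI >= .90, RMSEA <= .08, and SRMR <= .08 all hold.
def acceptable_fit(cfi, rmsea, srmr):
    return cfi >= 0.90 and rmsea <= 0.08 and srmr <= 0.08

# The syntactic complexity CFA reported below meets all three cutoffs.
print(acceptable_fit(cfi=0.901, rmsea=0.071, srmr=0.05))  # True
```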

Finally, in order to verify the extent to which the measurement models of the subordinate constructs of linguistic complexity can predict L2 writing quality, Structural Equation Modeling (SEM) analysis was conducted. The premises of this analysis and the evaluation of goodness of fit were the same as for the CFA, and the model was constructed from the subordinate constructs of L2 complexity. The distribution of Criterion scores was positively skewed and the number of grades was small; thus, the scores were treated as ordinal scale data.

Results

Exploratory Factor Analysis

Before performing the analysis, observed variables that did not meet the criteria were excluded on the basis of the following principles: they (i) rarely appeared in the participants’ responses, or (ii) were extremely similar to one of the remaining variables. Based on the shapes of the distributions of each index and the descriptive statistics, 50 indices that rarely occurred in the submitted essays were excluded from the analysis (e.g., verbal modifiers per indirect object). Referring to the Pearson product-moment correlation coefficients between the 133 index values, 26 pairs with a correlation coefficient exceeding .90 were identified. Therefore, 26 indices were excluded and a total of 107 indices (84 syntactic complexity indices and 23 lexical complexity indices) were employed for subsequent analyses.

Syntactic complexity. As for the feasibility of conducting factor analysis, the KMO test and Bartlett’s sphericity test were performed. The KMO score was .657 (“mediocre,” Kaiser, 1974), which was judged acceptable for the sample. In addition, Bartlett’s test was significant (χ2 = 6877, df = 276, p < .001), indicating sufficient correlations among the variables. The number of factors to be extracted was set to three based on the VSS criterion (Revelle & Rocklin, 1979). As for the factoring method, the weighted least squares method was used because almost none of the variables were normally distributed and multivariate normality was not confirmed based on Mardia’s multivariate kurtosis. Sixty items were removed from the model on the basis of the following predetermined criteria: indices were removed if they did not have primary factor loadings ≥ .40 or if they loaded on more than one factor. A summary of the analysis is presented in Table 1.

The first factor seemed to capture phrasal complexity. This factor included 11 indices that operationalize phrasal elaboration, nominal modification, and their variation. Thus, this factor was interpreted as Phrasal Complexity. The second factor seemed to capture VAC frequency. Four positively loaded indices operationalize the frequency of VAC or verb-VAC


combinations. Negatively loaded items were related to variation of VACs or verb-VAC combinations and to grammatical structures per clause. A text that earns a high score on this factor will tend to include more frequent VACs or verb-VAC combinations, more repetitive use of those constructions, and fewer grammatical modifiers such as adverbs and prepositions. Thus, this factor was interpreted as VAC Sophistication. The third factor seemed to capture clausal complexification. All four indices loaded on this factor quantify clausal subordination and coordination; thus, this factor was interpreted as Clausal Complexity. To verify the internal consistency of each factor, Cronbach’s coefficient alpha (Cronbach, 1951) was computed: for Factor 1, α = .84; for Factor 2, α = .84; and for Factor 3, α = .83.
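The internal-consistency check uses the standard Cronbach's alpha formula, sketched below. This is a generic illustration (not the psych package's implementation), with invented toy item scores; population variances are used for simplicity.

```python
# Sketch of Cronbach's alpha for a factor:
#   alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """items: list of k score lists, one per index loading on the factor."""
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]          # per-essay sums
    item_var = sum(variance(it) for it in items)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

# Three toy indices measured on four essays; highly consistent scores
# yield an alpha near 1.
items = [[1, 2, 3, 4], [1, 2, 3, 5], [2, 2, 3, 4]]
print(round(cronbach_alpha(items), 2))  # 0.96
```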

Lexical complexity. The KMO score was .767 (“middling,” Kaiser, 1974), which was judged acceptable for the sample. In addition, Bartlett’s test was significant (χ2 = 7872.5, df = 120, p < .001), indicating sufficient correlations among the variables. The number of factors to be extracted was set to three based on the VSS criterion (Revelle & Rocklin, 1979). As mentioned earlier, because the distribution of the variables was unknown and multivariate normality was not confirmed, the weighted least squares method was utilized for factoring. Seven items were removed from the model based on the following predetermined criteria: indices were removed if they did not have primary factor loadings ≥ .40 or if they loaded on more than one factor. A summary of the analysis is presented in Table 2.

The first factor seemed to capture single-word frequency. The rareness of word occurrences was evaluated with simple frequency and range scores derived from three different reference corpora. This factor was interpreted as Word Frequency. The second factor seemed to capture lexical diversity. All indices loaded on this factor reflect word variety computed over a whole sequential thread of words or randomly sampled words in a text. This factor was interpreted as Lexical Diversity. The third factor seemed to capture n-gram frequency. Of the four indices loaded on this factor, three quantify the frequency of bigram or trigram use in a text against reference corpora. This factor was interpreted as N-Gram Frequency. To verify the internal consistency of each factor, Cronbach’s coefficient alpha (Cronbach, 1951) was computed: for Factor 1, α = .93; for Factor 2, α = .93; and for Factor 3, α = .76.

Confirmatory Factor Analysis

In order to verify whether the syntactic and lexical complexity factor structures described above fit the data, CFA was performed using the lavaan package (Rosseel, 2012) in R (R Core Team, 2018). The diagonally weighted least squares method was applied to estimate parameters. Inter-variable Pearson product-moment correlations for syntactic complexity (r = −.60 to .79) and for lexical complexity (r = −.67 to .89) were not high enough to cause multicollinearity (r = .90 or above; Tabachnick & Fidell, 2014).
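The multicollinearity screen amounts to checking the largest absolute pairwise correlation against the .90 cutoff. A minimal sketch with simulated index scores (the function and data are illustrative):

```python
import numpy as np

def max_abs_correlation(data):
    """Largest absolute pairwise Pearson correlation among the columns."""
    corr = np.corrcoef(data, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.abs(off_diagonal).max())

# Hypothetical 503 x 8 matrix of complexity index scores
rng = np.random.default_rng(2)
indices = rng.normal(size=(503, 8))
collinear = max_abs_correlation(indices) >= 0.90  # Tabachnick & Fidell cutoff
```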

Table 1 Summary of Factor Loadings for Oblique Geomin-Rotated Three-Factor Solution With Conventional and Fine-Grained Syntactic Complexity Indices

Index | Code | Factor 1 | Factor 2 | Factor 3 | Communality
dependents per nominal | Ph1 | .88 | −.04 | .07 | .77
CN/C | Ph8 | .65 | .01 | −.17 | .46
dependents per nominal SD | Ph5 | .62 | −.07 | .26 | .45
MLC | Ph9 | .61 | .13 | −.21 | .45
prepositions per nominal | Ph11 | .58 | −.03 | −.08 | .34
dependents per direct object | Ph2 | .54 | −.04 | −.06 | .30
dependents per nominal subject | Ph4 | .53 | .04 | .01 | .29
dependents per object of the preposition | Ph3 | .52 | .09 | .10 | .30
dependents per nominal subject SD | Ph7 | .46 | .01 | .21 | .25
dependents per object of the preposition SD | Ph6 | .44 | .05 | .14 | .22
adjectival modifiers per nominal | Ph10 | .44 | .05 | −.11 | .21
logged average VAC frequency | VAC1 | .03 | .84 | −.17 | .71
dependents per clause | VAC7 | .05 | −.77 | .17 | .61
VAC type-token ratio - all | VAC3 | −.04 | −.71 | −.27 | .61
logged average verb-VAC combination frequency | VAC4 | −.02 | .60 | .15 | .40
verb-VAC combination type-token ratio | VAC5 | .05 | −.57 | −.31 | .46
% of verb-VAC combinations in text are in COCA | VAC6 | −.09 | .52 | −.01 | .28
prepositions per clause | VAC8 | .14 | −.50 | −.06 | .29
adverbial modifiers per clause | VAC9 | −.03 | −.50 | −.08 | .26
average VAC frequency SD | VAC2 | .15 | .45 | −.20 | .23
C/S | Cl1 | −.03 | .04 | .86 | .74
DC/C | Cl2 | −.02 | .03 | .86 | .74
VP/T | Cl3 | .06 | −.13 | .72 | .55
subordinating conjunctions per clause | Cl4 | .03 | .14 | .46 | .22

Factor correlations | Factor 1 | Factor 2 | Factor 3
Factor 1 | –– | |
Factor 2 | .12 | –– |
Factor 3 | −.01 | −.09 | ––
Eigenvalue | 3.81 | 3.56 | 2.77
% of variance | 16% | 15% | 12%
Cumulative % | 16% | 31% | 43%

Note. N = 503; factor loadings ≥ .40 are in boldface. CN/C = complex nominals per clause; MLC = mean length of clause; C/S = clauses per sentence; DC/C = dependent clauses per clause; VP/T = verb phrases per T-unit.

Table 2 Summary of Factor Loadings for Oblique Geomin-Rotated Three-Factor Solution With Conventional and Fine-Grained Lexical Complexity Indices

Index | Code | Factor 1 | Factor 2 | Factor 3 | Communality
BNC Written Range AW | Wf2 | .95 | .04 | −.02 | .88
BNC Written Frequency AW Logarithm | Wf1 | .82 | −.06 | .25 | .88
LS1 | Wf7 | −.82 | .04 | .32 | .67
COCA Academic Range AW | Wf5 | .81 | .05 | .26 | .80
BNC Written Bigram Proportion 50k | Wf3 | .79 | .02 | .16 | .70
Kučera-Francis Register Range AW | Wf6 | .75 | .11 | −.13 | .49
COCA Academic Frequency AW Logarithm | Wf4 | .72 | −.05 | .32 | .77
LS2 | Wf8 | −.62 | .07 | .25 | .41
HD-D | Div4 | .00 | .94 | −.04 | .90
MTLD | Div2 | −.03 | .93 | .05 | .87
MSTTR | Div3 | .07 | .89 | −.02 | .75
UBER | Div1 | −.13 | .73 | .01 | .61
COCA Academic Bigram Range Logarithm | NGf1 | .26 | .01 | .77 | .76
BNC Written Bigram Frequency Logarithm | NGf2 | −.01 | −.06 | .76 | .59
COCA Academic Frequency AW | NGf4 | .02 | −.03 | .57 | .33
COCA Academic Trigram Range Logarithm | NGf3 | −.07 | .01 | .46 | .20

Factor correlations | Factor 1 | Factor 2 | Factor 3
Factor 1 | –– | |
Factor 2 | −.34 | –– |
Factor 3 | .24 | −.12 | ––
Eigenvalue | 5.22 | 3.12 | 2.26
% of variance | 33% | 20% | 14%
Cumulative % | 33% | 52% | 66%

Note. N = 503; factor loadings ≥ .40 are in boldface. AW = all words; LS1 = ratio of sophisticated word tokens; LS2 = ratio of sophisticated word types; HD-D = hypergeometric distributed D measure; MTLD = the measure of textual lexical diversity; MSTTR = mean segmental TTR; UBER = uber index.

Syntactic complexity. First, CFA was performed for each factor obtained by EFA. Regarding Phrasal Complexity, correlations between the uniquenesses of indices operationalizing similar grammatical phenomena were estimated beforehand. Ph1, Ph2, Ph3, and Ph4 calculate the arithmetic mean of the number of dependents within specific nominal structures. Since Ph1 subsumes Ph2, Ph3, and Ph4 in terms of the category of the target structure, correlations between Ph1 and each of Ph2, Ph3, and Ph4 were naturally assumed. Ph5, Ph6, and Ph7 calculate the standard deviation of the number of dependents within specific nominal structures; for the same reason, correlations between Ph5 and each of Ph6 and Ph7 were also assumed. Because Ph1, Ph3, and Ph4 share their target linguistic structures with Ph5, Ph6, and Ph7, respectively, correlations between each mean–SD pair were also assumed. The model showed a reasonable fit to the data, where CFI = .959, RMSEA = .075, and SRMR = .072.

Regarding Clausal Complexity, the uni-factor model obtained from EFA was verified. The model showed an excellent fit to the data, where CFI = .997, RMSEA = .044, and SRMR = .032.

Regarding VAC Sophistication, error covariances of indices operationalizing similar grammatical phenomena were estimated beforehand. VAC1 and VAC2 share the target structures (i.e., the number of VACs), and an error covariance between them was assumed. The pairs VAC1 and VAC4, and VAC3 and VAC5, quantify similar structures with the same calculation method, and thus error covariances between them were assumed. The error covariance between VAC1 and VAC4 was not significant (p = .181), so this path was removed. The final model showed a reasonable fit to the data, where CFI = .963, RMSEA = .077, and SRMR = .078.

After each of the three factors was verified in the uni-factor CFAs above, the three-factor solution was tested. The initial model, which simply consisted of the three uni-factor structures, did not fit the data. Based on modification indices, because the amount of subordination (e.g., relative pronouns) would influence the number of dependents per nominal, paths were drawn from Clausal Complexity to Ph5 and Ph7. In addition, because the amount of clausal subordination and coordination would affect syntactic variation, paths were added from Clausal Complexity to VAC3 and VAC5. The final model showed a reasonable fit to the data, where CFI = .901, RMSEA = .071, and SRMR = .079. As competing models, the following two were tested: (a) a single-factor model in which the three latent variables are perfectly correlated; and (b) an uncorrelated three-factor model in which the latent variables are not correlated. The former estimation did not converge, so comparison was impossible. For the latter, the goodness of fit was as follows: CFI = .869, RMSEA = .081, and SRMR = .088. Therefore, the proposed correlated three-factor model was the best. Figure 2 shows the results.
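For reference, the CFI and RMSEA values compared here can be computed from the target-model and baseline-model chi-square statistics. The sketch below uses the standard formulas with illustrative values, not the fitted lavaan output of this study:

```python
from math import sqrt

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index from the target and baseline (null) models."""
    d_m = max(chi2_m - df_m, 0.0)            # target-model misfit
    d_b = max(chi2_b - df_b, d_m)            # baseline misfit (floored)
    return 1.0 - d_m / d_b if d_b > 0 else 1.0

def rmsea(chi2_m, df_m, n):
    """Root mean square error of approximation for sample size n."""
    return sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

# Purely hypothetical chi-square values for a sample of 503 essays
fit_cfi = cfi(500, 200, 3000, 276)
fit_rmsea = rmsea(500, 200, 503)
```

A model whose chi-square equals its degrees of freedom yields CFI = 1 and RMSEA = 0, which is why both indices reward parsimony as well as fit.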

Figure 2. Final CFA model of supplemented syntactic complexity (N = 503). All of the path coefficients are standardized and significant (p < .001). Phrasal = Phrasal Complexity; Clausal = Clausal Complexity; VACsop = VAC Sophistication.

Lexical complexity. For Word Frequency, an error covariance between LS1 and LS2 was estimated because their calculation methods are quite similar. The model showed an excellent fit to the data, where CFI = .998, RMSEA = .027, and SRMR = .053. For N-Gram Frequency, the model showed an excellent fit to the data, where CFI = .999, RMSEA = .000, and SRMR = .018. For Lexical Diversity, the model showed an excellent fit to the data, where CFI = .999, RMSEA = .000, and SRMR = .010.

Figure 3. Final CFA model of supplemented lexical complexity (N = 503). All of the path coefficients are standardized and significant (p < .001). Wfreq = Word Frequency; NGfreq = N-Gram Frequency; Diversity = Lexical Diversity.

After each of the three factors was verified in the uni-factor CFAs above, the three-factor solution was tested. The initial model of the Lexical Complexity measurement model, which simply consisted of the three uni-factor structures, did not fit the data. Based on modification indices, because n-gram frequencies affect word frequencies, paths from N-Gram Frequency to Wf7 and Wf8 were set. The final model showed a reasonable fit to the data, where CFI = .978, RMSEA = .051, and SRMR = .072. To compare the goodness of fit of competing models, as in the syntactic complexity analysis, a single-factor model and an uncorrelated three-factor model were tested. Again, the former did not converge. For the latter, the goodness of fit was as follows: CFI = .718, RMSEA = .181, and SRMR = .191. Thus, the correlated three-factor model was the most promising. Figure 3 summarizes the results.

Structural Equation Modeling

Finally, a model that predicts Criterion scores from the subordinate constructs of syntactic and lexical complexity obtained in the previous analyses was built and tested. First, the six latent variables were used as independent variables for predicting Criterion scores. However, the four paths from VAC Sophistication (B = .078, p = .069), Word Frequency (B = −.073, p = .108), N-Gram Frequency (B = .065, p = .166), and Lexical Diversity (B = .073, p = .175) to Criterion scores were not significant, and these paths were removed. All paths in the final model were then significant. The model showed a reasonable fit to the data, where CFI = .916, RMSEA = .077, and SRMR = .079. Figure 4 shows the results.

The two regression paths from Clausal Complexity (B = .476, p < .001) and Phrasal Complexity (B = .336, p < .001) were significant, and both latent variables were positively associated with Criterion scores. The predictive SEM model accounted for 32.3% of the variance of Criterion scores.
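The reported proportion of variance explained corresponds to the usual R² computation. A sketch with simulated data, reusing the two standardized path coefficients purely for illustration (the residual scale is chosen so that roughly a third of the variance is explained):

```python
import numpy as np

def r_squared(observed, predicted):
    """Proportion of variance in observed scores explained by predictions."""
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Two standardized latent predictors for 503 hypothetical essays
rng = np.random.default_rng(3)
clausal, phrasal = rng.normal(size=503), rng.normal(size=503)
predicted = 0.476 * clausal + 0.336 * phrasal
scores = predicted + rng.normal(scale=0.82, size=503)
r2 = r_squared(scores, predicted)
```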

Discussion

Factor Structures Behind Conventional and Fine-Grained Indices

The first purpose of this study was to investigate the factor structures of extended syntactic and lexical complexity measured by conventional and fine-grained linguistic features. The results of EFA and CFA indicate that each construct consists of three subordinate constructs.

Figure 4. Structural equation model with supplemented syntactic and lexical complexity sub-constructs as predictors of Criterion scores (N = 503). All of the path coefficients are standardized and significant (p < .001). CS = Criterion scores.

As for lexical complexity, Word Frequency, N-Gram Frequency, and Lexical Diversity were extracted as common factors. The Word Frequency factor includes conventional lexical sophistication indices (i.e., Wf7: LS1 and Wf8: LS2), and this factor is thus in line with the traditional norm of lexical sophistication. The inter-factor correlation between this construct and N-Gram Frequency was moderate (r = .50), suggesting that the N-Gram Frequency factor extends the traditional norm of lexical sophistication. This result supports Kim et al. (2018), who extracted distinct principal components of “Content Word Frequency” and “Bigram Frequency and Range.”

Writing Score Prediction With Supplemented L2 Complexity Sub-Constructs

The second aim of this study was to explore the degree to which latent variables related to linguistic complexity predict writing quality. The results of SEM suggest that supplemented Phrasal Complexity and Clausal Complexity were significantly associated with Criterion scores and accounted for 32.3% of the variance of the scores. This proportion of variance explained is larger than that of the model constructed with only conventional sub-constructs of linguistic complexity (29.8%; Kato & Ono, 2019). This result supports Biber et al.’s (2011) proposal regarding the necessity of considering phrase-level linguistic elaboration, specifically noun phrase elaboration. In addition, the trend that a combined predictive model (i.e., one using both conventional and fine-grained indices) accounts for a larger portion of the variance of writing scores than a model consisting only of conventional indices has also been reported by Kyle (2016) and Kyle and Crossley (2017).

On the other hand, four of the six latent variables were not associated with writing quality as derived from the automated essay scoring system. Regarding Lexical Diversity, its strong robustness against text length may have weakened its association with essay scores. The other three latent variables (i.e., VAC Sophistication, Word Frequency, and N-Gram Frequency) mainly reflect the use of common linguistic items, the opposite of previous approaches that reward low-frequency use. Although several studies confirm significant relationships between writing quality and VAC frequency (Kyle, 2016; Kyle & Crossley, 2017) and between writing quality and Word Frequency and N-Gram Frequency (Crossley et al., 2012; Kim et al., 2018; Kyle et al., 2018), the results of this study differ. One possibility is that, because the analyzers compute the indices tapping these constructs with reference to native speakers’ corpora (e.g., the BNC and COCA), and because the writers in the previous studies were more proficient than those in this study, those studies had greater variance in frequency scores, which may have contributed to prediction in their regression analyses.
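The length robustness mentioned above is easiest to see in MSTTR (Div3 in Table 2), the simplest of the diversity indices: because the type-token ratio is computed within fixed-length segments and then averaged, adding more text adds more segments rather than inflating or deflating the score. A minimal sketch (the 50-word segment size is a common default, not prescribed by the study):

```python
def msttr(tokens, segment_size=50):
    """Mean segmental type-token ratio: the TTR of each successive
    non-overlapping segment of fixed length, averaged (tail discarded)."""
    ttrs = [
        len(set(tokens[i:i + segment_size])) / segment_size
        for i in range(0, len(tokens) - segment_size + 1, segment_size)
    ]
    return sum(ttrs) / len(ttrs)

# Toy texts: maximal repetition vs. maximal diversity
repetitive = ["word"] * 100
diverse = [f"w{i}" for i in range(100)]
```

Doubling the length of either toy text leaves its MSTTR unchanged, whereas a plain TTR would shrink for the repetitive text as it grows.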

Conclusion and Future Research

In this study, in order to cover a larger portion of the broad complexity construct, attempts were made to integrate conventional indices of the subordinate constructs of syntactic and lexical complexity with fine-grained and relatively novel linguistic indices. The extracted factors were interpreted in line with previously proposed complexity constructs, and a predictive model consisting of these factors outperformed the model proposed in previous studies. The results also point to a direction for validating the measurement model of complexity (i.e., introducing finer-grained indices into the measurement model with theoretical backing) and indicate the necessity of developing more reliable indices for capturing EFL writers’ complexification in their performance. As with most research, this study has a number of limitations. First, all of the authors of the essays used as data belong to a single educational institution, which may limit the generalizability of the results. Additionally, their writing proficiency was relatively low, which may have caused a skewed distribution of the Criterion scores. While this study employed statistical methods that are robust against skewed data, it may be necessary to include essays written by more proficient language learners in order to obtain more valid results.

Another limitation that could be addressed in future studies is the source of the writing scores. In this study, the e-rater was employed as the sole rater to assign scores to each essay, but this tool does not fully reflect human judgment. Thus, a more detailed exploration of the relationship between human perception and conventional and fine-grained complexity indices on essays would be necessary.

Acknowledgements

I would like to thank Professor Yuichi ONO and my colleagues for their helpful advice.

I extend my gratitude to two anonymous reviewers for their insightful comments to improve this paper.

References

Arbuckle, J. L., & Wothke, W. (1995). Amos 4.0 user’s guide. Chicago, IL: SmallWaters Corporation.

Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. Journal of Technology, Learning, and Assessment, 4(3). Retrieved from https://ejournals.bc.edu/index.php/jtla/article/view/1650

Bardovi-Harlig, K. (1992). A second look at T-unit analysis: Reconsidering the sentence. TESOL Quarterly, 26, 390–395. https://doi.org/10.2307/3587016

Biber, D., Conrad, S. M., Reppen, R., Byrd, P., Helt, M., Clark, V., ... Urzua, A. (2004). Representing language use in the university: Analysis of the TOEFL 2000 spoken and written academic language corpus (ETS Research Memorandum RM-04-03). Princeton, NJ: Educational Testing Service.

Biber, D., Gray, B., & Poonpon, K. (2011). Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly, 45, 5–35. https://doi.org/10.5054/tq.2011.244483

Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.

Bulté, B., & Housen, A. (2012). Complexity, accuracy, and fluency: Definitions, measurement and research. In A. Housen, F. Kuiken, & I. Vedder (Eds.), Dimensions of L2 performance and proficiency: Complexity, accuracy, and fluency in SLA (pp. 21–46). Amsterdam, The Netherlands: John Benjamins.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. https://doi.org/10.1007/BF02310555

Crossley, S. A., Cai, Z., & McNamara, D. (2012). Syntagmatic, paradigmatic, and automatic n-gram approaches to assessing essay quality. In G. M. Youngblood & P. M. McCarthy (Eds.), Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society Conference (pp. 214–219). Florida, FL: AAAI Press.

Daller, H., & Xue, H. (2007). Lexical richness and the oral proficiency of Chinese EFL students. In H. Daller, J. Milton, & J. Treffers-Daller (Eds.), Modelling and assessing vocabulary knowledge (pp. 150–164). Cambridge University Press.

Davies, M. (2009). The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14, 159–190. https://doi.org/10.1075/ijcl.14.2.02dav

Davies, M. (2010). The Corpus of Contemporary American English as the first reliable monitor corpus of English. Literary and Linguistic Computing, 25, 447–464. https://doi.org/10.1093/llc/fqq018

Ellis, R., & Barkhuizen, G. (2005). Analysing learner language. Oxford University Press.

Housen, A., Kuiken, F., & Vedder, I. (2012). Complexity, accuracy, and fluency: Definitions, measurement, and research. In A. Housen, F. Kuiken, & I. Vedder (Eds.), Dimensions of L2 performance and proficiency: Complexity, accuracy, and fluency in SLA (pp. 1–20). Amsterdam, The Netherlands: John Benjamins.

Hu, L.-T., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.

Jarvis, S. (2013). Capturing the diversity in lexical diversity. Language Learning, 63, 87–106. https://doi.org/10.1111/j.1467-9922.2012.00739.x

Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39, 31–36. https://doi.org/10.1007/BF02291575

Kato, T., & Ono, Y. (2019). Constructing a measurement model of L2 complexity in automated essay scoring for Japanese EFL learners’ writing: Toward a qualitative and analytic evaluation. Proceedings of Society for Information Technology & Teacher Education International Conference (SITE 2019), 1351–1356.

Kim, M., Crossley, S. A., & Kyle, K. (2018). Lexical sophistication as a multidimensional phenomenon: Relations to second language lexical proficiency, development, and writing quality. Modern Language Journal, 102, 120–141. https://doi.org/10.1111/modl.12447

Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd. ed.). New York, NY: Guilford Press.

Korkmaz, S., Goksuluk, D., & Zararsiz, G. (2014). MVN: An R package for assessing multivariate normality. The R Journal, 6, 151–162. https://doi.org/10.32614/rj-2014-031

Kučera, H., & Francis, W. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication (Unpublished doctoral dissertation). Georgia State University, Atlanta, GA.

Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49, 757–786. https://doi.org/10.1002/tesq.194

Kyle, K., & Crossley, S. A. (2017). Assessing syntactic sophistication in L2 writing: A usage-based approach. Language Testing, 34, 513–535. https://doi.org/10.1177/0265532217712554

Kyle, K., & Crossley, S. A. (2018). Measuring syntactic complexity in L2 writing using fine-grained clausal and phrasal indices. Modern Language Journal, 102, 333–349. https://doi.org/10.1111/modl.12468

Kyle, K., Crossley, S. A., & Berger, C. (2018). The tool for the automatic analysis of lexical sophistication (TAALES): Version 2.0. Behavior Research Methods, 50, 1030–1046. https://doi.org/10.3758/s13428-017-0924-4

Kyle, K., Crossley, S. A., Dai, J., & McNamara, D. S. (2013). Native language identification: A key n-grams approach. In J. Tetreault, J. Burstein, & C. Leacock (Eds.), Proceedings of the eighth workshop on innovative use of NLP for building educational applications (pp. 242–250). Atlanta, GA: Association for Computational Linguistics.

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15, 474–496. https://doi.org/10.1075/ijcl.15.4.02lu

Lu, X. (2012). The relationship of lexical richness to the quality of ESL learners’ oral narratives. Modern Language Journal, 96, 190–208. https://doi.org/10.1111/j.1540-4781.2011.01232_1.x

McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press.

Meara, P. (1996). The dimensions of lexical competence. In G. Brown, K. Malmkjaer, & J. Williams (Eds.), Performance and competence in second language acquisition (pp. 35–53). Cambridge University Press.

Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol, England: Multilingual Matters.

Norris, J. M., & Ortega, L. (2009). Towards an organic approach to investigating CAF in instructed SLA: The case of complexity. Applied Linguistics, 30, 555–578. https://doi.org/10.1093/applin/amp044

Ortega, L. (2003). Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics, 24, 492–518. https://doi.org/10.1093/applin/24.4.492

Ortega, L. (2012). Interlanguage complexity: A construct in search of theoretical renewal. In B. Kortmann & B. Szmrecsanyi (Eds.), Linguistic complexity: Second language acquisition, indigenization, contact (pp. 127–155). Berlin, Germany: De Gruyter.

Pallotti, G. (2009). CAF: Defining, refining and differentiating constructs. Applied Linguistics, 30, 590–601. https://doi.org/10.1093/applin/amp045

R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

Read, J. (2000). Assessing vocabulary. Cambridge University Press.

Revelle, W. (2018). psych: Procedures for personality and psychological research. Evanston, IL: Northwestern University.

Revelle, W., & Rocklin, T. (1979). Very simple structure: An alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14, 403–414. https://doi.org/10.1207/s15327906mbr1404_2

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48, 1–36. https://doi.org/10.18637/jss.v048.i02

Saito, K., Webb, S., Trofimovich, P., & Isaacs, T. (2016). Lexical profiles of comprehensible second language speech. Studies in Second Language Acquisition, 38, 677–701. https://doi.org/10.1017/S0272263115000297

Sakuragi, T. (2011). Fukuzatsusa, seikakusa, ryuchosashihyo no koseigainendatousei no kensho [The construct validity of the measures of complexity, accuracy, and fluency: Analyzing the speaking performance of learners of Japanese]. JALT Journal, 33, 157–174.

Skehan, P. (2003). Task-based instruction. Language Teaching, 36, 1–14. https://doi.org/10.1017/S026144480200188X

Tabachnick, B. G., & Fidell, L. S. (2014). Using multivariate statistics. Harlow, UK: Pearson Education.

Tannenbaum, R. J., & Baron, P. A. (2011). Mapping TOEFL® ITP scores onto the Common European Framework of Reference (ETS Research Memorandum RM-11-33). Princeton, NJ: Educational Testing Service.

Thorndike, E. L., & Lorge, I. (1944). The teacher’s word book of 30,000 words. New York, NY: Columbia University Press.

Wolfe-Quintero, K., Inagaki, S., & Kim, H-Y. (1998). Second language development in writing: Measures of fluency, accuracy, & complexity. Honolulu, HI: Second Language Teaching & Curriculum Center, University of Hawai’i at Manoa.

Yoon, H. (2017). Investigating the interactions among genre, task complexity, and proficiency in L2 writing: A comprehensive text analysis and study of learner perceptions (Publication No. 10275616) [Doctoral dissertation, Michigan State University]. ProQuest Dissertations & Theses Global.
