
Running head: TASK AND RATER EFFECTS

The final, definitive version of this paper has been published in Language Testing, 33(3), July

2016 published by SAGE Publishing, All rights reserved.

In’nami, Y., & Koizumi, R. (2016). Task and rater effects in L2 speaking and writing: A

synthesis of generalizability studies. Language Testing, 33, 341-366.

doi:10.1177/0265532215587390

(SAGE Publications, UK & USA)

[Japanese title: Task and rater effects in second language speaking and writing: A synthesis of

studies using generalizability theory]

[Dataset for the synthesis (information on studies included and excluded in the synthesis);

The same file is also found in IRIS (http://www.iris-database.org/iris/app/home/index)]

Innami_Koizumi_Synthesis_G_study.xls



Task and rater effects in L2 speaking and writing: A synthesis of generalizability studies

Yo In’nami

Chuo University, Japan

Rie Koizumi

Juntendo University, Japan

Abstract

We addressed Deville and Chalhoub-Deville’s (2006), Schoonen’s (2012), and Xi and

Mollaun’s (2006) call for research into the contextual features that are considered related to

person-by-task interactions in the framework of generalizability theory in two ways. First, we

quantitatively synthesized the generalizability studies to determine the percentage of variation

in L2 speaking and L2 writing performance that was accounted for by tasks, raters, and their

interaction. Second, we examined the relationships between person-by-task interactions and

moderator variables. We used 28 datasets from 21 studies for L2 speaking, and 22 datasets

from 17 studies for L2 writing. Across modalities, most of the score variation was explained

by examinees’ performance; the interaction effects of tasks or raters were greater than the

independent effects of tasks or raters. Task and task-related interaction effects explained a

greater percentage of the score variances than did the rater and rater-related interaction

effects. The variances associated with the person-by-task interactions were larger for


assessments based on both general and academic contexts than for those based only on

academic contexts. Further, large person-by-task interactions were related to analytic scoring

and scoring criteria with task-specific language features. These findings derived from L2

speaking studies indicate that contexts, scoring methods, and scoring criteria might lead to

varied performance over tasks. Consequently, constructs need to be defined with particular

care.

Keywords

generalizability theory, research synthesis, L2 speaking, L2 writing, task, rater

Corresponding author

Yo In’nami, Division of English Language Education, Faculty of Science and Engineering,

Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan.

E-mail: [email protected]

1 Introduction

Assessing speaking and writing proficiency in performance testing presents numerous

challenges because, in addition to the examinees’ speaking and writing proficiency, many

variables are involved in the assessment process, including variability in assessment tasks and


rater judgments. Speaking and writing tasks produce performance variability due to a number

of variables such as task characteristics, topics, and prompts (e.g., Lee, Breland, & Muraki,

2004; Lumley & O’Sullivan, 2005; Skehan, 1998). Further, even after raters have been

trained to apply the same scoring criteria and level of severity consistently, raters exhibit

variations in severity and consistency (e.g., Weigle, 1998; Knoch, 2011). The impact of tasks

and raters has been widely investigated (e.g., Grabowski, 2009; Barkaoui, 2007) since

examinees’ performance on tasks is evaluated by raters, and subsequent scores form the basis

for all analysis and interpretation. Task and rater impacts have been examined mainly through

the many-facet Rasch measurement (e.g., Eckes, 2011; Weigle, 1998) and generalizability (G)

theory (e.g., Gebril, 2009; Lee, 2005). While the former enables the adjustment of scores and

investigations of differences in task difficulty, rater severity, and consistency, G theory can

decompose the score variances into those affected by the numerous factors and their

interactions. While an individual generalizability study provides valuable information on

variations in performance in the study’s context, a synthesis of previous generalizability

studies offers deep insights into overall trends across studies, which can be informative for

test developers and researchers. Such a synthesis is particularly valuable given that many

variables affect each other in intricate ways in speaking and writing assessments (e.g.,

Brennan, 1996; Xi & Mollaun, 2006). Moreover, a synthesis allows for a systematic

examination of the effects of variables by classifying them according to moderator (i.e.,


contextual) variables, even when such effects were not examined in the original studies.

Using this method, we will examine the magnitude of the impact of tasks and raters on L2

speaking and writing performance.

2 Literature Review

2.1 Task and Rater Effects in L2 Speaking and Writing in G Theory

G theory is a flexible, statistical framework for the systematic investigation of the

score variability in instruments under specific conditions by considering multiple sources of

error (Shavelson & Webb, 1991). G theory allows us to investigate, in a single analysis, the

independent and interactive effects of various factors such as examinees (persons), raters, and

items/tasks. For example, it is possible to determine what percentage of the variance of test

scores is due to the factors associated with persons or person-by-task interactions. Many

studies have used G theory to examine task and rater effects in L2 speaking (e.g., Stansfield

& Kenyon, 1992a, 1992b) and L2 writing research (e.g., Barkaoui, 2007; Wang, 2010). For

example, Barkaoui (2007) examined the impact of rating scale type (holistic vs. multiple

traits) on the L2 writing performance ratings of untrained raters. With regard to the holistic

scale, most of the variance was attributable to the person-by-task-by-rater interactions

(61.59%), followed by persons (26.23%), raters (5.73%), person-by-task interactions (5.01%),

and task-by-rater interactions (1.43%), with no impact for tasks and person-by-rater


interactions (both 0.00%; see Footnote 1). The results from averaged multiple traits scales showed that the

largest amount of the variance was explained by raters (58.18%), followed by person-by-task-

by-rater interactions (37.58%), and persons (4.24%). Variances attributable to tasks and

person-by-task, person-by-rater, and task-by-rater interactions were negligible (0.00% for

each of these effects). This indicates that holistic scoring produces more reliable results than

multiple trait scoring when raters are not trained.
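
As a minimal sketch of the decomposition described at the start of this section, the following Python code estimates variance components for a fully crossed person-by-task design with one score per cell, using the standard expected-mean-square equations for random effects. The score matrix and the function names are hypothetical illustrations, not data or code from any study reviewed here. Negative estimates are set to zero, a common convention that is assumed for this sketch.

```python
from typing import Dict, List


def g_study_p_by_t(scores: List[List[float]]) -> Dict[str, float]:
    """Estimate variance components for a crossed person-by-task design.

    scores[p][t] is the single score of person p on task t (hypothetical data).
    Returns components for persons ("p"), tasks ("t"), and the person-by-task
    interaction confounded with undifferentiated error ("pt,e").
    """
    n_p = len(scores)
    n_t = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_t)
    person_means = [sum(row) / n_t for row in scores]
    task_means = [sum(scores[p][t] for p in range(n_p)) / n_p for t in range(n_t)]

    # Mean squares from the two-way ANOVA without replication.
    ms_p = n_t * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_t = n_p * sum((m - grand) ** 2 for m in task_means) / (n_t - 1)
    ss_res = sum(
        (scores[p][t] - person_means[p] - task_means[t] + grand) ** 2
        for p in range(n_p)
        for t in range(n_t)
    )
    ms_res = ss_res / ((n_p - 1) * (n_t - 1))

    # Solve the expected-mean-square equations; truncate negatives at zero
    # (an assumption of this sketch).
    return {
        "p": max((ms_p - ms_res) / n_t, 0.0),
        "t": max((ms_t - ms_res) / n_p, 0.0),
        "pt,e": ms_res,
    }


def to_percentages(components: Dict[str, float]) -> Dict[str, float]:
    """Express each variance component as a percentage of the total."""
    total = sum(components.values())
    return {name: 100.0 * value / total for name, value in components.items()}


# Hypothetical scores: 4 persons (rows) by 3 tasks (columns).
example_scores = [
    [4.0, 3.0, 5.0],
    [2.0, 2.0, 3.0],
    [5.0, 4.0, 4.0],
    [3.0, 1.0, 2.0],
]
example_components = g_study_p_by_t(example_scores)
example_percentages = to_percentages(example_components)
```

In a design with only persons and tasks, the interaction cannot be separated from undifferentiated error, which is why the last component is labeled "pt,e". Designs that add a rater facet follow the same logic with more mean squares.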

Previous generalizability studies showed that person-by-task interaction effects are

relatively large in general (C. Huang, 2009) in addition to large person effects. However, it is

still unclear to what extent task- and rater-related effects explain the score variance, and

which effects—task and task-related interaction effects, or rater and rater-related interaction

effects—explain more of the score variances (which leads to Research Question 1).

2.2 Person-by-Task Interactions

The relatively large person-by-task effects observed in previous studies (C. Huang,

2009) indicate that examinees tend to perform better or worse depending on the tasks. Thus,

the quality of the examinees’ performances is inconsistent, and is rank-ordered differently for

different tasks. The factors involved in these performance variabilities may include curricula,

textbooks, teachers, learning conditions, and individual differences in learning styles and

personality (Brown, 2011).

Footnote 1. All of the percentages of the variance components reported here were calculated based on the variance

components reported in Barkaoui (2007). See Section 3.3 Analyses for details.

While the person-by-task effects have been viewed as an undesirable source of score

variability, Deville and Chalhoub-Deville (2006) take the issue further and argue that these

interactions are related to the construct, and suggest the need to refine our construct definition

(see, for construct definition, Bachman, 2007; Chapelle, 1998; Xi, 2015). Schoonen (2012)

expands this line of discussion and argues that if we pay more attention to person-by-task

interactions, we could define a construct in ways that do not produce large person-by-task

effects. For example, he hypothesized that if writing a narrative is a construct different from

writing a letter of application, a generalizability study that only includes tasks for the latter

should not produce large person-by-task effects. If tasks for both constructs are included,

results may show a larger person-by-task effect because both domains are different, and

inconsistent performance is expected. To understand the person-by-task effects better, Deville

and Chalhoub-Deville (2006) and Schoonen call for research into the contextual features that

relate to score variability, including person-by-task effects.

Xi and Mollaun (2006) suggest two possible contextual features affecting the degree

of person-by-task interactions: (a) highly contextualized task characteristics, such as task

types (independent vs. integrated) and contexts (general vs. academic), and (b) types of rating

scales (analytic vs. holistic). They speculated that the person-by-task interactions would be

larger for assessments using both independent and integrated tasks, and those based on both


general and academic contexts, than for those using either one of the task types or based on

any one context. By comparing their results with Lee’s (2005) findings, they also speculated

that task-specific features in analytic scoring are likely to contribute to larger person-by-task

interactions than those in holistic scoring, because these features in holistic scoring are

“considered globally with other more stable features,” and produce smaller person-by-task

interactions than analytic scoring does (Xi & Mollaun, 2006, p. 40). In addition, Fulcher (2003)

argued that very different tasks and task-specific rating scales produce large task-related

variances. However, as these are only speculations, Xi and Mollaun called for research into

variables related to person-by-task interactions (which leads to Research Question 2).

We believe that this question is particularly well suited to a research synthesis (In’nami

& Koizumi, 2014; Norris & Ortega, 2006) in which the contextual variables contributing to

person-by-task effects are more likely to be identified using systematically collected previous

studies. Researchers may obtain useful insights into the contextual variables related to a

universe of generalization, and a general trend. In this context, we are aware of two studies

synthesizing sources of variability in L2 generalizability studies.

C. Huang (2009) meta-analyzed the generalizability of the performance assessments

in education and psychology and reported, on average, a negligible effect of task, and a large

person-by-task interaction in L2 learning (k = 9; 3.65% and 15.06%) and L1 writing (k = 11;

2.99% and 27.46%). Since he focused on task and person-by-task interactions, rater facets (as


well as persons) were not coded, and their relative effects were unknown. Huang also showed

that the variances of the person-by-task interactions were associated with variables such as

scoring methods. For instance, holistic scoring produced a greater amount of variance than

analytic scoring did, which contrasts with the findings of Xi and Mollaun (2006). As Huang

focused on summarizing studies in education and psychology, he combined all the L2 studies

that he collected and did not classify them according to skills, which may have led to the

diverse results obtained.

Brown (2011) also synthesized L2 generalizability studies and reported that person-

by-task interactions explained from 0.45% to 39.06% of the variance in L2 performance tests,

exceeding the person-by-rater interactions in most (k = 10/13, 76.92%) of the studies. The

percentages of variance attributable to persons, tasks, and raters were 1.45% to 86.96%,

0.30% to 25.10%, and 0.00% to 61.10%, respectively. Although rater effects generally seem

to be more influential than task effects, their relative effects differ across studies. Brown

covered a wide range of areas of L2 testing, including the four skills, as well as multiple-

choice and performance tests. His method, however, seems to have some room for

improvement for four reasons: A comprehensive search for existing studies was not

conducted; studies such as those reporting only D studies were not included; the results were

not aggregated numerically; and studies across random and fixed effects designs were mixed.

The relatively large person-by-task interactions from these two studies indicate a


limited degree of across-task generalizability of the examinees’ performance. What remains

unknown are the relative effects of task, rater, and their interactions in L2 speaking and L2

writing and, more importantly, the relationship between person-by-task interactions and the

moderator variables that previous studies deemed worth investigating. Such an investigation

is likely to benefit from more systematically collected previous studies.

2.3 Current Study

We examined generalizability studies to quantitatively synthesize the percentage of

variation in L2 speaking and writing performance that is accounted for by tasks, raters, and

their interactions. Tasks and raters are two widely studied variables of paramount interest and

importance to language testers. We also examined the relationships between person-by-task

interactions, which previous studies suggest are relatively large, and moderator variables of

interest based on previous studies (i.e., task types, contexts, and scoring methods). We

examined two research questions in L2 speaking and writing:

1. Which effects—task and task-related interaction effects or rater and rater-related

interaction effects—explained more of the score variances?

2. Are the degrees of the person-by-task interactions related to task types, contexts, and

scoring methods?

Our synthesis of previous studies differs from C. Huang (2009) and Brown (2011) on two

main points: (1) based on searching methods of research synthesis (In’nami & Koizumi, 2010;


Oswald & Plonsky, 2010), we collected previous L2 speaking and writing studies in a more

systematic manner; and (2) we examined a general trend in the effects of persons, tasks, raters,

and moderator variables, while classifying the results by G-study design and L2 skills.

3 Method

3.1 Data Collection

To find potential generalizability studies in language testing and learning, we

conducted an extensive search in May 2012, using three methods. First, we followed a

published method (In’nami & Koizumi, 2009) and retrieved studies through computer

searches on databases: the Educational Resources Information Center (ERIC), FirstSearch,

Google, Linguistics and Language Behavior Abstracts (LLBA), MLA International

Bibliography, ProQuest, PsycINFO, ScienceDirect, Scopus, and Web of Science. We used the

following keywords: generalizability/generalisability/g theory,

generalizability/generalisability/g study/studies, performance/alternative/authentic

assessment, task, and rater. This list of keywords was constructed based on the keywords and

synonyms retrieved from the thesauruses supplied in databases, books and articles reviewed,

authors’ experiences, and feedback from colleagues. Abstract, title, and article keyword

searches were used. Date range restriction was not imposed.

Second, books and journals in language testing, second language acquisition, and

educational measurement were reviewed. The books were those listed in In’nami and


Koizumi (2009), with recent additions: Bachman and Palmer (2010), Fulcher and Davidson

(2012), the ILTA Bibliography of Language Testing (Brown, 2012), the ILTA Bibliography

of PhDs in Language Testing (2011), Shohamy and Hornberger (2010), and the Cambridge

Applied Linguistics series (e.g., Assessing Speaking by Luoma, 2004). The books in second

language acquisition primarily included Ellis (2008), Ortega (2008), and Robinson (2012).

Different editions of the same book were also checked. We also included 26 journals listed in

In’nami and Koizumi (2009) with the following additions: Applied Language Learning,

Assessing Writing, Foreign Language Annals, International Journal of Applied Linguistics,

International Review of Applied Linguistics in Language Teaching, Language Learning &

Technology, and Language Teaching Research.

Third, the relevant studies were searched through communication with other

researchers. In each of the three approaches, the reference list of every paper and chapter,

both published and unpublished, was scrutinized for additional relevant materials.

3.2 Criteria for the Inclusion of a Study

The literature search retrieved approximately 650 studies. Their titles, abstracts, and

study descriptors were inspected to check if they met the following criteria: (a) the study used

G theory, and (b) the test was designed to elicit a certain length of L2 self-created speaking

and/or writing performance (i.e., one or more sentences). Studies that used only reading-

aloud tasks or L1-to-L2 translation tasks were excluded. When two papers used the same data


and designs, we selected the one with more information (e.g., we selected Xi & Mollaun,

2006 instead of Xi, 2007). At this stage, we narrowed down the studies to 45. A sample of

22.22% (n = 10) of the 45 studies was independently examined by both authors to determine

if they met the abovementioned criteria. The agreement percentage was 95, and the kappa

coefficient was 0.85. Disagreement was resolved through discussion. The remaining studies

were examined by the first author.
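
The agreement percentage and kappa coefficient reported above can be computed as in the following minimal sketch. The coding decisions shown are hypothetical and were invented for illustration; they are not the authors' coding data, and the function names are likewise assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Proportion of items on which two coders gave the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two coders corrected for chance."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the coders' marginal proportions per category.
    p_chance = sum(
        (counts_a[c] / n) * (counts_b[c] / n) for c in set(a) | set(b)
    )
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical decisions by two coders (1 = criterion met, 0 = not met).
coder_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
coder_2 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
```

Kappa is lower than raw agreement because it discounts the agreement that two coders would reach by chance given their marginal proportions.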

The studies were further inspected to determine whether (a) persons were modeled as the object of

measurement; (b) all facets of measurement in a study (e.g., tasks, raters, etc.) were modeled

as random, not fixed, because fixed models do not have the aim of generalizing beyond the

condition of each facet (thus, Schoonen, 2005, for example, was excluded); (c) the variance

components or the percentage of variances explained for each facet for a G-study (or a D-

study with the number of each facet) was reported; and (d) moderator variables were reported,

including a G-study design (e.g., person-by-task, person-by-rater; crossed or nested;

univariate or multivariate), skill assessed (i.e., speaking and/or writing), and scoring methods

(i.e., holistic or analytic [including multiple traits]). We defined the task broadly, and included

direct and semi-direct formats. Task types and contexts were judged by examining sample

tasks and/or their descriptions provided in the paper. Integrated tasks were defined as tasks

with input texts that excluded task instructions, and academic tasks were defined as those

with topics/contents related to university studies. Table 1 shows the list of moderator


variables coded (see lines below Number of studies). When necessary, every effort was made

to contact the authors to ask for details. A sample of 22.22% (n = 10) of the 45 studies was

separately examined by both the authors for the four elements mentioned above. The

agreement percentage was 98, and the kappa coefficient was 0.90. Disagreement was

resolved through discussion. The first author investigated the remaining studies. Only the

studies that met all four conditions were used for the analysis. We did not include Van Moere

(2006) and Bolus, Hinofotis, and Bailey (1982) because they used different designs from the

remaining datasets and only one dataset was provided for each design (i.e., person-by-rating-

by-occasion and person-by-rater-by-occasion designs, respectively); these factors precluded a

synthesis of the results. For the same reason, we did not include studies with nested facets

(e.g., Fulcher, 1993). Results from studies that used many-facet Rasch measurement were

excluded because they provided variances explained by persons, tasks, and raters, but not

those explained by task- and rater-related interactions (see Linacre, 2013). Included in our

synthesis were 28 datasets from 21 L2 speaking studies, and 22 datasets from 17 L2 writing

studies (50 datasets from 36 studies, since two studies had both speaking and writing

analyses). Those 36 studies are marked with an asterisk (*) in the references section.

[Insert Table 1 about here]

These 36 studies had one or more of the following facets: task, rater, rating, and

scoring criterion. A facet of rating is used as a proxy for a rater, for example, when a response


from a single examinee is scored by multiple raters (e.g., Raters A and B), and another

response of the same examinee by multiple, different raters (e.g., Raters C and D); ratings

from Raters A and C are considered as Rating 1, as if both raters are exchangeable, and

ratings from Raters B and D are considered as Rating 2, as if both raters are exchangeable.

Subsequently, they are analyzed as a facet of rating (see Brown, 2011; Lee & Kantor, 2005).

The rating method is conceptually simpler because it assumes that scores are similar across

raters (Lin, 2014). Scoring criteria indicate the criteria in analytic scales, such as fluency and

accuracy.

3.3 Analyses

The analyses consisted of three stages. First, we used G-study design as a unit of

analysis. For example, when the results were reported in one article for person-by-task and

person-by-rater designs, both were coded. The results were not combined across designs or

skills (i.e., speaking and writing) because C. Huang (2009) reported that across-design and

across-skill aggregations affect the magnitude of variance components, and make the results

difficult to interpret. When multiple results from the same design were reported in one article,

one of the results was randomly selected. When results from proficiency-classified and

whole-level data were both reported, the whole-level data were used, since they are

considered more representative. Since we combined studies according to G-study design, in

which each dataset contributed only once, data dependency effects were kept to a minimum.


Second, for each dataset, we coded the values of variance components from the

studies. When the reported values were from D studies, we multiplied the values by the

number of levels (e.g., the person-by-task-by-rater variance component was multiplied by the

number of tasks and raters). Thus, values in our synthesis mean those per single facet (e.g.,

per task and per rater), and effects of the number of levels in each facet (e.g., the number of

tasks and raters) are controlled. For standardization, we then calculated the percentage of

variance component for each facet. We calculated sample-size-weighted means and based our

interpretations on them because they are more precise than arithmetic means (e.g., C. Huang,

2009; Lipsey & Wilson, 2001). We also computed standard deviations (SDs) of these

percentages across datasets that employed the same design. We restricted our analysis to

descriptive statistics. Inferential statistics (e.g., homogeneity Q or I² statistics) were not used

because of the small number of studies. We did not calculate confidence intervals for the

means for two reasons: our datasets did not satisfy the normality assumption required to

compute standard errors (C. Huang, 2009), and the small number of datasets hampered the use of

resampling methods.
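
The rescaling of D-study values described in this stage can be sketched as follows. A D study reports each error-related component divided by the number of conditions of the facets it involves, so multiplying by those numbers recovers the per-task and per-rater values. All numbers and names below are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import Dict


def d_to_g_components(
    d_components: Dict[str, float], n_levels: Dict[str, int]
) -> Dict[str, float]:
    """Rescale D-study variance components to single-condition (G-study) values.

    Component names are facet letters joined by "x" (e.g., "pxtxr"); the
    object of measurement "p" has no entry in n_levels and is never rescaled.
    """
    g_components = {}
    for name, value in d_components.items():
        multiplier = 1
        for facet in name.split("x"):
            multiplier *= n_levels.get(facet, 1)
        g_components[name] = value * multiplier
    return g_components


def percentages(components: Dict[str, float]) -> Dict[str, float]:
    """Express each variance component as a percentage of the total."""
    total = sum(components.values())
    return {name: 100.0 * value / total for name, value in components.items()}


# Hypothetical D-study output for 2 tasks (t) and 3 raters (r).
d_study = {"p": 0.50, "t": 0.01, "r": 0.02, "pxt": 0.05,
           "pxr": 0.01, "txr": 0.005, "pxtxr": 0.02}
g_study = d_to_g_components(d_study, {"t": 2, "r": 3})
g_percent = percentages(g_study)
```

After rescaling, the percentages describe a measurement with one task and one rater, which makes datasets with different numbers of tasks and raters comparable.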

Table 2 shows an example of synthesis of the variance components using the person-

by-task design datasets in speaking. We found four studies and calculated the percentage of

variance for each variance component, for each dataset, and then calculated the descriptive


statistics. For example, the sample-size-weighted mean for person was 74.00% ([72.76 ×

1766 + 73.83 × 261 + 70.90 × 49 + 88.86 × 160]/[1766 + 261 + 49 + 160]), and the SD was

8.27%.
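
The calculation above can be reproduced with a few lines of Python. The percentages and sample sizes are the four person values quoted in the text for the person-by-task speaking datasets. Treating the SD as the unweighted sample SD (n − 1 denominator) is an assumption of this sketch, adopted because it matches the reported 8.27%.

```python
from statistics import stdev
from typing import Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Sample-size-weighted mean of the per-dataset percentages."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Person variance percentages and sample sizes quoted in the text.
person_percentages = [72.76, 73.83, 70.90, 88.86]
sample_sizes = [1766, 261, 49, 160]

mean_person = weighted_mean(person_percentages, sample_sizes)
# Unweighted sample SD (n - 1 denominator); assumed, see the lead-in.
sd_person = stdev(person_percentages)

# prints: weighted mean = 74.00%, SD = 8.27%
print(f"weighted mean = {mean_person:.2f}%, SD = {sd_person:.2f}%")
```

The weighted mean is pulled toward the largest dataset (n = 1766), which is why it falls below the unweighted mean of the four percentages.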


Third, for person-by-task interactions, we computed means and SDs of the percentages

of variance components for each moderator variable—task types, contexts, and scoring—to

examine the relationship between these percentages and moderator variables.

4 Results

4.1 Characteristics and Examples of the Datasets

The characteristics of the datasets and moderator or contextual variables considered to

affect the percentage of variance components are summarized in Table 1. The sample size

varied for speaking and writing, which was reflected in the large SDs and minimum-

maximum ranges. More tasks and raters were used in speaking than in writing, particularly

because of the large-scale speaking studies such as Banno (2008) and Lee, Golub-Smith,

Payton, and Carey (2001; see Footnote 2). The number of scoring criteria was larger for speaking

studies because Kondo (2010) included 24 criteria as a scoring criterion facet. While not shown in

Table 1, examples of speaking tasks included TOEFL iBT independent and integrated tasks

(Lee, 2005) and integrated, role-play tasks (Sawaki, 2007). Examples of writing tasks

included independent, letter- and story-writing tasks (Bae & Bachman, 2010) and

independent, argumentative essay writing tasks (Barkaoui, 2007). For rater type and the

presence of rater training, see the rightmost column in the Appendix.

Footnote 2. For example, Banno (2008) used 130 raters to examine rater training effects and differences

in the ratings of teachers versus non-teachers. When Banno (2008) was excluded, the average number of

raters fell to 5.38, a more typical number of raters used in such studies (see Table 1 Note

a).

4.2 L2 Speaking

4.2.1 Percentages of Variance Component Explained by Each Facet

Table 3 shows the percentage of variance component explained by each facet, and the

interaction for speaking performance when one task, one rater, and/or one scoring criterion

were used. Large values are preferable for persons, while small values are usually preferable

for all other facets. The variance components in the person-by-task design were mainly

attributable to persons (74.00%) and person-by-task interactions with undifferentiated errors

(25.38%), while the remaining small percentage of variance was explained by tasks (0.63%).

Larger SDs for persons (8.27) and person-by-task interactions (7.84) than for tasks (0.50)

suggest more variability in the former two sources of variance. Please note that each design models

different sources of variance, and designs are not directly comparable unless they model the

same facet (e.g., the person-by-task design and the person-by-rater design are comparable only in

terms of persons). Results were similar for the person-by-rater and person-by-rating designs,

both of which showed that the variances were substantially attributable to persons rather than


the person-by-rater or person-by-rating interactions. Tasks and raters (or ratings) explained a

small percentage of variance (0.63%–3.32%), suggesting that tasks were of similar difficulty

and that raters rated similarly.


More complete pictures emerged from the remaining designs that operationalize both

task and rater facets—person-by-task-by-rater, person-by-task-by-rating, and person-by-rater-

by-criterion designs. Approximately 40% to 70% of the variance, across the three designs,

was attributable to persons. Raters accounted for a noticeable percentage of the variance

for the person-by-rater-by-criterion design (12.13%). This was rather surprising since the

raters or ratings were responsible for only a small amount of the total variance for the other

two designs that had a rater facet (0.05% and 1.91%).

Additionally, we observed noticeable interaction effects, including the person-by-task

interaction for the person-by-task-by-rater and person-by-task-by-rating designs (9.78% and

13.07%, respectively) and the person-by-rater interaction for the person-by-rater-by-criterion

design (10.57%). The three-way interactions including the undifferentiated errors explained

about 15.61% to 24.90% of the variances. Thus, rather than the tasks or raters per se, the

interaction effects involving persons, tasks, and raters contributed to the score variances.

When we compared task and rater effects in the person-by-task-by-rater and person-

by-task-by-rating designs, we found, on average, that the task effects were larger than the


rater (or rating) effects in the person-by-task-by-rater (2.06% vs. 1.91%) and person-by-task-

by-rating (0.70% vs. 0.05%; both are negligibly small) designs. Further, the person-by-task

interaction effects were larger than the person-by-rater (or rating) interaction effects (9.78%

vs. 1.85%; 13.07% vs. 6.38%).

The person-by-rater-by-criterion design was different from the other designs with

reference to the noticeably large rater and person-by-rater effects observed. In this design, the

rater and rater-related interaction effects contributed more to the score variances than the

scoring criterion and criterion-related interaction effects (12.13% vs. 4.09%; 10.57% vs.

4.99%). The distinctive pattern of larger rater-related effects in this person-by-rater-by-

criterion design than in the other designs deserves attention and is interpreted in the

Discussion section.

4.2.2 Results From Moderator Variable Analysis

To identify possible causes of large variations reported in the previous section, we

conducted a moderator variable analysis on the person-by-task interactions for the person-by-

task-by-rater and the person-by-task-by-rating designs, both of which could separate person-

by-task interactions from undifferentiated errors, unlike the person-by-task design. We

classified results according to three moderator variables: task types, contexts, and scoring

methods. We interpreted only the results based on two or more datasets (see Table 4).



Regarding the person-by-task-by-rater design, Table 4 indicates that the variances

explained by the person-by-task interactions were, overall, larger for datasets using both

independent and integrated tasks (10.80%) than for those using independent tasks only (6.41%).

They were also larger for datasets based on the general context (13.01%) than for those based on

both general and academic contexts (9.20%) or only on academic contexts (1.80%). Further, the

interactions were larger for analytic scoring (12.14%) than for holistic scoring (6.37%). In

contrast, for the person-by-task-by-rating design, these interactions were larger for the

datasets that used only integrated tasks (16.40%) than for those using both independent and

integrated tasks (9.55%), and larger for datasets based on both general and academic

contexts (17.43%) than for those based on an academic context (2.23%). Further, analytic

scoring (16.40%) seems to be associated with larger person-by-task interactions than holistic

scoring (10.95%).

Other than task types, contexts, and scoring types, we found another factor that may

explain large person-by-task interactions: task-specificity in analytic scoring criteria. Two

studies containing analytic scoring criteria with task-specific features (Grabowski, 2009; H.-J.

Kim, 2009) used instruments for assessing pragmatic competence and showed the largest

(17.33%) and third largest (7.49%) person-by-task interactions in the person-by-task-by-rater

design among the datasets used in this study (see Appendix). Please note that we

randomly selected one dataset from each study to avoid dependency; if we considered all

datasets from the two studies, they had the largest and second largest person-by-task


interactions (63.30% and 38.16%, respectively). Grabowski (2009) used five analytic scoring

criteria characterized by task-specific language features: sociocultural appropriateness (e.g.,

metaphor; 63.30% of the variance explained by the person-by-task interactions),

psychological appropriateness (e.g., sarcasm, irony, anger; 57.65%), sociolinguistic

appropriateness (e.g., age, status, power, register; 47.64%), grammatical meaningfulness

(28.14%), and grammatical accuracy (17.33%). Similarly, H.-J. Kim (2009) used six analytic

scoring criteria: sociolinguistic competence (i.e., social appropriateness; 15.91%–38.16%),

task completion (12.75%–26.43%), meaningfulness (10.59%–19.47%), discourse competence

(6.70%–15.14%), grammatical competence (6.28%–12.90%), and intelligibility (4.44%–

10.92%). The larger person-by-task interactions in these two studies seem to be related

to the use of analytic scoring criteria with task-specific features.

4.2.3 Summary in L2 Speaking

We compared task and rater effects across studies, and observed the following:

Most of the score variation reflected differences in the examinees’ performance in

all the designs (40.78%–86.16%).

Task effects were larger than rater (or rating) effects in the person-by-task-by-

rater (2.06% vs. 1.91%) and person-by-task-by-rating (0.70% vs. 0.05%)

designs.

The person-by-task interaction effects were larger than the person-by-rater (or


rating) interaction effects (9.78% vs. 1.85%; 13.07% vs. 6.38%).

These interaction effects made larger contributions to the score variances when

compared with tasks or raters per se.

Two common findings across the designs were the following: (1) a larger

person-by-task interaction was related to assessments using both general and

academic contexts than those using academic contexts only; and (2) a larger

person-by-task interaction was related to analytic scoring than holistic scoring.

More specifically, a larger person-by-task interaction was related to analytic

scoring with task-specific features assessing pragmatic competence.

4.3 L2 Writing

4.3.1 Percentages of Variance Component Explained by Each Facet

Table 5 shows that most of the variance components in the person-by-task design

were roughly equally attributable to persons (44.83%) and person-by-task interactions with

undifferentiated errors (35.20%), whereas a smaller percentage of the variance was explained

by tasks (19.97%). Similar results were found for the person-by-rater design.
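
The percentages reported here express each variance component as a share of the total variance in a dataset; these shares are then summarized across datasets. A minimal Python sketch of that arithmetic follows. The raw variance components and the function names (`percent_of_variance`, `synthesize`) are hypothetical, and the unweighted mean with a sample SD is an assumption; the aggregation actually used in the synthesis may differ.

```python
from statistics import mean, stdev

# Hypothetical raw variance components for two person-by-task datasets.
# Keys: p = persons, t = tasks, pt_e = person-by-task interaction
# confounded with undifferentiated error. Values are illustrative only.
HYPOTHETICAL_DATASETS = [
    {"p": 0.50, "t": 0.10, "pt_e": 0.40},
    {"p": 0.90, "t": 0.30, "pt_e": 0.30},
]


def percent_of_variance(components):
    """Express each variance component as a percentage of the total variance."""
    total = sum(components.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {name: 100.0 * value / total for name, value in components.items()}


def synthesize(datasets):
    """Return {facet: (mean %, sample SD of %)} across datasets (assumed unweighted)."""
    percents = [percent_of_variance(d) for d in datasets]
    summary = {}
    for facet in percents[0]:
        values = [p[facet] for p in percents]
        summary[facet] = (mean(values), stdev(values))
    return summary


if __name__ == "__main__":
    for facet, (avg, sd) in synthesize(HYPOTHETICAL_DATASETS).items():
        print(f"{facet}: mean = {avg:.2f}%, SD = {sd:.2f}")
```

Because each dataset is converted to percentages before averaging, datasets scored on different scales contribute on a common footing, which is the reason for working with percentages rather than raw variance components.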


The results of the three-way designs were also consistent. Regardless of the designs,

overall, the variances were explained by persons, two-way interactions (e.g., person-by-task

or person-by-rater), and three-way interactions with undifferentiated errors (i.e., person-by-


task-by-rater, person-by-task-by-rating, and person-by-rater-by-criterion). For example, for

the person-by-task-by-rater design, the variance components were primarily attributable to

persons (64.09%), person-by-task interactions (14.53%), and person-by-task-by-rater

interactions (13.01%). Additionally, the variance components for the task were noticeable for

the person-by-task-by-rater and person-by-task-by-rating designs (5.49% and 9.46%,

respectively), as were the variance components for the rater for the person-by-rater-by-

criterion design (21.15%).

When we compared task and rater effects in the person-by-task-by-rater and person-

by-task-by-rating designs, we found task effects were larger than rater (or rating) effects in

the person-by-task-by-rater (5.49% vs. 1.13%) and person-by-task-by-rating (9.46% vs.

0.36%) designs. In addition, the person-by-task interaction effects were larger than the

person-by-rater (or rating) interaction effects (14.53% vs. 1.27%; 19.28% vs. 1.06%). Thus,

similar to the results for L2 speaking, the interaction effects involving tasks contributed more

to the score variances than did the raters. Unlike L2 speaking, however, the task facet also

contributed to the variances in the two designs. One unique feature of the person-by-rater-by-

criterion design was a noticeable rater effect and a person-by-rater effect, which was

consistent across L2 speaking and writing.

4.3.2 Results From Moderator Variable Analysis

Careful scrutiny of the relationship between the person-by-task interactions and study


characteristics (Table 4) was difficult due to the small number of datasets. If we interpreted

results with two or more datasets, the results for the person-by-task-by-rating design suggest

that the person-by-task interactions were larger for the datasets based only on academic

contexts (23.89%), as compared to those based on both general and academic contexts

(18.58%).

4.3.3 Summary in L2 Writing

We compared the task and rater effects across studies and found the following:

Most of the score variation reflected differences in the examinees’ performance in

all the designs except for the person-by-rater-by-criterion design (44.83%–

82.62%).

The task effects were larger than rater (or rating) effects in the person-by-task-

by-rater (5.49% vs. 1.13%) and person-by-task-by-rating (9.46% vs. 0.36%)

designs.

The person-by-task interaction effects were larger than the person-by-rater (or

rating) interaction effects (14.53% vs. 1.27%; 19.28% vs. 1.06%).

These interaction effects made larger contributions to the score variances as

compared with tasks or raters per se.

In one design, the person-by-task interactions were larger for the datasets based

only on academic contexts, rather than for those based on both general and


academic contexts.

5 Discussion

Research Question 1: Which effects—task and task-related interaction effects or rater

and rater-related interaction effects—explained more of the score variances?

On average, the variance components were primarily explained by persons across L2

speaking and writing, although the relatively large SDs indicated varied effects of persons

across studies. The interaction effects related to tasks, raters, ratings, and scoring criteria

followed this. However, exceptions were observed for the person-by-rater-by-criterion design.

Specifically for this design, we observed smaller person variations and larger variations due

to effects of raters and person-by-rater interactions. One reason for this may be that a facet of

scoring criterion is not very typical because “a facet is simply a set of similar conditions of

measurement” (Brennan, 2001, p. 5), and different scoring criteria are primarily used to

assess different aspects of speaking and writing; thus, the person-by-rater-by-criterion design

may need to be analyzed as the person-by-rater design with scoring criteria as multivariate

variables in multivariate G theory. Therefore, the possible misspecification of the design may

have caused these inconsistent results. Another reason could be that multiple scoring criteria

alter rating processes, which could lead to larger rater-related variations. A third reason, as

suggested by a reviewer, could be the effect of unmodeled variables that might have affected

the proper estimation of other variance components. These possibilities need to be explored


further before we suggest implications based on this finding.

We compared task and rater effects in the person-by-task-by-rater and person-by-task-

by-rating designs across L2 speaking and writing. The task effects were consistently larger

than rater (or rating) effects. In addition, the person-by-task interaction effects were larger

than the person-by-rater (or rating) interaction effects. Thus, task and task-related interaction

effects made larger contributions to the score variances than the rater and rater-related

interaction effects. Furthermore, the interaction effects were non-negligible in every design.

Compared with tasks or raters per se, interaction effects involving persons, tasks, and raters (or

ratings and scoring criteria) made larger contributions to the score variances. These results

suggest the importance of considering interaction effects in test design, development, and

validation.

These findings, in general, concur with those of the previous studies reviewed above.

Brown (2011) reported that most studies showed a larger effect of the person-by-task

interactions than person-by-rater interactions, which is similar to the general trend of a larger

person-by-task interaction effect found in the current study. While the person-by-task

interactions were large on average in L2 speaking and writing (9.78%–19.28%), they were

not particularly large as compared to findings in other fields (e.g., 27.46% in L1 writing

reported in C. Huang, 2009). However, it should be noted that the larger impact of task-

related factors rather than that of rater-related factors may be restricted to cases in which rater


training was (likely) conducted, because a majority of the datasets (79.07%, 34/43) included

in our synthesis were obtained from studies that involved rater training. This was evidenced

by the L2 speaking results in the person-by-task-by-rater design. When the percentage of the

variance components was calculated (not shown in tables) in the three studies without rater

training (see Appendix), the variance attributable to the person-by-rater interactions (9.01%;

SD = 3.06) was larger than that attributable to the person-by-task interactions (2.07%; SD =

1.83); rater variance (4.20%; SD = 3.44), however, was smaller than task variance (13.97%;

SD = 11.25). Therefore, having obtained smaller rater-related variances than task-related

variances does not suggest that fewer resources should be spent on rater training; rather, the

person-by-rater interactions could be large without rater training. Additionally, variables other

than rater training, such as the quality of rating scales, may also be responsible.

Furthermore, L2 speaking and writing studies differed as follows. While rater effects

were overall small across modalities, a larger percentage of the variance was explained by the

task in L2 writing (5.49%–19.97%) than in L2 speaking (0.63%–2.06%). Similarly, the

person-by-task interactions explained a larger percentage of the variance in L2 writing than in

L2 speaking (14.53%–19.28% vs. 9.78%–13.07%). This means that the variability in rater

judgments was small overall, but the tasks differed in their levels of difficulty, and more so in

the case of L2 writing than in L2 speaking. Moreover, the ranking of the examinees differed,

depending on task difficulty, more in L2 writing. This may be explained by the use of tasks


with greater diversity of constructs and difficulty. For example, Abeywickrama (2008)

reported a massive 50.75% task variance for summary, gap-fill, and essay tasks (14.24% for

person, and 35.01% for person-by-task interactions with undifferentiated errors; partially

reported in Table 5) that measured cohesion using analytic rating scales. Designed to measure

different aspects of writing constructs, they varied in difficulty. As performance assessments

comprise only a few tasks, the selection of tasks greatly influences the difficulty and

measured construct.

Although the results across the person-by-task-by-rater and person-by-task-by-rating

designs in L2 speaking and writing were similar, one difference was the larger magnitude of

the person-by-task interactions in the person-by-task-by-rating design than in the person-by-

task-by-rater design (13.07% vs. 9.78%; 19.28% vs. 14.53%). This suggests that the two designs, while producing similar results

overall, are not the same. In the person-by-task-by-

rating design, more raters are involved in the rating processes. Consequently, small

divergences in ratings for each response in a task could add up to larger person-by-task

interactions.

In summary, answers to Research Question 1 are that task effects were larger than

rater (or rating) effects, person-by-task interaction effects were larger than person-by-rater (or

rating) interaction effects, and that such interaction effects were larger than task or rater

effects per se.


Research Question 2: Are the degrees of the person-by-task interactions related to task

types, contexts, and scoring methods?

The synthesized results consistently showed that a high percentage of variance was

explained by the person-by-task interactions. A moderator variable analysis was conducted on

the person-by-task interactions for the person-by-task-by-rater and person-by-task-by-rating

designs. Since L2 writing had a limited number of datasets, we discuss results for L2

speaking with a specific focus on task types, contexts, and scoring methods. First, Xi and

Mollaun (2006) speculated that the person-by-task interactions would be larger for

assessments using both independent and integrated tasks, and for those based on both general

and academic contexts, than for those using only one task type or one context type. This

speculation was partially supported in L2 speaking, where the person-by-

task interactions were larger for datasets based on both general and academic contexts than for

those based only on an academic context, across the two designs. In addition, in the person-

by-task-by-rater design, studies that used both independent and integrated tasks produced

larger interactions than did those that used only independent tasks. This indicates that a

broader construct definition, with more diverse task types and contexts, might increase the

person-by-task interactions. However, opposite trends were also observed. In the person-by-

task-by-rater design, datasets based on both general and academic contexts produced smaller

interactions than did those based only on a general context. Further, in the person-by-task-by-


rating design, studies that used both independent and integrated tasks produced smaller

interactions than did those that used integrated tasks. Thus, in L2 speaking, Xi and Mollaun’s

(2006) prediction is only partially corroborated.

Second, regarding scoring methods, Xi and Mollaun (2006) speculated that the

person-by-task interactions would be smaller in holistic scoring because the task-specific

language features in the scoring rubrics would have less of an influence when scored

holistically, with more underlying stable abilities in mind. This was supported by the two

designs in L2 speaking with analytically scored datasets producing larger interactions. Thus,

scoring methods seemed to be systematically related to the size of the person-by-task

interactions.

Additionally, consistent with Xi and Mollaun (2006), the scrutiny of Grabowski

(2009) and H.-J. Kim (2009) suggests a relationship between the person-by-task interactions

and scoring criteria (or rubrics) that reflect task-specific language features. Performance

variability was particularly attributable to the analytic scoring criteria measuring sociocultural

appropriateness (e.g., metaphor), psychological appropriateness (e.g., sarcasm, irony, anger),

and sociolinguistic appropriateness (e.g., age, status, power, register). Tasks measuring these

constructs are highly contextualized. Further, tasks that measure metaphor or sarcasm are

particularly dependent on the discourse and the social contexts in which they are embedded.

As expected, examinees may have performed differently across tasks. This suggests that


obtaining large person-by-task interactions is not necessarily undesirable or construct-

irrelevant, because it may indicate that contextual effects were well operationalized in the

tasks, and that the magnitudes of such interactions indicate the size of the contextual effects

on test performance. Thus, if an assessment instrument includes task-specific language

features in the analytic scoring criteria, such as those discussed above, researchers should

expect to observe large person-by-task interactions. Nevertheless, such large interactions

weaken the ability to generalize from the examinees’ performance in the sample of tasks, to

the universe consisting of all possible tasks in the performance assessment (i.e., the universe

of generalization; Kane, Crooks, & Cohen, 1999). We should be especially careful to define

the construct narrowly enough to minimize performance variability, but broadly enough to

ensure that domain representation is not undermined. This approach could prove useful for

large-scale assessments, where increasing the number of tasks would be difficult due to

logistic and time constraints (Xi & Mollaun, 2006).

In summary, in response to Deville and Chalhoub-Deville’s (2006), Schoonen’s

(2012), and Xi and Mollaun’s (2006) call for research into contextual features that relate to

person-by-task interactions, our results on L2 speaking indicate that analytic scoring is likely

to produce larger interactions than holistic scoring; analytic scoring criteria with more task-

specific language features generate larger variations. In addition, person-by-task interactions

are generally larger for assessments based on both general and academic contexts than for


those based only on an academic context. Thus, among many contextual variables, contexts,

scoring methods, and scoring criteria may lead to varied performance over tasks, and require

test developers and researchers to pay close attention to these variables. Results for task types

were not consistent across the designs.

While these results do not appear to be surprising, they have not been empirically

examined since calls for research were raised by Deville and Chalhoub-Deville (2006),

Schoonen (2012), and Xi and Mollaun (2006). Further, the results that the person-by-task

interactions were generally larger for assessments based on a general context only than for

those based on both general and academic contexts, and based on integrated tasks only than

on both independent and integrated tasks in each design, were not in line with Xi and

Mollaun (2006) or, in general, with Fulcher (2003) and Schoonen (2012). They argued that

the inclusion of different contexts produces more variation in scores. These

inconsistent results across the studies clearly show the complexity of factors affecting L2

speaking and writing; explaining them requires a greater number of primary studies and research

syntheses.

Implications for language testers

Since we usually want to generalize test results across tasks, we need to reduce

variances in the person-by-task interactions. One approach is to increase the number of tasks;


when the number of tasks is doubled, the task-related variances are halved, and we obtain

higher reliability (i.e., generalizability). When it is difficult to increase the number of tasks,

another way to accomplish this goal is to use similar tasks, as this increases the score

generalizability across tasks at the expense of narrowing down the construct definition. Our

findings for the effects of context suggest that when we intend to narrow the construct

definition in L2 speaking, one option is to employ only academic-context tasks rather than

using tasks from both general and academic contexts. Further, based on our results, the use of

holistic scoring over analytic scoring may be another option since it tends to produce less

variation, and thus higher reliability, while maintaining the original construct definition. A

third method is to examine the construct we want to assess, the domain to which we want to

generalize the result, and the minimum reliability we should obtain, and then to find a middle

ground by adjusting the construct definition to the extent that we can maintain the domain

representation in an assessment. In this case, we should first examine whether the task-related

variations derived are large, and are related to the construct intended, such that it is preferable

to maintain them, like the large person-by-task variations observed in Grabowski’s (2009)

study of sociocultural appropriateness. In judging such relative magnitudes, our findings

about synthesized percentages of variances explained by tasks, raters, and their interactions

may serve as guidelines for judging whether a study’s percentages of variances explained by

tasks, raters, and their interactions, are comparatively large.
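
The effect of adding tasks on the task-related error variances can be sketched with the standard D-study formula for a fully crossed, random person-by-task-by-rater design. The Python sketch below plugs in the synthesized L2 writing percentages reported in Section 4.3.1 (persons 64.09%, person-by-task 14.53%, person-by-rater 1.27%, and person-by-task-by-rater with undifferentiated errors 13.01%). Using averaged percentages in place of a single study's variance components is an assumption made purely for illustration, and the function names are hypothetical.

```python
# Synthesized percentages for the L2 writing person-by-task-by-rater design,
# used here as stand-ins for variance components (an illustrative assumption;
# the coefficient depends only on ratios, so the scale does not matter).
VAR = {"p": 64.09, "pt": 14.53, "pr": 1.27, "ptr_e": 13.01}


def relative_error(var, n_tasks, n_raters):
    """Relative error variance for a crossed p x t x r random design.

    Each interaction component is divided by the number of conditions
    averaged over, so doubling n_tasks halves the person-by-task term.
    """
    return (
        var["pt"] / n_tasks
        + var["pr"] / n_raters
        + var["ptr_e"] / (n_tasks * n_raters)
    )


def g_coefficient(var, n_tasks, n_raters):
    """Generalizability coefficient for relative decisions."""
    return var["p"] / (var["p"] + relative_error(var, n_tasks, n_raters))


if __name__ == "__main__":
    for n_tasks in (1, 2, 4, 8):
        coef = g_coefficient(VAR, n_tasks, n_raters=2)
        print(f"tasks = {n_tasks}, raters = 2, coefficient = {coef:.3f}")
```

In this sketch the person-by-rater term is untouched by the number of tasks, which mirrors the point made above: when person-by-task interactions dominate the error, adding tasks is the more effective route to higher generalizability than adding raters.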


6 Summary, Future Research, and Limitations

This study examined (a) whether task and task-related interaction effects or rater and

rater-related interaction effects explain more of the score variances, and (b) whether the

degrees of person-by-task interactions are related to task types, contexts, and scoring methods.

A synthesis of studies using G theory was conducted, and the results showed that the task and

task-related interactions were more influential than the rater and rater-related interactions.

Regarding (b), L2 speaking studies show that the person-by-task interactions were larger for

assessments based on both general and academic contexts than for those based on academic

contexts only. Similarly, they were larger for assessments using analytic, rather than holistic,

scorings, and for analytic scoring criteria with task-specific language features such as

sociocultural or sociolinguistic appropriateness. Some of these results correspond well with

Xi and Mollaun’s (2006) prediction, but at the same time, differing trends were also observed

in some designs, which suggests the complexity of factors influencing L2 speaking and

writing.

We discuss two directions for future research. First, in addition to increasing the

number of studies investigating the effects of tasks, raters and their moderator variables for

further syntheses, it would be of great interest to compare percentages of variance

components (especially person-by-task interactions), using the same participants’ responses,

between only independent tasks, only integrated tasks, and both tasks in general and/or


academic contexts; between holistic and analytic scorings; and between speaking and writing.

We can analyze each separately or incorporate it as a facet in a generalizability study, and

examine how the person-by-task interactions would change in size accordingly, using studies

of similar designs. This method will enable researchers to hold the person facets constant, and

inspect the variables in focus. These studies are particularly needed, given the paucity of

studies on some designs observed during the current study, and that the task types

(independent vs. integrated) and contexts (general vs. academic) were often confounded.

Second, it is equally important to examine if the person-by-task interactions would be

larger for rating scales characterized by more richly contextualized and specific descriptors in

particular domains, as suggested by Xi and Mollaun (2006). Since such scales (e.g., Fulcher,

Davidson, & Kemp, 2011; Upshur & Turner, 1995) need to be designed for each task, they

are more likely to uniquely reflect the context and the complexities of language use. These

scales were not found among the datasets we analyzed, and investigation into this issue

would certainly hold promise for understanding the person-by-task interactions better.

The current study is limited in its narrow focus on synthesizing only studies on task

and rater effects that used G theory, while these effects have also been investigated using

other methods (e.g., Eckes, 2011; Knoch, 2011). The context, as defined in a G study, refers

to task characteristics such as independent/integrated task types or general/academic contexts,

and does not fully capture what Deville and Chalhoub-Deville (2006) call an “ability-in-


language user-in-context” approach to defining a construct. While this new broader approach

attempts to examine the socially and cognitively dynamic nature of context, our focus on the

G-theory-based studies necessarily restricted our operationalization of the context. Further,

we conducted an exhaustive search for studies that used G-theory, and believe that the

collected studies included a reasonably representative sample of speaking and writing tasks.

However, this does not mean that these collected studies included a representative sample of

all speaking and writing tasks used in L2 language testing. We focused on studies using G-

theory, and we are aware of studies on task and rater effects using other analytical

frameworks. Finally, the number of datasets for certain G-study designs was not large enough

to examine, in detail, the percentages of variance components accounted for by each facet in

relation to the moderator variables. In particular, a paucity of G studies in writing

assessments examining the person-by-task interactions suggests the need for more studies to

better understand the relative effects of various facets, and the way contextual variables

moderate these effects.

Nevertheless, we hope that the current study acts as a springboard for more inquiry

into the task and rater effects in L2 speaking and writing.

Acknowledgement

An earlier version of this paper was presented at the 2013 American Association for

Applied Linguistics Conference in Dallas, Texas, USA and at the 2013 Language Testing


Research Colloquium Conference in Seoul, South Korea. We thank Rob Schoonen for

helping us interpret our results, Yuko Hijikata and Wei-Li Hsu for locating and retrieving

some studies for the current synthesis, Yujia Zhou for assisting us in coding the articles

written in Chinese, and the Editor and two anonymous reviewers for their valuable comments

on earlier versions of this paper. This work was partially funded by the Japan Society for the

Promotion of Science (JSPS) KAKENHI, Grant-in-Aid for Scientific Research (C), Grant

Numbers 26370736 and 26370737. The authors contributed equally to this work. The coding

tables are available in The University of York & Georgetown University (n.d.).

References

References marked with an asterisk indicate articles included in the synthesis.

*Abdul Kadir, K. (2008). Framing a validity argument for test use and impact: The

Malaysian public service experience (Doctoral dissertation). Retrieved from ProQuest.

(AAT 3337680)

*Abeywickrama, P.-S. (2008). Measuring the knowledge of textual cohesion and coherence in

learners of English as a second language (ESL) (Doctoral dissertation). Retrieved from

ProQuest. (AAI 3288206)

*Alharby, E. R. (2006). A comparison between two scoring methods, holistic vs. analytic,

using two measurement models, the Generalizability Theory and the Many-facet Rasch

Measurement, within the context of performance assessment (Doctoral dissertation).


Retrieved from ProQuest. (AAT 3236860)

Bachman, L. F. (2007). What is the construct? The dialectic of abilities and contexts in

defining constructs in language assessment. In J. Fox, M. Wesche, D. Bayliss, L. Cheng,

C. E. Turner, & C. Doe (Eds.), Language testing reconsidered (pp. 41–71). University

of Ottawa Press.

*Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and

rater judgments in a performance test of foreign language speaking. Language Testing,

12, 239–257.

Bachman, L., & Palmer, A. (2010). Language assessment in practice. Oxford, UK: Oxford

University Press.

*Bae, J., & Bachman, L. F. (2010). An investigation of four writing traits and two tasks

across two languages. Language Testing, 27, 213–234.

*Banno, E. (2008). Investigating an oral placement test for learners of Japanese as a second

language (Doctoral dissertation). Retrieved from ProQuest. (AAI 3300329)

*Barkaoui, K. (2007). Rating scale impact on EFL essay marking: A mixed-method study.

Assessing Writing, 12, 86–107.

Bolus, R. E., Hinofotis, F. B., & Bailey, K. M. (1982). An introduction to generalizability

theory in second language research. Language Learning, 32, 245–258.

Brennan, R. L. (1996). Generalizability of performance assessments. In Phillips (Ed.),


Technical issues in large-scale performance assessment (pp. 19–58). Washington, DC:

National Center for Educational Statistics.

Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer.

Brown, A. (Ed.). (2012). ILTA bibliography of language testing (6th Edition, 1999–2011) By

Category. Retrieved from

http://www.iltaonline.com/images/pdfs/2011_by_category.pdf

Brown, J. D. (2011). What do the L2 generalizability studies tell us? International Journal of

Assessment and Evaluation in Education, 1, 1–37.

*Brown, J. D., & Bailey, K. M. (1984). A categorical instrument for scoring second language

writing skills. Language Learning, 34, 21–38.

Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F.

Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and

language testing research (pp. 32–70). New York, NY: Cambridge University Press.

Deville, C., & Chalhoub-Deville, M. (2006). Old and new thoughts on test score variability:

Implications for reliability and validity. In M. Chalhoub-Deville, C. A. Chapelle, & P.

Duff (Eds.), Inference and generalizability in applied linguistics: Multiple perspectives

(pp. 9–25). Amsterdam, the Netherlands: John Benjamins.

*Eckes, T. (2011). Introduction to many-facet Rasch measurement: Analyzing and evaluating

rater-mediated assessments. Frankfurt am Main: Peter Lang.


Fulcher, G. (1993). The construction and validation of rating scales for oral tests in English

as a foreign language. Unpublished PhD dissertation, University of Lancaster.

Fulcher, G. (2003). Testing second language speaking. Essex, U.K.: Pearson Education

Limited.

Fulcher, G., & Davidson, F. (2012). The Routledge handbook of language testing. New York:

Routledge.

Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking

tests: Performance decision trees. Language Testing, 28, 5–29.

Gebril, A. (2009). Score generalizability of academic writing tasks: Does one test method fit

it all? Language Testing, 26, 507–531.

*Grabowski, K. C. (2009). Investigating the construct validity of a test designed to measure

grammatical and pragmatic knowledge in the context of speaking (Doctoral

dissertation). Retrieved from ProQuest. (AAI 3368256)

*Hirai, A., & Koizumi, R. (2008). Validation of an EBB scale: A case of the Story Retelling

Speaking Test. JLTA (Japan Language Testing Association) Journal, 11, 1–20.

Huang, C. (2009). Magnitude of task-sampling variability in performance assessment: A

meta-analysis. Educational and Psychological Measurement, 69, 887–912.

*Huang, J. (2008). How accurate are ESL students’ holistic writing scores on large-scale

assessments? A generalizability theory approach. Assessing Writing, 13, 201–218.


*Huang, J. (2012). Using generalizability theory to examine the accuracy and validity of

large-scale ESL writing assessment. Assessing Writing, 17, 123–139.

ILTA Bibliography of PhDs in Language Testing. (2011). Retrieved from

http://www.iltaonline.com/images/pdfs/phds_2010.pdf

In’nami, Y., & Koizumi, R. (2009). A meta-analysis of test format effects on reading and

listening test performance: Focus on multiple-choice and open-ended formats.

Language Testing, 26, 219–244. doi:10.1177/0265532208101006

In’nami, Y., & Koizumi, R. (2010). Database selection guidelines for meta-analysis in applied

linguistics. TESOL Quarterly, 44, 169–184. doi:10.5054/tq.2010.215253

In’nami, Y., & Koizumi, R. (2014). Research synthesis and meta-analysis in second language

learning and testing. English Teaching & Learning, 38(3), 1–27.

doi:10.6330/ETL.2014.38.3.01

Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational

Measurement: Issues and Practice, 18(2), 5–17.

*Kim, H.-J. (2009). Investigating the effects of context and task type on second language

speaking ability (Doctoral dissertation). Retrieved from ProQuest. (AAT 3368349)

*Kim, Y.-H. (2009). A G-theory analysis of rater effect in ESL speaking assessment. Applied

Linguistics, 30, 435–440.

*Kinshi, K., Kuru, Y., Masaki, M., Yamanishi, H., & Otoshi, J. (2011). Revising a writing


rubric for its improved use in the classroom. LET Kansai Chapter Collected Papers, 13,

113–124.

Knoch, U. (2011). Investigating the effectiveness of individualized feedback to rating

behavior–a longitudinal study. Language Testing, 28, 179–200.

*Kondo, Y. (2010). Examination of rater training effect and rater eligibility in L2

performance assessment. Journal of Pan-Pacific Association of Applied Linguistics, 14,

1–23.

*Lee, Y.-W. (2005). Dependability of scores for a new ESL speaking test: Evaluating

prototype tasks (Monograph Series, MS-28). Retrieved from

http://www.ets.org/Media/Research/pdf/RM-04-07.pdf

Lee, Y.-W., Breland, H., & Muraki, E. (2004). Comparability of TOEFL CBT prompts for

different native language groups (RR-04-24). Retrieved from

http://www.ets.org/Media/Research/pdf/RR-04-24.pdf

*Lee, Y.-W., Golub-Smith, M., Payton, C., & Carey, J. (2001). The score reliability of the Test

of Spoken English (TSE) from the generalizability theory perspective: Validating the

current procedure. Retrieved from ERIC database. (ED458241)

*Lee, Y.-W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating

prototype task and alternative rating schemes (Monograph Series, MS-31). Retrieved

from http://www.ets.org/Media/Research/pdf/RR-05-14.pdf


Lin, C.-K. (2014). Treating either ratings or raters as a random facet in performance-based

language assessments: Does it matter? CaMLA Working Papers 2014-01. Cambridge

Michigan Language Assessments. Retrieved from

http://www.cambridgemichigan.org/sites/default/files/resources/workingpapers/CWP-2014-01.pdf

Linacre, J. M. (2013). A user’s guide to FACETS: Rasch-Model computer programs (Program

manual 3.71.0). Retrieved from http://www.winsteps.com/a/facets-manual.pdf

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Lumley, T., & O’Sullivan, B. (2005). The effect of test-taker gender, audience and topic on

task performance in tape-mediated assessment of speaking. Language Testing, 22, 415–437.

Luoma, S. (2004). Assessing speaking. Cambridge, UK: Cambridge University Press.

*Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and Many-facet Rasch

measurement in the development of performance assessments of the ESL speaking

skills of immigrants. Language Testing, 15, 158–180.

*Mizumoto, A. (2008). Jiyu eisakubun ni okeru hyoteisha hyoka no shurui to shinraisei

[Types of evaluation by raters and reliability in an English essay]. The Institute of

Statistical Mathematics cooperative research report, 215, 43–49.

*Molloy, H., & Shimura, M. (2005). An examination of situational sensitivity in medium-


scale interlanguage pragmatics research. In T. Newfields, Y. Ishida, M. Chapman, & M.

Fujioka (Eds.), Proceedings of the May 22–23, 2004 JALT Pan-SIG Conference (pp.

16–32). Retrieved from http://www.jalt.org/pansig/2004/HTML/ShimMoll.htm

*Nekoda, H. (2006). Developing a standard of language proficiency required for English

teachers: Using generalizability theory and item response theory. Annual Review of

English Language Education in Japan, 17, 191–200.

Norris, J. M., & Ortega, L. (2006). The value and practice of research synthesis for language

learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on

language learning and teaching (pp. 3–50). Philadelphia, PA: John Benjamins.

*Ohkubo, N. (2006). Shido to hyoka no ittaika wo mezashita shinraisei no takai eisakubun

hyoka kijun hyo no sakusei: Tahenryo ippanka kanousei riron wo mochiite

[Development of a reliable and valid scale of English writing assessment using

multivariate generalizability theory]. STEP Bulletin, 18, 14–29.

Ortega, L. (2008). Understanding second language acquisition. New York: Routledge.

Oswald, F. L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and

challenges. Annual Review of Applied Linguistics, 30, 85–110.

Robinson, P. (Ed.). (2012). The Routledge encyclopedia of second language acquisition. New

York: Routledge.

*Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment:


Reporting a score profile and a composite. Language Testing, 24, 355–390.

Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation

modeling. Language Testing, 22, 1–30.

Schoonen, R. (2012). The generalisability of scores from language tests. In G. Fulcher & F.

Davidson (Eds.), The Routledge handbook of language testing (pp. 363–377). New

York: Routledge.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Thousand Oaks,

CA: Sage.

Shohamy, E., & Hornberger, N. H. (Eds.). (2010). Encyclopedia of language and education:

Volume 7: Language testing and assessment. New York: Springer.

Skehan, P. (1998). A cognitive approach to language learning. Oxford, UK: Oxford

University Press.

*Stansfield, C. W., & Kenyon, D. M. (1992a). Research on the comparability of the oral

proficiency interview and the simulated oral proficiency interview. System, 20, 347–364.

*Stansfield, C. W., & Kenyon, D. M. (1992b). The development and validation of a

Simulated Oral Proficiency Interview. Modern Language Journal, 76, 129–141.

*Tang, X. (2006). Investigating the score reliability of the English as a foreign language

performance test (Doctoral dissertation). Retrieved from ProQuest. (MR18823)

The University of York & Georgetown University. (n.d.). IRIS: A digital repository of data


collection instruments for research into second language learning and teaching.

Retrieved from http://www.iris-database.org/iris/app/home/index

Upshur, J., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT

Journal, 49, 3–12.

van Moere, A. (2006). Validity evidence in a university group oral test. Language Testing, 23,

411–440.

*Wang, H. (2010). Investigating the justifiability of an additional test use: An application of

assessment use argument to an English as a foreign language test (Doctoral

dissertation). Retrieved from ProQuest. (AAT 3441468)

*Wang, L., Eignor, D., & Enright, M. K. (2008). A final analysis. In C. A. Chapelle, M. K.

Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as

a Foreign Language (pp. 259–318). New York: Routledge.

Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15,

263–287.

*Xi, X. (2003). Investigating language performance on the graph description task in a semi-

direct oral test (Doctoral dissertation). Retrieved from ProQuest. (3100694)

Xi, X. (2007). Evaluating analytic scoring for the TOEFL® Academic Speaking Test (TAST)

for operational use. Language Testing, 24, 251–286.

Xi, X. (2015, March). Language constructs revisited for practical test design, development


and validation. Paper presented at the 37th Language Testing Research Colloquium,

Toronto, Ontario, Canada.

*Xi, X., & Mollaun, P. (2006). Investigating the utility of analytic scoring for the TOEFL

Academic Speaking Test (TAST). (TOEFL iBT Research Report, RR-06-07). Retrieved

from http://www.ets.org/Media/Research/pdf/RR-06-07.pdf

*Xi, X., & Mollaun, P. (2011). Using raters from India to score a large-scale speaking test.

Language Learning, 61, 1222–1255.

*Yamanishi, H. (2005). Ippanka kanousei riron wo mochiita kokosei no jiyueisakubun hyoka

no kento [Using Generalizability Theory in the evaluation of L2 writing]. JALT Journal,

27, 169–185.

*Zhou, Y. (2012). Generalizability of scores on structured and constructed-response tasks in

computer-delivered speaking assessment. Manuscript submitted for publication.


Appendix The Moderator Variables in the Person-by-Task-by-Rater (Rating) Designs in Speaking and Writing

| Author(s), N | p x t (Range for all datasets)^a | Study context; examinees’ L1/L2 | Publication status; G-study type | Task type; context; scoring type (Analytic scoring criteria) | Rater type; training |
|---|---|---|---|---|---|
| Speaking p x t x r | | | | | |
| Grabowski (2009), 102 | 17.33 (17.33–63.30) | Second; varied & E | Dissertation; multivariate | Independent; general; analytic (Grammatical accuracy and other 4 criteria) | Native; yes |
| Xi & Mollaun (2006), 140 | 13.75 (13.73–21.27) | Second; varied & E | Published; multivariate | Both; both; analytic (Delivery and other 2 criteria) | Both; yes |
| H.-J. Kim (2009), 162 | 7.49 (4.44–38.16) | Second; varied & E | Dissertation; multivariate | Independent, integrated; both; analytic (Grammatical competence and other 5 criteria) | Unknown; yes |
| Stansfield & Kenyon (1992a), 40 | 6.80 (0.00–9.00) | Varied; varied & Hebrew | Published; univariate | Independent; general; holistic | Unknown; maybe yes |
| Xi & Mollaun (2011), 100 | 6.68 (6.68–12.55) | Second; varied & E | Published; univariate | Both; both; holistic | Nonnative; yes |
| Stansfield & Kenyon (1992b), 16 | 4.51 (3.85–5.16) | Foreign; unknown & Indonesian | | Independent; general; holistic | Both; maybe yes |
| Xi (2003), 20 | 3.91 | Varied; varied & E | Dissertation; univariate | Independent; both; holistic | Unknown; yes |
| Banno (2008), 6 | 3.64 for holistic; 4.73 for analytic | Second; Chinese & Japanese | Dissertation; univariate | Independent; general; holistic and analytic (5 criteria) | Native; no |
| Nekoda (2006), 26 | 2.48 | Foreign; Japanese & E | | Independent; academic; holistic | Nonnative; no |
| Y.-H. Kim (2009), 10 | 0.06 (0.06–0.28) | Second; Korean & E | | Independent; academic; holistic | Both; no |
| Speaking p x t x r’ | | | | | |
| Tang (2006), 1099 | 21.97 (11.00–23.50) | Foreign; Chinese & E | Dissertation; univariate | Integrated; both; analytic | Unknown; yes |
| Lee (2005), 261 | 17.27 | Both; varied & E | Published; univariate | Both; both; holistic | Unknown; maybe yes |
| Lee et al. (2001), 1,766 | 11.45 (11.45–12.74) | Second; varied & E | Unpublished; univariate | Independent; general; holistic | Native; maybe yes |
| Wang et al. (2008), 373 | 4.00 | Second; varied & E | Published; univariate | Both; both; holistic | Native; yes |
| Sawaki (2007), 214 | 3.45 (0.00–3.45) | Foreign; English & Spanish | Published; multivariate | Integrated; academic; analytic (Pronunciation and other 4 criteria) | Both; yes |
| Bachman et al. (1995), 218 | 1.04 | Foreign; varied & Spanish | | Integrated; academic; analytic (Grammar) | Both; yes |
| Writing p x t x r | | | | | |
| Lee & Kantor (2005), 162 | | | | Both; both; holistic | Unknown; yes |
| Bae & Bachman (2010), 317 | 13.27 (6.41–24.69) | Foreign; Korean & E | | Independent; general; analytic (Content and grammar) | Unknown; unknown |
| Barkaoui (2007), 16 | 5.01 for holistic^b | Foreign; unknown & E | | Independent; general; holistic and analytic (5 criteria) | Both; no |
| Writing p x t x r’ | | | | | |
| J. Huang (2008), 323 | 32.97 (26.55–39.25) | | | Both; academic; holistic | Unknown; maybe yes |
| Lee & Kantor (2005), 488 | | | | Both; both; holistic | Unknown; yes |
| Wang et al. (2008), 2677 | 18.32 | Second; varied & E | Published; univariate | Both; both; holistic | Native; maybe yes |
| J. Huang (2012), 154 | 4.84 (3.57–10.11) | | | Integrated; academic; analytic | Unknown; yes |

Note. Underlined = randomly selected and used for the synthesis in Tables 3 or 5. E = English. ^a This is reported to show that our random selection of one dataset among several within a single study did not considerably affect the results. ^b 37.58 for analytic.


Table 1 The Characteristics of the Datasets and Moderator Variables

| | Speaking (k = 28) | Writing (k = 22) |
|---|---|---|
| Number of examinees | | |
| Mean (SD) | 409.61 (658.63) | 308.14 (594.03) |
| Minimum–maximum | 6–2305 | 16–2677 |
| Number of tasks | (k = 20) | (k = 10) |
| Mean (SD) | 6.00 (4.12) | 2.16 (2.46) |
| Minimum–maximum | 2–12 | 1–12 |
| Number of raters | (k = 15) | (k = 15) |
| Mean (SD) | 14.29 (33.59)^a | 3.22 (1.96) |
| Minimum–maximum | 2–130^a | 2–10 |
| Number of ratings | (k = 9) | (k = 4) |
| Mean (SD) | 3.33 (3.62) | 2.50 (1.00) |
| Minimum–maximum | 2–12 | 2–4 |
| Number of scoring criteria | (k = 2) | (k = 8) |
| Mean (SD) | 13.50 (14.85) | 6.13 (3.72) |
| Minimum–maximum | 3–24 | 3–15 |
| Number of studies | | |
| G-study design: | | |
| Person x task | 4 (see Table 2) | 3^b |
| Person x rater | 3^c | 4^d |
| Person x rating | 3^e | 0 |
| Person x task x rater | 10 (see Appendix) | 3 (see Appendix) |
| Person x task x rating | 6 (see Appendix) | 4 (see Appendix) |
| Person x rater x criterion | 2^f | 8^g |
| L2 research context: Second/foreign language/both | 16, 8, 4 | 11, 8, 3 |
| Examinees’ L1: Varied/Japanese/Chinese/Arabic/others | 18, 3, 2, 0, 2 | 8, 5, 1, 2, 1 |
| Examinees’ L2: English/Spanish/others | 22, 2, 4 | 21, 0, 1 |
| Scoring method: Holistic vs. analytic | 18, 10 | 6, 16 |
| Publication status: Published/Dissertation/Unpublished | 16, 8, 4 | 14, 8, 0 |
| Type of G study: Uni- vs. multivariate | 21, 7 | 20, 2 |
| Task: Independent/integrated/both | (k = 20) 11, 4, 5 | (k = 10) 4, 2, 4 |
| General/academic/both/occupational | 6, 4, 9, 1 | 4, 2, 3, 1 |
| Rater: Native/nonnative/both | (k = 24) 9, 4, 4 | (k = 19) 5, 2, 1 |
| Training: Yes/Maybe yes^h/No | 13, 8, 3 | 10, 3, 4 |

Note. The totals do not always equal the total number of datasets since some did not report such information. k = the number of datasets. ^a Statistics excluding Banno (2008) are as follows: M = 5.38, SD = 4.56, range = 2–14. ^b (Abdul Kadir, 2008; Abeywickrama, 2008; Molloy & Shimura, 2005). ^c (Abdul Kadir, 2008; Grabowski, 2009; Stansfield & Kenyon, 1992a). ^d (Abdul Kadir, 2008; Abeywickrama, 2008; Alharby, 2006; Wang, 2010). ^e (Lee et al., 2001; Wang et al., 2008; Zhou, 2012). ^f (Kondo, 2010; Lynch & McNamara, 1998). ^g (Abdul Kadir, 2008; Alharby, 2006; Brown & Bailey, 1984; Eckes, 2011; Kinshi et al., 2011; Mizumoto, 2008; Ohkubo, 2006; Yamanishi, 2005). ^h Studies that did not mention rater training but probably used it, considering the standard rating procedures of the test (e.g., rating the TOEFL iBT tasks).


Table 2 The Datasets for the Person-by-Task Design in Speaking

| | person (p) | task (t) | p x t |
|---|---|---|---|
| Lee et al. (2001, N = 1766) | 72.76 | 0.69 | 26.55 |
| Lee (2005, N = 261) | 73.83 | 0.52 | 25.65 |
| Hirai & Koizumi (2008, N = 49) | 70.90 | 1.20 | 27.90 |
| Abdul Kadir (2008, N = 160) | 88.86 | 0.00 | 11.14 |
| Sample-size-weighted mean | 74.00 | 0.63 | 25.38 |
| SD | 8.27 | 0.50 | 7.84 |
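The summary rows of Table 2 can be recomputed from the four datasets listed in it. The following is a minimal sketch, not the authors' own procedure: it assumes that the sample-size-weighted mean weights each dataset's percentage by its N, and that the SD row is the unweighted sample standard deviation (n - 1 denominator) across datasets. The second assumption is an inference from the reported figures rather than something stated in the table. The names `table2`, `weighted_mean`, and `summary` are illustrative only.

```python
from statistics import stdev

# Rows of Table 2: (study, N, % of variance for p, t, and p x t).
table2 = [
    ("Lee et al. (2001)", 1766, 72.76, 0.69, 26.55),
    ("Lee (2005)", 261, 73.83, 0.52, 25.65),
    ("Hirai & Koizumi (2008)", 49, 70.90, 1.20, 27.90),
    ("Abdul Kadir (2008)", 160, 88.86, 0.00, 11.14),
]


def weighted_mean(values, weights):
    """Mean of `values`, each weighted by the matching entry in `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


sizes = [row[1] for row in table2]

# For each variance component, store (weighted mean, unweighted sample SD).
# Assumption: the SD is the plain sample SD across datasets, not a weighted SD.
summary = {}
for label, col in (("p", 2), ("t", 3), ("p x t", 4)):
    values = [row[col] for row in table2]
    summary[label] = (weighted_mean(values, sizes), stdev(values))

for label, (nm, sd) in summary.items():
    print(f"{label}: NM = {nm:.2f}, SD = {sd:.2f}")
```

Under these assumptions the recomputed values agree with the reported rows to within about 0.01, a difference that is plausibly due to rounding of the published percentages.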


Table 3 The Synthesized Percentages of the Variance Components for Speaking

Person (p) x task (t) (k = 4)

| | p | t | p x t |
|---|---|---|---|
| NM | 74.00 | 0.63 | 25.38 |
| SD | 8.27 | 0.50 | 7.84 |
| Min | 70.90 | 0.00 | 11.14 |
| Max | 88.86 | 1.20 | 27.90 |

Person (p) x rater (r) (k = 3)

| | p | r | p x r |
|---|---|---|---|
| NM | 86.16 | 3.32 | 10.51 |
| SD | 6.63 | 2.79 | 3.88 |
| Min | 80.15 | 0.10 | 6.61 |
| Max | 93.30 | 5.68 | 14.17 |

Person (p) x rating (r’) (k = 3)

| | p | r’ | p x r’ |
|---|---|---|---|
| NM | 85.68 | 1.15 | 13.17 |
| SD | 0.56 | 0.83 | 0.323 |
| Min | 85.48 | 0.00 | 12.86 |
| Max | 86.56 | 1.66 | 13.44 |

Person (p) x task (t) x rater (r) (k = 10)

| | p | t | r | p x t | p x r | t x r | p x t x r |
|---|---|---|---|---|---|---|---|
| NM | 67.74 | 2.06 | 1.91 | 9.78 | 1.85 | 1.05 | 15.61 |
| SD | 16.16 | 6.66 | 2.52 | 5.24 | 4.36 | 3.65 | 8.65 |
| Min | 33.35 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 4.40 |
| Max | 92.45 | 21.47 | 6.46 | 17.33 | 11.38 | 11.70 | 27.88 |

Person (p) x task (t) x rating (r’) (k = 6)

| | p | t | r’ | p x t | p x r’ | t x r’ | p x t x r’ |
|---|---|---|---|---|---|---|---|
| NM | 57.82 | 0.70 | 0.05 | 13.07 | 6.38 | 0.26 | 21.72 |
| SD | 19.91 | 0.72 | 0.05 | 8.49 | 4.27 | 0.38 | 11.03 |
| Min | 51.26 | 0.00 | 0.00 | 1.04 | 0.00 | 0.00 | 4.00 |
| Max | 92.00 | 1.84 | 0.12 | 21.97 | 10.35 | 0.92 | 31.72 |

Person (p) x rater (r) x scoring criterion (c) (k = 2)

| | p | r | c | p x r | p x c | r x c | p x r x c |
|---|---|---|---|---|---|---|---|
| NM | 40.78 | 12.13 | 4.09 | 10.57 | 4.99 | 2.52 | 24.90 |
| SD | 3.45 | 7.26 | 1.01 | 6.09 | 0.19 | 3.86 | 0.068 |
| Min | 38.18 | 6.67 | 3.42 | 6.53 | 4.85 | 0.00 | 24.85 |
| Max | 43.06 | 16.93 | 4.85 | 15.15 | 5.12 | 5.45 | 24.95 |

Note. k = Number of datasets; NM = Sample-size-weighted mean; Min = Minimum; Max = Maximum. Boldfaced and italicized figures are those related to research questions from our study. This note also applies to Tables 4 and 5.
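Because each NM row in Table 3 expresses variance components as percentages of the total score variance within one G-study design, the entries in a row should sum to roughly 100. The sketch below is a simple consistency check on the values copied from Table 3. The 0.05 tolerance for rounding is an arbitrary choice made here, and the names `nm_rows` and `totals` are illustrative.

```python
# Sample-size-weighted means (NM) copied from Table 3, one list per design.
nm_rows = {
    "p x t": [74.00, 0.63, 25.38],
    "p x r": [86.16, 3.32, 10.51],
    "p x r'": [85.68, 1.15, 13.17],
    "p x t x r": [67.74, 2.06, 1.91, 9.78, 1.85, 1.05, 15.61],
    "p x t x r'": [57.82, 0.70, 0.05, 13.07, 6.38, 0.26, 21.72],
    "p x r x c": [40.78, 12.13, 4.09, 10.57, 4.99, 2.52, 24.90],
}

# Total percentage accounted for in each design, rounded to two decimals.
totals = {design: round(sum(values), 2) for design, values in nm_rows.items()}

TOLERANCE = 0.05  # allowance for rounding in the published figures (assumed)
for design, total in totals.items():
    status = "ok" if abs(total - 100.0) <= TOLERANCE else "check"
    print(f"{design}: total = {total:.2f} ({status})")
```

A row that fell outside the tolerance would suggest a transcription or extraction error in that row rather than a substantive finding.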


Table 4 The Characteristics of Task Types, Contexts, and Scoring and Synthesized Percentages of the Variance Components of the Person-by-Task Interactions

| Design | | Task type: Independent | Task type: Integrated | Task type: Both | Context: General | Context: Academic | Context: Both | Scoring type: Holistic | Scoring type: Analytic |
|---|---|---|---|---|---|---|---|---|---|
| Speaking: Person (p) x task (t) x rater (r) | k | 8 | 0 | 2 | 4 | 2 | 4 | 7 | 3 |
| | NM | 6.41 | | 10.80 | 13.01 | 1.80 | 9.20 | 6.37 | 12.14 |
| | SD | 5.22 | | 5.00 | 6.32 | 1.71 | 4.15 | 2.36 | 4.98 |
| | Min | 0.06 | | 6.68 | 3.64 | 0.06 | 3.91 | 0.06 | 7.49 |
| | Max | 17.33 | | 13.75 | 17.33 | 2.48 | 13.75 | 6.80 | 17.33 |
| Speaking: Person (p) x task (t) x rating (r’) | k | (1) | 3 | 2 | (1) | 2 | 3 | 3 | 3 |
| | NM | (11.45) | 16.40 | 9.55 | (11.45) | 2.23 | 17.43 | 10.95 | 16.40 |
| | SD | NA | 11.5 | 9.53 | NA | 1.71 | 9.35 | 6.75 | 11.5 |
| | Min | (11.45) | 1.04 | 4.00 | (11.45) | 1.04 | 4.00 | 4.00 | 1.04 |
| | Max | (11.45) | 21.97 | 17.27 | (11.45) | 3.45 | 21.97 | 17.27 | 21.97 |
| Writing: Person (p) x task (t) x rater (r) | k | 2 | 0 | (1) | 2 | 0 | (1) | 2 | (1) |
| | NM | 12.87 | | (17.93) | 12.87 | | (17.93) | 16.77 | (13.27) |
| | SD | 5.36 | | NA | 5.36 | | NA | 7.82 | NA |
| | Min | 5.01 | | (17.93) | 5.01 | | (17.93) | 5.01 | (13.27) |
| | Max | 13.27 | | (17.93) | 13.27 | | (17.93) | 17.93 | (13.27) |
| Writing: Person (p) x task (t) x rating (r’) | k | 0 | (1) | 3 | 0 | 2 | 2 | 3 | (1) |
| | NM | | (4.84) | 19.91 | | 23.89 | 18.58 | 19.91 | (4.84) |
| | SD | | NA | 8.02 | | 19.89 | 1.19 | 8.02 | NA |
| | Min | | (4.84) | 18.32 | | 4.84 | 18.32 | 18.32 | (4.84) |
| | Max | | (4.84) | 32.97 | | 32.97 | 20.00 | 32.97 | (4.84) |

Note. When only one dataset was available, we bracketed it to show that we did not interpret it.


Table 5 The Synthesized Percentages of the Variance Components for Writing

Person (p) x task (t) (k = 3)

| | p | t | p x t |
|---|---|---|---|
| NM | 44.83 | 19.97 | 35.20 |
| SD | 39.55 | 26.05 | 21.48 |
| Min | 14.24 | 0.45 | 8.93 |
| Max | 90.61 | 50.75 | 51.54 |

Person (p) x rater (r) (k = 4)

| | p | r | p x r |
|---|---|---|---|
| NM | 82.62 | 4.99 | 12.40 |
| SD | 28.40 | 14.01 | 15.66 |
| Min | 28.42 | 0.05 | 3.60 |
| Max | 96.00 | 29.74 | 41.84 |

Person (p) x task (t) x rater (r) (k = 3)

| | p | t | r | p x t | p x r | t x r | p x t x r |
|---|---|---|---|---|---|---|---|
| NM | 64.09 | 5.49 | 1.13 | 14.53 | 1.27 | 0.50 | 13.01 |
| SD | 23.75 | 4.14 | 2.71 | 6.54 | 0.77 | 0.81 | 28.55 |
| Min | 26.23 | 0.00 | 0.77 | 5.01 | 0.00 | 0.00 | 6.63 |
| Max | 73.72 | 4.34 | 5.73 | 17.93 | 1.38 | 1.43 | 61.60 |

Person (p) x task (t) x rating (r’) (k = 4)

| | p | t | r’ | p x t | p x r’ | t x r’ | p x t x r’ |
|---|---|---|---|---|---|---|---|
| NM | 60.85 | 9.46 | 0.36 | 19.28 | 1.06 | 0.10 | 8.90 |
| SD | 13.83 | 6.42 | 1.14 | 9.53 | 15.97 | 0.00 | 12.67 |
| Min | 40.66 | 1.10 | 0.00 | 4.84 | 0.00 | 0.00 | 3.05 |
| Max | 67.94 | 10.77 | 3.30 | 32.97 | 22.58 | 1.10 | 29.23 |

Person (p) x rater (r) x scoring criterion (c) (k = 8)

| | p | r | c | p x r | p x c | r x c | p x r x c |
|---|---|---|---|---|---|---|---|
| NM | 16.24 | 21.15 | 3.77 | 10.75 | 4.83 | 3.46 | 39.80 |
| SD | 19.78 | 22.50 | 3.80 | 3.15 | 4.85 | 7.90 | 23.61 |
| Min | 0.15 | 0.00 | 0.19 | 7.16 | 0.09 | 0.00 | 9.14 |
| Max | 60.00 | 61.10 | 11.91 | 17.41 | 13.29 | 22.15 | 77.74 |
