Running head: TASK AND RATER EFFECTS 1
The final, definitive version of this paper has been published in Language Testing, 33(3), July
2016 published by SAGE Publishing, All rights reserved.
In’nami, Y., & Koizumi, R. (2016). Task and rater effects in L2 speaking and writing: A
synthesis of generalizability studies. Language Testing, 33, 341-366.
doi:10.1177/0265532215587390
(SAGE Publications, UK & USA)
[Task and rater effects in second language speaking and writing: A synthesis of studies
using generalizability theory]
[Dataset for the synthesis (information on studies included and excluded in the synthesis);
The same file is also found in IRIS (http://www.iris-database.org/iris/app/home/index)]
Task and rater effects in L2 speaking and writing: A synthesis of generalizability studies
Yo In’nami
Chuo University, Japan
Rie Koizumi
Juntendo University, Japan
Abstract
We addressed Deville and Chalhoub-Deville’s (2006), Schoonen’s (2012), and Xi and
Mollaun’s (2006) call for research into the contextual features that are considered related to
person-by-task interactions in the framework of generalizability theory in two ways. First, we
quantitatively synthesized the generalizability studies to determine the percentage of variation
in L2 speaking and L2 writing performance that was accounted for by tasks, raters, and their
interaction. Second, we examined the relationships between person-by-task interactions and
moderator variables. We used 28 datasets from 21 studies for L2 speaking, and 22 datasets
from 17 studies for L2 writing. Across modalities, most of the score variation was explained
by examinees’ performance; the interaction effects of tasks or raters were greater than the
independent effects of tasks or raters. Task and task-related interaction effects explained a
greater percentage of the score variances than did the rater and rater-related interaction
effects. The variances associated with the person-by-task interactions were larger for
assessments based on both general and academic contexts than for those based only on
academic contexts. Further, large person-by-task interactions were related to analytic scoring
and scoring criteria with task-specific language features. These findings derived from L2
speaking studies indicate that contexts, scoring methods, and scoring criteria might lead to
varied performance over tasks. Consequently, constructs need to be defined with particular
care.
Keywords
generalizability theory, research synthesis, L2 speaking, L2 writing, task, rater
Corresponding author
Yo In’nami, Division of English Language Education, Faculty of Science and Engineering,
Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan.
E-mail: [email protected]
1 Introduction
Assessing speaking and writing proficiency in performance testing presents numerous
challenges because, in addition to the examinees’ speaking and writing proficiency, many
variables are involved in the assessment process, including variability in assessment tasks and
rater judgments. Speaking and writing tasks produce performance variability due to a number
of variables such as task characteristics, topics, and prompts (e.g., Lee, Breland, & Muraki,
2004; Lumley & O’Sullivan, 2005; Skehan, 1998). Further, even after raters have been
trained to apply the same scoring criteria and level of severity consistently, raters exhibit
variations in severity and consistency (e.g., Weigle, 1998; Knoch, 2011). The impact of tasks
and raters has been widely investigated (e.g., Grabowski, 2009; Barkaoui, 2007) since
examinees’ performance on tasks is evaluated by raters, and subsequent scores form the basis
for all analysis and interpretation. Task and rater impacts have been examined mainly through
the many-facet Rasch measurement (e.g., Eckes, 2011; Weigle, 1998) and generalizability (G)
theory (e.g., Gebril, 2009; Lee, 2005). While the former enables the adjustment of scores and
investigations of differences in task difficulty, rater severity, and consistency, G theory can
decompose the score variances into those affected by the numerous factors and their
interactions. While an individual generalizability study provides valuable information on
variations in performance in the study’s context, a synthesis of previous generalizability
studies offers deep insights into overall trends across studies, which can be informative for
test developers and researchers. Such a synthesis is particularly valuable given that many
variables affect each other in intricate ways in speaking and writing assessments (e.g.,
Brennan, 1996; Xi & Mollaun, 2006). Moreover, a synthesis allows for a systematic
examination of the effects of variables by classifying them according to moderator (i.e.,
contextual) variables, even when such effects were not examined in the original studies.
Using this method, we will examine the magnitude of the impact of tasks and raters on L2
speaking and writing performance.
2 Literature Review
2.1 Task and Rater Effects in L2 Speaking and Writing in G Theory
G theory is a flexible, statistical framework for the systematic investigation of the
score variability in instruments under specific conditions by considering multiple sources of
error (Shavelson & Webb, 1991). G theory allows us to investigate, in a single analysis, the
independent and interactive effects of various factors such as examinees (persons), raters, and
items/tasks. For example, it is possible to determine what percentage of the variance of test
scores is due to the factors associated with persons or person-by-task interactions. Many
studies have used G theory to examine task and rater effects in L2 speaking (e.g., Stansfield
& Kenyon, 1992a, 1992b) and L2 writing research (e.g., Barkaoui, 2007; Wang, 2010). For
example, Barkaoui (2007) examined the impact of rating scale type (holistic vs. multiple
traits) on the L2 writing performance ratings of untrained raters. With regard to the holistic
scale, most of the variance was attributable to the person-by-task-by-rater interactions
(61.59%), followed by persons (26.23%), raters (5.73%), person-by-task interactions (5.01%),
and task-by-rater interactions (1.43%), with no impact for tasks and person-by-rater
interactions (both 0.00%).¹ The results from averaged multiple traits scales showed that the
largest amount of the variance was explained by raters (58.18%), followed by person-by-task-
by-rater interactions (37.58%), and persons (4.24%). Variances attributable to tasks and
person-by-task, person-by-rater, and task-by-rater interactions were negligible (0.00% for
each of these effects). This indicates that holistic scoring produces more reliable results than
multiple trait scoring when raters are not trained.
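The decomposition described above can be written out for a fully crossed, random-effects person-by-task-by-rater design. The notation below is a sketch in the style commonly used in the G theory literature (e.g., Shavelson & Webb, 1991); the symbols are not taken from this article and are assumptions made for illustration.

```latex
% Total observed-score variance for a crossed p x t x r random design
\sigma^2(X_{ptr}) = \sigma^2_{p} + \sigma^2_{t} + \sigma^2_{r}
                  + \sigma^2_{pt} + \sigma^2_{pr} + \sigma^2_{tr}
                  + \sigma^2_{ptr,e}

% Percentage of variance attributed to any one effect \alpha
\%\,\sigma^2_{\alpha} = 100 \times \frac{\sigma^2_{\alpha}}{\sigma^2(X_{ptr})}
```

Under this reading, the seven percentages quoted for the holistic scale in Barkaoui (2007) correspond to the seven terms on the right-hand side, and they sum to 99.99%, with the remainder due to rounding.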
Previous generalizability studies showed that person-by-task interaction effects are
relatively large in general (C. Huang, 2009) in addition to large person effects. However, it is
still unclear to what extent task- and rater-related effects explain the score variance, and
which effects—task and task-related interaction effects, or rater and rater-related interaction
effects—explain more of the score variances (which leads to Research Question 1).
2.2 Person-by-Task Interactions
The relatively large person-by-task effects observed in previous studies (C. Huang,
2009) indicate that examinees tend to perform better or worse depending on the tasks. Thus,
the quality of the examinees’ performances is inconsistent, and is rank-ordered differently for
different tasks. The factors involved in these performance variabilities may include curricula,
textbooks, teachers, learning conditions, and individual differences in learning styles and
personality (Brown, 2011).
Footnote 1: All of the percentages of the variance components reported here were calculated based on the variance
components reported in Barkaoui (2007). See Section 3.3 Analyses for details.
While the person-by-task effects have been viewed as an undesirable source of score
variability, Deville and Chalhoub-Deville (2006) take the issue further and argue that these
interactions are related to the construct, and suggest the need to refine our construct definition
(for construct definition, see Bachman, 2007; Chapelle, 1998; Xi, 2015). Schoonen (2012)
expands this line of discussion and argues that if we pay more attention to person-by-task
interactions, we could define a construct in ways that do not produce large person-by-task
effects. For example, he hypothesized that if writing a narrative is a construct different from
writing a letter of application, a generalizability study that only includes tasks for the latter
should not produce large person-by-task effects. If tasks for both constructs are included,
results may show a larger person-by-task effect because the two domains are different, and
inconsistent performance is expected. To understand the person-by-task effects better, Deville
and Chalhoub-Deville (2006) and Schoonen call for research into the contextual features that
relate to score variability, including person-by-task effects.
Xi and Mollaun (2006) suggest two possible contextual features affecting the degree
of person-by-task interactions: (a) highly contextualized task characteristics, such as task
types (independent vs. integrated) and contexts (general vs. academic), and (b) types of rating
scales (analytic vs. holistic). They speculated that the person-by-task interactions would be
larger for assessments using both independent and integrated tasks, and those based on both
general and academic contexts, than for those using either one of the task types or based on
any one context. By comparing their results with Lee’s (2005) findings, they also speculated
that task-specific features in analytic scoring are likely to contribute to larger person-by-task
interactions than those in holistic scoring, because these features in holistic scoring are
“considered globally with other more stable features,” and produce smaller person-by-task
interactions than analytic scoring does (Xi & Mollaun, 2006, p. 40). In addition, Fulcher (2003)
argued that very different tasks and task-specific rating scales produce large task-related
variances. However, as these are only speculations, Xi and Mollaun called for research into
variables related to person-by-task interactions (which leads to Research Question 2).
We believe that this question is particularly well suited to a research synthesis (In’nami
& Koizumi, 2014; Norris & Ortega, 2006) in which the contextual variables contributing to
person-by-task effects are more likely to be identified using systematically collected previous
studies. Researchers may obtain useful insights into the contextual variables related to a
universe of generalization, and a general trend. In this context, we are aware of two studies
synthesizing sources of variability in L2 generalizability studies.
C. Huang (2009) meta-analyzed the generalizability of the performance assessments
in education and psychology and reported, on average, a negligible effect of task, and a large
person-by-task interaction in L2 learning (k = 9; 3.65% and 15.06%) and L1 writing (k = 11;
2.99% and 27.46%). Since he focused on task and person-by-task interactions, rater facets (as
well as persons) were not coded, and their relative effects were unknown. Huang also showed
that the variances of the person-by-task interactions were associated with variables such as
scoring methods. For instance, holistic scoring produced a greater amount of variance than
analytic scoring did, which contrasts with the findings of Xi and Mollaun (2006). As Huang
focused on summarizing studies in education and psychology, he combined all the L2 studies
that he collected and did not classify them according to skills, which may have led to the
diverse results obtained.
Brown (2011) also synthesized L2 generalizability studies and reported that person-
by-task interactions explained from 0.45% to 39.06% of the variance in L2 performance tests,
exceeding the person-by-rater interactions in most (k = 10/13, 76.92%) of the studies. The
percentages of variance attributable to persons, tasks, and raters were 1.45% to 86.96%,
0.30% to 25.10%, and 0.00% to 61.10%, respectively. Although rater effects generally seem
to be more influential than task effects, their relative effects differ across studies. Brown
covered a wide range of areas of L2 testing, including the four skills, as well as multiple-
choice and performance tests. His method, however, seems to have some room for
improvement for four reasons: A comprehensive search for existing studies was not
conducted; studies such as those reporting only D studies were not included; the results were
not aggregated numerically; and studies across random and fixed effects designs were mixed.
The relatively large person-by-task interactions from these two studies indicate a
limited degree of across-task generalizability of the examinees’ performance. What remains
unknown are the relative effects of task, rater, and their interactions in L2 speaking and L2
writing and, more importantly, the relationship between person-by-task interactions and the
moderator variables that previous studies deemed worth investigating. Such an investigation
is likely to benefit from more systematically collected previous studies.
2.3 Current Study
We examined generalizability studies to quantitatively synthesize the percentage of
variation in L2 speaking and writing performance that is accounted for by tasks, raters, and
their interactions. Tasks and raters are two widely studied variables of paramount interest and
importance to language testers. We also examined the relationships between person-by-task
interactions, which previous studies suggest are relatively large, and moderator variables of
interest based on previous studies (i.e., task types, contexts, and scoring methods). We
examined two research questions in L2 speaking and writing:
1. Which effects—task and task-related interaction effects or rater and rater-related
interaction effects—explained more of the score variances?
2. Are the degrees of the person-by-task interactions related to task types, contexts, and
scoring methods?
Our synthesis of previous studies differs from C. Huang (2009) and Brown (2011) on two
main points: (1) based on searching methods of research synthesis (In’nami & Koizumi, 2010;
Oswald & Plonsky, 2010), we collected previous L2 speaking and writing studies in a more
systematic manner; and (2) we examined a general trend in the effects of persons, tasks, raters,
and moderator variables, while classifying the results by G-study design and L2 skills.
3 Method
3.1 Data Collection
To find potential generalizability studies in language testing and learning, we
conducted an extensive search in May 2012, using three methods. First, we followed a
published method (In’nami & Koizumi, 2009) and retrieved studies through computer
searches on databases: the Educational Resources Information Center (ERIC), FirstSearch,
Google, Linguistics and Language Behavior Abstracts (LLBA), MLA International
Bibliography, ProQuest, PsycINFO, ScienceDirect, Scopus, and Web of Science. We used the
following keywords: generalizability/generalisability/g theory,
generalizability/generalisability/g study/studies, performance/alternative/authentic
assessment, task, and rater. This list of keywords was constructed based on the keywords and
synonyms retrieved from the thesauruses supplied in databases, books and articles reviewed,
authors’ experiences, and feedback from colleagues. Abstract, title, and article keyword
searches were used. No date range restriction was imposed.
Second, books and journals in language testing, second language acquisition, and
educational measurement were reviewed. The books were those listed in In’nami and
Koizumi (2009), with recent additions: Bachman and Palmer (2010), Fulcher and Davidson
(2012), the ILTA Bibliography of Language Testing (Brown, 2012), the ILTA Bibliography
of PhDs in Language Testing (2011), Shohamy and Hornberger (2010), and the Cambridge
Applied Linguistics series (e.g., Assessing Speaking by Luoma, 2004). The books in second
language acquisition primarily included Ellis (2008), Ortega (2008), and Robinson (2012).
Different editions of the same book were also checked. We also included 26 journals listed in
In’nami and Koizumi (2009) with the following additions: Applied Language Learning,
Assessing Writing, Foreign Language Annals, International Journal of Applied Linguistics,
International Review of Applied Linguistics in Language Teaching, Language Learning &
Technology, and Language Teaching Research.
Third, relevant studies were sought through communication with other
researchers. In each of the three approaches, the reference list of every paper and chapter,
both published and unpublished, was scrutinized for additional relevant materials.
3.2 Criteria for the Inclusion of a Study
The literature search retrieved approximately 650 studies. Their titles, abstracts, and
study descriptors were inspected to check if they met the following criteria: (a) the study used
G theory, and (b) the test was designed to elicit a certain length of L2 self-created speaking
and/or writing performance (i.e., one or more sentences). Studies that used only reading-
aloud tasks or L1-to-L2 translation tasks were excluded. When two papers used the same data
and designs, we selected the one with more information (e.g., we selected Xi & Mollaun,
2006 instead of Xi, 2007). At this stage, we narrowed down the studies to 45. A sample of
22.22% (n = 10) of the 45 studies was independently examined by both authors to determine
if they met the abovementioned criteria. The agreement rate was 95%, and the kappa
coefficient was 0.85. Disagreement was resolved through discussion. The remaining studies
were examined by the first author.
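The two agreement indices reported above can be computed as in the following minimal sketch. It assumes that the kappa reported is Cohen's kappa for two coders; the include/exclude decisions are hypothetical values made up for illustration, not the authors' coding data, and the function names are likewise illustrative.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Proportion of items on which the two coders made the same decision."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(coder_a)
    p_observed = percent_agreement(coder_a, coder_b)
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    categories = set(coder_a) | set(coder_b)
    # Chance agreement: product of the coders' marginal proportions per category.
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical decisions for ten studies (illustration only).
coder_a = ["in", "in", "out", "in", "out", "out", "in", "in", "out", "in"]
coder_b = ["in", "in", "out", "in", "out", "in", "in", "in", "out", "in"]

agreement = percent_agreement(coder_a, coder_b)
kappa = cohens_kappa(coder_a, coder_b)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw agreement here because part of the observed agreement would be expected by chance given each coder's marginal proportions. The sketch does not guard against the degenerate case in which chance agreement equals 1.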
The studies were further inspected to determine whether (a) persons were modeled as the object of
measurement; (b) all facets of measurement in a study (e.g., tasks, raters, etc.) were modeled
as random, not fixed, because fixed models do not have the aim of generalizing beyond the
condition of each facet (thus, Schoonen, 2005, for example, was excluded); (c) the variance
components or the percentage of variances explained for each facet for a G-study (or a D-
study with the number of each facet) was reported; and (d) moderator variables were reported,
including a G-study design (e.g., person-by-task, person-by-rater; crossed or nested;
univariate or multivariate), skill assessed (i.e., speaking and/or writing), and scoring methods
(i.e., holistic or analytic [including multiple traits]). We defined the task broadly, and included
direct and semi-direct formats. Task types and contexts were judged by examining sample
tasks and/or their descriptions provided in the paper. Integrated tasks were defined as tasks
with input texts that excluded task instructions, and academic tasks were defined as those
with topics/contents related to university studies. Table 1 shows the list of moderator
variables coded (see lines below Number of studies). When necessary, every effort was made
to contact the authors to ask for details. A sample of 22.22% (n = 10) of the 45 studies was
separately examined by both authors for the four elements mentioned above. The
agreement rate was 98%, and the kappa coefficient was 0.90. Disagreement was
resolved through discussion. The first author investigated the remaining studies. Only the
studies that met all four conditions were used for the analysis. We did not include Van Moere
(2006) and Bolus, Hinofotis, and Bailey (1982) because they used different designs from the
remaining datasets and only one dataset was provided for each design (i.e., person-by-rating-
by-occasion and person-by-rater-by-occasion designs, respectively); these factors precluded a
synthesis of the results. For the same reason, we did not include studies with nested facets
(e.g., Fulcher, 1993). Results from studies that used many-facet Rasch measurement were
excluded because they provided variances explained by persons, tasks, and raters, but not
those explained by task- and rater-related interactions (see Linacre, 2013). Included in our
synthesis were 28 datasets from 21 L2 speaking studies, and 22 datasets from 17 L2 writing
studies (50 datasets from 36 studies, since two studies had both speaking and writing
analyses). Those 36 studies are marked with an asterisk (*) in the references section.
[Insert Table 1 about here]
These 36 studies had one or more of the following facets: task, rater, rating, and
scoring criterion. A facet of rating is used as a proxy for a rater, for example, when a response
from a single examinee is scored by multiple raters (e.g., Raters A and B), and another
response of the same examinee by multiple, different raters (e.g., Raters C and D); ratings
from Raters A and C are considered as Rating 1, as if both raters are exchangeable, and
ratings from Raters B and D are considered as Rating 2, as if both raters are exchangeable.
Subsequently, they are analyzed as a facet of rating (see Brown, 2011; Lee & Kantor, 2005).
The rating method is conceptually simpler because it assumes that scores would be similar across
raters (Lin, 2014). Scoring criteria refer to the criteria in analytic scales, such as fluency and
accuracy.
3.3 Analyses
The analyses consisted of three stages. First, we used G-study design as a unit of
analysis. For example, when the results were reported in one article for person-by-task and
person-by-rater designs, both were coded. The results were not combined across designs or
skills (i.e., speaking and writing) because C. Huang (2009) reported that across-design and
across-skill aggregations affect the magnitude of variance components, and make the results
difficult to interpret. When multiple results from the same design were reported in one article,
one of the results was randomly selected. When results from proficiency-classified and
whole-level data were both reported, the whole-level data were used, since they are
considered more representative. Since we combined studies according to G-study design, in
which each dataset contributed only once, data dependency effects were kept to a minimum.
Second, for each dataset, we coded the values of variance components from the
studies. When the reported values were from D studies, we multiplied the values by the
number of levels (e.g., the person-by-task-by-rater variance component was multiplied by the
number of tasks and raters). Thus, the values in our synthesis are those for a single level of each facet (e.g.,
per task and per rater), and the effects of the number of levels in each facet (e.g., the number of
tasks and raters) are controlled. For standardization, we then calculated the percentage of
variance component for each facet. We calculated sample-size-weighted means and based our
interpretations on them because they are more precise than arithmetic means (e.g., C. Huang,
2009; Lipsey & Wilson, 2001). We also computed standard deviations (SDs) of these
percentages across datasets that employed the same design. We restricted our analysis to
descriptive statistics. Inferential statistics (e.g., homogeneity Q or I² statistics) were not used
because of the small number of studies. We did not calculate confidence intervals for the
means because our datasets did not satisfy the normality assumption that is required to
compute standard errors (C. Huang, 2009). Moreover, confidence intervals were not
calculated because we obtained a small number of datasets, which hampered the use of
resampling methods.
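The rescaling and standardization steps described in this paragraph can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' script: the effect labels, the function names, and every numeric value are hypothetical, and the D study is assumed to report each component already divided by the number of levels of the facets it involves.

```python
def d_to_g_components(d_components, n_levels):
    """Rescale D-study variance components to single-level (G-study) values.

    d_components maps an effect label such as "p", "t", or "p:t:r" to its
    D-study variance component. n_levels maps each facet other than persons
    to the number of levels assumed in that D study.
    """
    g_components = {}
    for effect, value in d_components.items():
        multiplier = 1
        for facet in effect.split(":"):
            # Persons are the object of measurement, so "p" is never rescaled.
            multiplier *= n_levels.get(facet, 1)
        g_components[effect] = value * multiplier
    return g_components


def to_percentages(components):
    """Express each variance component as a percentage of the total variance."""
    total = sum(components.values())
    return {effect: 100 * value / total for effect, value in components.items()}


# Hypothetical D-study values for 2 tasks and 2 raters (illustration only).
d_study = {"p": 0.50, "t": 0.01, "r": 0.02, "p:t": 0.05,
           "p:r": 0.02, "t:r": 0.00, "p:t:r": 0.04}
g_study = d_to_g_components(d_study, {"t": 2, "r": 2})
percentages = to_percentages(g_study)

for effect, pct in percentages.items():
    print(f"{effect:6s} {pct:6.2f}%")
```

In this sketch the three-way component is multiplied by both the number of tasks and the number of raters, which matches the example given in the text, while the person component is left unchanged.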
Table 2 shows an example of synthesis of the variance components using the person-
by-task design datasets in speaking. We found four studies and calculated the percentage of
variance for each variance component, for each dataset, and then calculated the descriptive
statistics. For example, the sample-size-weighted mean for person was 74.00% ([72.76 ×
1766 + 73.83 × 261 + 70.90 × 49 + 88.86 × 160]/[1766 + 261 + 49 + 160]), and the SD was
8.27%.
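The worked example above can be reproduced with a few lines of code. The percentages and sample sizes are those quoted in the text. Treating the SD as an unweighted sample SD (n - 1 denominator) is an assumption; it is not stated in the text, although it does reproduce the reported value.

```python
from statistics import stdev

# Person variance percentages and sample sizes quoted in the text.
person_pct = [72.76, 73.83, 70.90, 88.86]
sample_sizes = [1766, 261, 49, 160]


def weighted_mean(values, weights):
    """Mean of values, with each value weighted by its study's sample size."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


mean_pct = weighted_mean(person_pct, sample_sizes)
# Assumption: unweighted sample SD across the four datasets.
sd_pct = stdev(person_pct)

print(f"weighted mean = {mean_pct:.2f}%, SD = {sd_pct:.2f}%")
```

The largest study (n = 1766) pulls the weighted mean toward its own value of 72.76%, which is why the result sits below the unweighted mean of the four percentages.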
[Insert Table 2 about here]
Third, for person-by-task interactions, we computed means and SDs of the percentages
of variance components for each moderator variable—task types, contexts, and scoring—to
examine the relationship between these percentages and moderator variables.
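A moderator summary of this kind can be sketched as a simple grouping step. The records, field names, and function name below are hypothetical, and the sketch uses unweighted means because the text does not state whether the moderator means were sample-size weighted.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_moderator(datasets, moderator):
    """Group person-by-task percentages by one moderator and summarize each level."""
    groups = defaultdict(list)
    for record in datasets:
        groups[record[moderator]].append(record["pt_pct"])
    summary = {}
    for level, values in groups.items():
        # An SD needs at least two datasets; report None otherwise.
        sd = stdev(values) if len(values) > 1 else None
        summary[level] = {"k": len(values), "mean": mean(values), "sd": sd}
    return summary


# Hypothetical datasets (illustration only, not values from the synthesis).
datasets = [
    {"scoring": "analytic", "context": "general", "pt_pct": 12.0},
    {"scoring": "analytic", "context": "academic", "pt_pct": 16.0},
    {"scoring": "holistic", "context": "academic", "pt_pct": 6.0},
]
summary = summarize_by_moderator(datasets, "scoring")
print(summary)
```

Reporting the number of datasets (k) alongside each mean makes it easy to apply a rule such as interpreting only levels with two or more datasets.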
4 Results
4.1 Characteristics and Examples of the Datasets
The characteristics of the datasets and moderator or contextual variables considered to
affect the percentage of variance components are summarized in Table 1. The sample size
varied for speaking and writing, which was reflected in the large SDs and minimum-
maximum ranges. More tasks and raters were used in speaking than in writing, particularly
because of the large-scale speaking studies such as Banno (2008) and Lee, Golub-Smith,
Payton, and Carey (2001).² The number of scoring criteria was larger for speaking studies
because Kondo (2010) included 24 criteria as a scoring criterion facet. While not shown in
Table 1, examples of speaking tasks included TOEFL iBT independent and integrated tasks
(Lee, 2005) and integrated, role-play tasks (Sawaki, 2007). Examples of writing tasks
included independent, letter- and story-writing tasks (Bae & Bachman, 2010) and
independent, argumentative essay writing tasks (Barkaoui, 2007). For rater type and the
presence of rater training, see the rightmost column in the Appendix.
Footnote 2: For example, Banno (2008) used 130 raters to examine rater training effects and differences
in the ratings of teachers versus non-teachers. When Banno was excluded, the average number of
raters shrank to 5.38, a more typical number of raters used in such studies (see Table 1 Note
a).
4.2 L2 Speaking
4.2.1 Percentages of Variance Component Explained by Each Facet
Table 3 shows the percentage of variance explained by each facet and
interaction for speaking performance when one task, one rater, and/or one scoring criterion
were used. Large values are preferable for persons, while small values are usually preferable
for all other facets. The variance components in the person-by-task design were mainly
attributable to persons (74.00%) and person-by-task interactions with undifferentiated errors
(25.38%), while the remaining small percentage of variance was explained by tasks (0.63%).
Larger SDs for persons (8.27) and person-by-task interactions (7.84) than for tasks (0.50)
suggest more variability in the former sources of variance. Please note that each design models
different sources of variance, and is not directly comparable unless the designs model the
same facet (e.g., the person-by-task design and the person-by-rater design are comparable only in
terms of persons). Results were similar for the person-by-rater and person-by-rating designs,
both of which showed that the variances were substantially attributable to persons rather than
the person-by-rater or person-by-rating interactions. Tasks and raters (or ratings) explained a
small percentage of variance (0.63%–3.32%), suggesting that tasks were of similar difficulty
and that raters rated similarly.
[Insert Table 3 about here]
More complete pictures emerged from the remaining designs that operationalize both
task and rater facets—person-by-task-by-rater, person-by-task-by-rating, and person-by-rater-
by-criterion designs. Approximately 40% to 70% of the variance, across the three designs,
was attributable to persons. Raters contributed a noticeable percentage of the variance
for the person-by-rater-by-criterion design (12.13%). This was rather surprising since the
raters or ratings were responsible for only a small amount of the total variance for the other
two designs that had a rater facet (0.05% and 1.91%).
Additionally, we observed noticeable interaction effects, including the person-by-task
interaction for the person-by-task-by-rater and person-by-task-by-rating designs (9.78% and
13.07%, respectively) and the person-by-rater interaction for the person-by-rater-by-criterion
design (10.57%). The three-way interactions including the undifferentiated errors explained
15.61% to 24.90% of the variances. Thus, rather than the tasks or raters per se, the
interaction effects involving persons, tasks, and raters contributed to the score variances.
When we compared task and rater effects in the person-by-task-by-rater and person-
by-task-by-rating designs, we found, on average, that the task effects were larger than the
rater (or rating) effects in the person-by-task-by-rater (2.06% vs. 1.91%) and person-by-task-
by-rating (0.70% vs. 0.05%; both are negligibly small) designs. Further, the person-by-task
interaction effects were larger than the person-by-rater (or rating) interaction effects (9.78%
vs. 1.85%; 13.07% vs. 6.38%).
The person-by-rater-by-criterion design was different from the other designs with
reference to the noticeably large rater and person-by-rater effects observed. In this design, the
rater and rater-related interaction effects contributed more to the score variances than the
scoring criterion and criterion-related interaction effects (12.13% vs. 4.09%; 10.57% vs.
4.99%). The distinctive pattern of larger rater-related effects in this person-by-rater-by-
criterion design than in the other designs deserves attention and is interpreted in the
Discussion section.
4.2.2 Results From Moderator Variable Analysis
To identify possible causes of large variations reported in the previous section, we
conducted a moderator variable analysis on the person-by-task interactions for the person-by-
task-by-rater and the person-by-task-by-rating designs, both of which could separate person-
by-task interactions from undifferentiated errors, unlike the person-by-task design. We
classified results according to three moderator variables: task types, contexts, and scoring
methods. We interpreted the results with two or more datasets (see Table 4).
[Insert Table 4 about here]
Regarding the person-by-task-by-rater design, Table 4 indicates that the variances
explained by the person-by-task interactions were, overall, larger for datasets using both
independent and integrated tasks (10.80%) than for those using independent tasks only
(6.41%). They were also larger for datasets based on the general context (13.01%) than for
those based on both general and academic contexts (9.20%) or only on academic contexts
(1.80%). Further, the interactions were larger for analytic scoring (12.14%) than for holistic
scoring (6.37%). In contrast, for the person-by-task-by-rating design, these interactions were
larger for the datasets that used only integrated tasks (16.40%) than for those using both
independent and integrated tasks (9.55%), and larger for those based on both general and
academic contexts (17.43%) than for those based on an academic context (2.23%). Further,
analytic scoring (16.40%) seems to be associated with larger person-by-task interactions
than holistic scoring (10.95%).
Other than task types, contexts, and scoring types, we found another factor that may
explain large person-by-task interactions: task-specificity in analytic scoring criteria. Two
studies containing analytic scoring criteria with task-specific features (Grabowski, 2009; H.-J.
Kim, 2009) used instruments for assessing pragmatic competence and showed the largest
(17.33%) and third largest (7.49%) person-by-task interactions in the person-by-task-by-rater
design among the datasets used for this study (see Appendix). Note that we
randomly selected one dataset from each study to avoid dependency; if we considered all
datasets from the two studies, they had the largest and second largest person-by-task
interactions (63.30% and 38.16%, respectively). Grabowski (2009) used five analytic scoring
criteria characterized by task-specific language features: sociocultural appropriateness (e.g.,
metaphor; 63.30% of the variance explained by the person-by-task interactions),
psychological appropriateness (e.g., sarcasm, irony, anger; 57.65%), sociolinguistic
appropriateness (e.g., age, status, power, register; 47.64%), grammatical meaningfulness
(28.14%), and grammatical accuracy (17.33%). Similarly, H.-J. Kim (2009) used six analytic
scoring criteria: sociolinguistic competence (i.e., social appropriateness; 15.91%–38.16%),
task completion (12.75%–26.43%), meaningfulness (10.59%–19.47%), discourse competence
(6.70%–15.14%), grammatical competence (6.28%–12.90%), and intelligibility (4.44%–
10.92%). The larger person-by-task interactions in these two studies seem to be related to
the use of analytic scoring criteria with task-specific features.
4.2.3 Summary in L2 Speaking
We compared task and rater effects across studies, and observed the following:
• Most of the score variation reflected differences in the examinees’ performance in
all the designs (40.78%–86.16%).
• Task effects were larger than rater (or rating) effects in the person-by-task-by-
rater (2.06% vs. 1.91%) and person-by-task-by-rating (0.70% vs. 0.05%)
designs.
• The person-by-task interaction effects were larger than the person-by-rater (or
rating) interaction effects (9.78% vs. 1.85%; 13.07% vs. 6.38%).
• These interaction effects made larger contributions to the score variances when
compared with tasks or raters per se.
• Two common findings across the designs were the following: (1) a larger
person-by-task interaction was related to assessments using both general and
academic contexts than to those using academic contexts only; and (2) a larger
person-by-task interaction was related to analytic scoring than to holistic scoring.
More specifically, a larger person-by-task interaction was related to analytic
scoring with task-specific features assessing pragmatic competence.
4.3 L2 Writing
4.3.1 Percentages of Variance Component Explained by Each Facet
Table 5 shows that most of the variance components in the person-by-task design
were roughly equally attributable to persons (44.83%) and person-by-task interactions with
undifferentiated errors (35.20%), whereas a smaller percentage of the variance was explained
by tasks (19.97%). Similar results were found for the person-by-rater design.
[Insert Table 5 about here]
The results of the three-way designs were also consistent. Regardless of the designs,
overall, the variances were explained by persons, two-way interactions (e.g., person-by-task
or person-by-rater), and three-way interactions with undifferentiated errors (i.e., person-by-
task-by-rater, person-by-task-by-rating, and person-by-rater-by-criterion). For example, for
the person-by-task-by-rater design, the variance components were primarily attributable to
persons (64.09%), person-by-task interactions (14.53%), and person-by-task-by-rater
interactions (13.01%). Additionally, the variance components for the task were noticeable for
the person-by-task-by-rater and person-by-task-by-rating designs (5.49% and 9.46%,
respectively), as were the variance components for the rater for the person-by-rater-by-
criterion design (21.15%).
When we compared task and rater effects in the person-by-task-by-rater and person-
by-task-by-rating designs, we found task effects were larger than rater (or rating) effects in
the person-by-task-by-rater (5.49% vs. 1.13%) and person-by-task-by-rating (9.46% vs.
0.36%) designs. In addition, the person-by-task interaction effects were larger than the
person-by-rater (or rating) interaction effects (14.53% vs. 1.27%; 19.28% vs. 1.06%). Thus,
similar to the results for L2 speaking, the interaction effects involving tasks contributed more
to the score variances than did those involving raters. Unlike L2 speaking, however, the task facet also
contributed to the variances in the two designs. One unique feature of the person-by-rater-by-
criterion design was a noticeable rater effect and a person-by-rater effect, which was
consistent across L2 speaking and writing.
4.3.2 Results From Moderator Variable Analysis
Careful scrutiny of the relationship between the person-by-task interactions and study
characteristics (Table 4) was difficult due to the small number of datasets. When we restricted
interpretation to results with two or more datasets, the results for the person-by-task-by-rating design suggest
that the person-by-task interactions were larger for the datasets based only on academic
contexts (23.89%), as compared to those based on both general and academic contexts
(18.58%).
4.3.3 Summary in L2 Writing
We compared the task and rater effects across studies and found the following:
• Most of the score variation reflected differences in the examinees’ performance in
all the designs except for the person-by-rater-by-criterion design (44.83%–
82.62%).
• The task effects were larger than rater (or rating) effects in the person-by-task-
by-rater (5.49% vs. 1.13%) and person-by-task-by-rating (9.46% vs. 0.36%)
designs.
• The person-by-task interaction effects were larger than the person-by-rater (or
rating) interaction effects (14.53% vs. 1.27%; 19.28% vs. 1.06%).
• These interaction effects made larger contributions to the score variances as
compared with tasks or raters per se.
• In one design, the person-by-task interactions were larger for the datasets based
only on academic contexts, rather than for those based on both general and
academic contexts.
5 Discussion
Research Question 1: Which effects—task and task-related interaction effects or rater
and rater-related interaction effects—explained more of the score variances?
On average, the variance components were primarily explained by persons across L2
speaking and writing, although the relatively large SDs indicated varied effects of persons
across studies. The interaction effects related to tasks, raters, ratings, and scoring criteria
followed this. However, exceptions were observed for the person-by-rater-by-criterion design.
Specifically for this design, we observed smaller person variations and larger variations due
to effects of raters and person-by-rater interactions. One reason for this may be that a facet of
scoring criterion is not very typical because “a facet is simply a set of similar conditions of
measurement” (Brennan, 2001, p. 5), and different scoring criteria are primarily used to
assess different aspects of speaking and writing; thus, the person-by-rater-by-criterion design
may need to be analyzed as the person-by-rater design with scoring criteria as multivariate
variables in multivariate G theory. Therefore, the possible misspecification of the design may
have caused these inconsistent results. Another reason could be that multiple scoring criteria
alter rating processes, which could lead to larger rater-related variations. A third reason, as
suggested by a reviewer, could be the effect of unmodeled variables that might have affected
the proper estimation of other variance components. These possibilities need to be explored
further before we suggest implications based on this finding.
We compared task and rater effects in the person-by-task-by-rater and person-by-task-
by-rating designs across L2 speaking and writing. The task effects were consistently larger
than rater (or rating) effects. In addition, the person-by-task interaction effects were larger
than the person-by-rater (or rating) interaction effects. Thus, task and task-related interaction
effects made larger contributions to the score variances than the rater and rater-related
interaction effects. Furthermore, the interaction effects were non-negligible in all designs.
Compared with tasks or raters per se, interaction effects involving persons, tasks, and raters (or
ratings and scoring criteria) made larger contributions to the score variances. These results
suggest the importance of considering interaction effects in test design, development, and
validation.
These findings, in general, concur with those of the previous studies reviewed above.
Brown (2011) reported that most studies showed a larger effect of the person-by-task
interactions than person-by-rater interactions, which is similar to the general trend of a larger
person-by-task interaction effect found in the current study. While the person-by-task
interactions were large on average in L2 speaking and writing (9.78%–19.28%), they were
not particularly large as compared to findings in other fields (e.g., 27.46% in L1 writing
reported in C. Huang, 2009). However, it should be noted that the larger impact of task-
related factors rather than that of rater-related factors may be restricted to cases in which rater
training was (likely) conducted, because a majority of the datasets (79.07%, 34/43) included
in our synthesis were obtained from studies that involved rater training. This was evidenced
by the L2 speaking results in the person-by-task-by-rater design. When the percentage of the
variance components was calculated (not shown in tables) in the three studies without rater
training (see Appendix), the variance attributable to the person-by-rater interactions (9.01%;
SD = 3.06) was larger than that attributable to the person-by-task interactions (2.07%; SD =
1.83); rater variance (4.20%; SD = 3.44), however, was smaller than task variance (13.97%;
SD = 11.25). Therefore, the finding of smaller rater-related than task-related
variances does not suggest that fewer resources should be spent on rater training; rather, the
person-by-rater interactions could be large without rater training. Additionally, variables other
than rater training, such as the quality of rating scales, may also be responsible.
Furthermore, L2 speaking and writing studies differed as follows. While rater effects
were overall small across modalities, a larger percentage of the variance was explained by the
task in L2 writing (5.49%–19.97%) than in L2 speaking (0.63%–2.06%). Similarly, the
person-by-task interactions explained a larger percentage of the variance in L2 writing than in
L2 speaking (14.53%–19.28% vs. 9.78%–13.07%). This means that the variability in rater
judgments was small overall, but the tasks differed in their levels of difficulty, and more so in
the case of L2 writing than in L2 speaking. Moreover, the ranking of the examinees differed,
depending on task difficulty, more in L2 writing. This may be explained by the use of tasks
with greater diversity of constructs and difficulty. For example, Abeywickrama (2008)
reported a massive 50.75% task variance for summary, gap-fill, and essay tasks (14.24% for
person, and 35.01% for person-by-task interactions with undifferentiated errors; partially
reported in Table 5) that measured cohesion using analytic rating scales. Because these tasks
were designed to measure
different aspects of writing constructs, they varied in difficulty. As performance assessments
comprise only a few tasks, the selection of tasks greatly influences the difficulty and
measured construct.
Although the results across the person-by-task-by-rater and person-by-task-by-rating
designs in L2 speaking and writing were similar, one difference was the larger magnitude of
the person-by-task interactions in the person-by-task-by-rating design than in the person-by-
task-by-rater design (13.07% vs. 9.78% in L2 speaking; 19.28% vs. 14.53% in L2 writing).
This suggests that the two designs, while producing similar results
overall, are not the same. In the person-by-task-by-
rating design, more raters are involved in the rating processes. Consequently, small
divergences in ratings for each response in a task could add up to larger person-by-task
interactions.
In summary, answers to Research Question 1 are that task effects were larger than
rater (or rating) effects, person-by-task interaction effects were larger than person-by-rater (or
rating) interaction effects, and that such interaction effects were larger than task or rater
effects per se.
Research Question 2: Are the degrees of the person-by-task interactions related to task
types, contexts, and scoring methods?
The synthesized results consistently showed that a high percentage of variance was
explained by the person-by-task interactions. A moderator variable analysis was conducted on
the person-by-task interactions for the person-by-task-by-rater and person-by-task-by-rating
designs. Since L2 writing had a limited number of datasets, we discuss results for L2
speaking with a specific focus on task types, contexts, and scoring methods. First, Xi and
Mollaun (2006) speculated that the person-by-task interactions would be larger for
assessments using both independent and integrated tasks, and for those based on both general
and academic contexts, than for those using only one task or context type. This
speculation was partially supported in L2 speaking, where the person-by-
task interactions were larger for datasets based on both general and academic contexts than for
those based only on an academic context, across the two designs. In addition, in the person-
by-task-by-rater design, studies that used both independent and integrated tasks produced
larger interactions than did those that used only independent tasks. This indicates that a
broader construct definition, with more diverse task types and contexts, might increase the
person-by-task interactions. However, opposite trends were also observed. In the person-by-
task-by-rater design, datasets based on both general and academic contexts produced smaller
interactions than did those based only on a general context. Further, in the person-by-task-by-
rating design, studies that used both independent and integrated tasks produced smaller
interactions than did those that used only integrated tasks. Thus, in L2 speaking, Xi and Mollaun’s
(2006) prediction is only partially corroborated.
Second, regarding scoring methods, Xi and Mollaun (2006) speculated that the
person-by-task interactions would be smaller in holistic scoring because the task-specific
language features in the scoring rubrics would have less of an influence when scored
holistically, with more underlying stable abilities in mind. This was supported in both
designs in L2 speaking, in which analytically scored datasets produced larger interactions. Thus,
scoring methods seemed to be systematically related to the size of the person-by-task
interactions.
Additionally, consistent with Xi and Mollaun (2006), the scrutiny of Grabowski
(2009) and H.-J. Kim (2009) suggests a relationship between the person-by-task interactions
and scoring criteria (or rubrics) that reflect task-specific language features. Performance
variability was particularly attributable to the analytic scoring criteria measuring sociocultural
appropriateness (e.g., metaphor), psychological appropriateness (e.g., sarcasm, irony, anger),
and sociolinguistic appropriateness (e.g., age, status, power, register). Tasks measuring these
constructs are highly contextualized. Further, tasks that measure metaphor or sarcasm are
particularly dependent on the discourse and the social contexts in which they are embedded.
As expected, examinees may have performed differently across tasks. This suggests that
obtaining large person-by-task interactions is not necessarily undesirable or construct-
irrelevant, because it may indicate that contextual effects were well operationalized in the
tasks, and that the magnitudes of such interactions indicate the size of the contextual effects
on test performance. Thus, if an assessment instrument includes task-specific language
features in the analytic scoring criteria, such as those discussed above, researchers should
expect to observe large person-by-task interactions. Nevertheless, such large interactions
weaken the ability to generalize from the examinees’ performance in the sample of tasks, to
the universe consisting of all possible tasks in the performance assessment (i.e., the universe
of generalization; Kane, Crooks, & Cohen, 1999). We should be especially careful to define
the construct narrowly enough to minimize performance variability, but broadly enough to
ensure that domain representation is not undermined. This approach could prove useful for
large-scale assessments, where increasing the number of tasks would be difficult due to
logistical and time constraints (Xi & Mollaun, 2006).
In summary, in response to Deville and Chalhoub-Deville’s (2006), Schoonen’s
(2012), and Xi and Mollaun’s (2006) call for research into contextual features that relate to
person-by-task interactions, our results on L2 speaking indicate that analytic scoring is likely
to produce larger interactions than holistic scoring; analytic scoring criteria with more task-
specific language features generate larger variations. In addition, person-by-task interactions
are generally larger for assessments based on both general and academic contexts than for
those based only on an academic context. Thus, among many contextual variables, contexts,
scoring methods, and scoring criteria may lead to varied performance over tasks, and require
test developers and researchers to pay close attention to these variables. Results for task types
were not consistent across the designs.
While these results do not appear to be surprising, these relationships had not been empirically
examined since the calls for research were raised by Deville and Chalhoub-Deville (2006),
Schoonen (2012), and Xi and Mollaun (2006). Further, the results that the person-by-task
interactions were generally larger for assessments based on a general context only than for
those based on both general and academic contexts, and based on integrated tasks only than
on both independent and integrated tasks in each design, were not in line with Xi and
Mollaun (2006) or, in general, with Fulcher (2003) and Schoonen (2012). They argued that
the inclusion of different contexts produces more variation in scores. These
inconsistent results across the studies point to the complexity of factors affecting L2
speaking and writing; explaining them requires a greater number of primary studies and research
syntheses.
Implications for language testers
Since we usually want to generalize test results across tasks, we need to reduce
variances in the person-by-task interactions. One approach is to increase the number of tasks;
when the number of tasks is doubled, the task-related variances are halved, and we obtain
higher reliability (i.e., generalizability). When it is difficult to increase the number of tasks,
another way to accomplish this goal is to use similar tasks, as this increases the score
generalizability across tasks at the expense of narrowing down the construct definition. Our
findings for the effects of context suggest that when we intend to narrow the construct
definition in L2 speaking, one option is to employ only academic-context tasks rather than
using tasks from both general and academic contexts. Further, based on our results, the use of
holistic scoring over analytic scoring may be another option since it tends to produce smaller
person-by-task variations, and thus, higher reliability, while maintaining the original construct definition. A
third method is to examine the construct we want to assess, the domain to which we want to
generalize the result, and the minimum reliability we should obtain, and then to find a middle
ground by adjusting the construct definition to the extent that we can maintain the domain
representation in an assessment. In this case, we should first examine whether the task-related
variations derived are large, and are related to the construct intended, such that it is preferable
to maintain them, like the large person-by-task variations observed in Grabowski’s (2009)
study of sociocultural appropriateness. In judging such relative magnitudes, our synthesized
percentages of variances explained by tasks, raters, and their interactions
may serve as benchmarks for deciding whether the corresponding percentages in a given study
are comparatively large.
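The effect of increasing the number of tasks can be illustrated with a decision (D) study calculation. The following is a minimal sketch and not part of the original analysis. It assumes a fully crossed person-by-task-by-rater design with random facets, and it treats the synthesized L2 writing percentages for that design (persons, 64.09%; person-by-task, 14.53%; person-by-rater, 1.27%; person-by-task-by-rater with undifferentiated errors, 13.01%) as if they were variance components from a single study. It uses the standard relative generalizability coefficient, in which each interaction component is divided by the number of conditions averaged over. The function and variable names are hypothetical.

```python
def relative_g_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    """Relative generalizability coefficient for a crossed p x t x r design.

    Each interaction component is divided by the number of conditions
    averaged over in the D study, so doubling n_tasks halves the
    person-by-task contribution to the relative error variance.
    """
    if n_tasks < 1 or n_raters < 1:
        raise ValueError("n_tasks and n_raters must be at least 1")
    relative_error = (
        var_pt / n_tasks
        + var_pr / n_raters
        + var_ptr_e / (n_tasks * n_raters)
    )
    return var_p / (var_p + relative_error)


# Assumption for illustration only: the synthesized L2 writing percentages
# for the person-by-task-by-rater design are used as variance components.
WRITING_PTR = {
    "var_p": 64.09,      # persons
    "var_pt": 14.53,     # person-by-task
    "var_pr": 1.27,      # person-by-rater
    "var_ptr_e": 13.01,  # person-by-task-by-rater with undifferentiated errors
}

if __name__ == "__main__":
    for n_tasks, n_raters in [(1, 1), (2, 1), (1, 2), (4, 2)]:
        g = relative_g_coefficient(**WRITING_PTR, n_tasks=n_tasks, n_raters=n_raters)
        print(f"tasks={n_tasks}, raters={n_raters}: G = {g:.2f}")
```

Under these assumptions, adding a second task raises the coefficient more than adding a second rater does, because the person-by-task component is much larger than the person-by-rater component.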
6 Summary, Future Research, and Limitations
This study examined (a) whether task and task-related interaction effects or rater and
rater-related interaction effects explain more of the score variances, and (b) whether the
degrees of person-by-task interactions are related to task types, contexts, and scoring methods.
A synthesis of studies using G theory was conducted, and the results showed that the task and
task-related interactions were more influential than the rater and rater-related interactions.
Regarding (b), L2 speaking studies show that the person-by-task interactions were larger for
assessments based on both general and academic contexts than for those based on academic
contexts only. Similarly, they were larger for assessments using analytic, rather than holistic,
scorings, and for analytic scoring criteria with task-specific language features such as
sociocultural or sociolinguistic appropriateness. Some of these results correspond well with
Xi and Mollaun’s (2006) prediction, but at the same time, differing trends were also observed
in some designs, which suggests the complexity of factors influencing L2 speaking and
writing.
We discuss two directions for future research. First, in addition to increasing the
number of studies investigating the effects of tasks, raters and their moderator variables for
further syntheses, it would be of great interest to compare percentages of variance
components (especially person-by-task interactions), using the same participants’ responses,
between only independent tasks, only integrated tasks, and both tasks in general and/or
academic contexts; between holistic and analytic scorings; and between speaking and writing.
We can analyze each separately or incorporate it as a facet in a generalizability study, and
examine how the person-by-task interactions would change in size accordingly, using studies
of similar designs. This method will enable researchers to hold the person facets constant, and
inspect the variables in focus. These studies are particularly needed, given the paucity of
studies on some designs observed during the current study, and that the task types
(independent vs. integrated) and contexts (general vs. academic) were often confounded.
Second, it is equally important to examine if the person-by-task interactions would be
larger for rating scales characterized by more richly contextualized and specific descriptors in
particular domains, as suggested by Xi and Mollaun (2006). Since such scales (e.g., Fulcher,
Davidson, & Kemp, 2011; Upshur & Turner, 1995) need to be designed for each task, they
are more likely to uniquely reflect the context and the complexities of language use. These
scales were not found among the datasets we analyzed, and investigation into this issue
would certainly hold promise for understanding the person-by-task interactions better.
The current study is limited in its narrow focus on synthesizing only studies on task
and rater effects that used G theory, while these effects have also been investigated using
other methods (e.g., Eckes, 2011; Knoch, 2011). The context, as defined in a G study, refers
to task characteristics such as independent/integrated task types or general/academic contexts,
and does not fully capture what Deville and Chalhoub-Deville (2006) call an “ability-in-
language user-in-context” approach to defining a construct. While this new broader approach
attempts to examine the socially and cognitively dynamic nature of context, our focus on the
G-theory-based studies necessarily restricted our operationalization of the context. Further,
we conducted an exhaustive search for studies that used G theory, and believe that the
collected studies included a reasonably representative sample of speaking and writing tasks.
However, this does not mean that these collected studies included a representative sample of
all speaking and writing tasks used in L2 testing. We focused on studies using G
theory, and we are aware of studies on task and rater effects using other analytical
frameworks. Finally, the number of datasets for certain G-study designs was not large enough
to examine, in detail, the percentages of variance components accounted for by each facet in
relation to the moderator variables. In particular, a paucity of G studies in writing
assessments examining the person-by-task interactions suggests the need for more studies to
better understand the relative effects of various facets, and the way contextual variables
moderate these effects.
Nevertheless, we hope that the current study acts as a springboard for more inquiry
into the task and rater effects in L2 speaking and writing.
Acknowledgement
An earlier version of this paper was presented at the 2013 American Association for
Applied Linguistics Conference in Dallas, Texas, USA and at the 2013 Language Testing
Research Colloquium Conference in Seoul, South Korea. We thank Rob Schoonen for
helping us interpret our results, Yuko Hijikata and Wei-Li Hsu for locating and retrieving
some studies for the current synthesis, Yujia Zhou for assisting us in coding the articles
written in Chinese, and the Editor and two anonymous reviewers for their valuable comments
on earlier versions of this paper. This work was partially funded by the Japan Society for the
Promotion of Science (JSPS) KAKENHI, Grant-in-Aid for Scientific Research (C), Grant
Numbers 26370736 and 26370737. The authors contributed equally to this work. The coding
tables are available in The University of York & Georgetown University (n.d.).
References
References marked with an asterisk indicate articles included in the synthesis.
*Abdul Kadir, K. (2008). Framing a validity argument for test use and impact: The
Malaysian public service experience (Doctoral dissertation). Retrieved from ProQuest.
(AAT 3337680)
*Abeywickrama, P.-S. (2008). Measuring the knowledge of textual cohesion and coherence in
learners of English as a second language (ESL) (Doctoral dissertation). Retrieved from
ProQuest. (AAI 3288206)
*Alharby, E. R. (2006). A comparison between two scoring methods, holistic vs. analytic,
using two measurement models, the Generalizability Theory and the Many-facet Rasch
Measurement, within the context of performance assessment (Doctoral dissertation).
Retrieved from ProQuest. (AAT 3236860)
Bachman, L. F. (2007). What is the construct? The dialectic of abilities and contexts in
defining constructs in language assessment. In J. Fox, M. Wesche, D. Bayliss, L. Cheng,
C. E. Turner, & C. Doe (Eds.), Language testing reconsidered (pp. 41–71). University
of Ottawa Press.
*Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and
rater judgments in a performance test of foreign language speaking. Language Testing,
12, 239–257.
Bachman, L., & Palmer, A. (2010). Language assessment in practice. Oxford, UK: Oxford
University Press.
*Bae, J., & Bachman, L. F. (2010). An investigation of four writing traits and two tasks
across two languages. Language Testing, 27, 213–234.
*Banno, E. (2008). Investigating an oral placement test for learners of Japanese as a second
language (Doctoral dissertation). Retrieved from ProQuest. (AAI 3300329)
*Barkaoui, K. (2007). Rating scale impact on EFL essay marking: A mixed-method study.
Assessing Writing, 12, 86–107.
Bolus, R. E., Hinofotis, F. B., & Bailey, K. M. (1982). An introduction to generalizability
theory in second language research. Language Learning, 32, 245–258.
Brennan, R. L. (1996). Generalizability of performance assessments. In Phillips (Ed.),
Technical issues in large-scale performance assessment (pp. 19–58). Washington, DC:
National Center for Educational Statistics.
Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer.
Brown, A. (Ed.). (2012). ILTA bibliography of language testing (6th Edition, 1999–2011) By
Category. Retrieved from
http://www.iltaonline.com/images/pdfs/2011_by_category.pdf
Brown, J. D. (2011). What do the L2 generalizability studies tell us? International Journal of
Assessment and Evaluation in Education, 1, 1–37.
*Brown, J. D., & Bailey, K. M. (1984). A categorical instrument for scoring second language
writing skills. Language Learning, 34, 21–38.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F.
Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and
language testing research (pp. 32–70). New York, NY: Cambridge University Press.
Deville, C., & Chalhoub-Deville, M. (2006). Old and new thoughts on test score variability:
Implications for reliability and validity. In M. Chalhoub-Deville, C. A. Chapelle, & P.
Duff (Eds.), Inference and generalizability in applied linguistics: Multiple perspectives
(pp. 9–25). Amsterdam, the Netherlands: John Benjamins.
*Eckes, T. (2011). Introduction to many-facet Rasch measurement: Analyzing and evaluating
rater-mediated assessments. Frankfurt am Main: Peter Lang.
Fulcher, G. (1993). The construction and validation of rating scales for oral tests in English
as a foreign language. Unpublished PhD dissertation, University of Lancaster.
Fulcher, G. (2003). Testing second language speaking. Essex, U.K.: Pearson Education
Limited.
Fulcher, G., & Davidson, F. (2012). The Routledge handbook of language testing. New York:
Routledge.
Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking
tests: Performance decision trees. Language Testing, 28, 5–29.
Gebril, A. (2009). Score generalizability of academic writing tasks: Does one test method fit
it all? Language Testing, 26, 507–531.
*Grabowski, K. C. (2009). Investigating the construct validity of a test designed to measure
grammatical and pragmatic knowledge in the context of speaking (Doctoral
dissertation). Retrieved from ProQuest. (AAI 3368256)
*Hirai, A., & Koizumi, R. (2008). Validation of an EBB scale: A case of the Story Retelling
Speaking Test. JLTA (Japan Language Testing Association) Journal, 11, 1–20.
Huang, C. (2009). Magnitude of task-sampling variability in performance assessment: A
meta-analysis. Educational and Psychological Measurement, 69, 887–912.
*Huang, J. (2008). How accurate are ESL students’ holistic writing scores on large-scale
assessments? A generalizability theory approach. Assessing Writing, 13, 201–218.
*Huang, J. (2012). Using generalizability theory to examine the accuracy and validity of
large-scale ESL writing assessment. Assessing Writing, 17, 123–139.
ILTA Bibliography of PhDs in Language Testing. (2011). Retrieved from
http://www.iltaonline.com/images/pdfs/phds_2010.pdf
In’nami, Y., & Koizumi, R. (2009). A meta-analysis of test format effects on reading and
listening test performance: Focus on multiple-choice and open-ended formats.
Language Testing, 26, 219–244. doi:10.1177/0265532208101006
In’nami, Y., & Koizumi, R. (2010). Database selection guidelines for meta-analysis in applied
linguistics. TESOL Quarterly, 44, 169–184. doi:10.5054/tq.2010.215253
In’nami, Y., & Koizumi, R. (2014). Research synthesis and meta-analysis in second language
learning and testing. English Teaching & Learning, 38(3), 1–27.
doi:10.6330/ETL.2014.38.3.01
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational
Measurement: Issues and Practice, 18(2), 5–17.
*Kim, H.-J. (2009). Investigating the effects of context and task type on second language
speaking ability (Doctoral dissertation). Retrieved from ProQuest. (AAT 3368349)
*Kim, Y.-H. (2009). A G-theory analysis of rater effect in ESL speaking assessment. Applied
Linguistics, 30, 435–440.
*Kinshi, K., Kuru, Y., Masaki, M., Yamanishi, H., & Otoshi, J. (2011). Revising a writing
rubric for its improved use in the classroom. LET Kansai Chapter Collected Papers, 13,
113–124.
Knoch, U. (2011). Investigating the effectiveness of individualized feedback to rating
behavior–a longitudinal study. Language Testing, 28, 179–200.
*Kondo, Y. (2010). Examination of rater training effect and rater eligibility in L2
performance assessment. Journal of Pan-Pacific Association of Applied Linguistics, 14,
1–23.
*Lee, Y.-W. (2005). Dependability of scores for a new ESL speaking test: Evaluating
prototype tasks (Monograph Series, MS-28). Retrieved from
http://www.ets.org/Media/Research/pdf/RM-04-07.pdf
Lee, Y.-W., Breland, H., & Muraki, E. (2004). Comparability of TOEFL CBT prompts for
different native language groups (RR-04-24). Retrieved from
http://www.ets.org/Media/Research/pdf/RR-04-24.pdf
*Lee, Y.-W., Golub-Smith, M., Payton, C., & Carey, J. (2001). The score reliability of the Test
of Spoken English (TSE) from the generalizability theory perspective: Validating the
current procedure. Retrieved from ERIC database. (ED458241)
*Lee, Y.-W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating
prototype task and alternative rating schemes (Monograph Series, MS-31). Retrieved
from http://www.ets.org/Media/Research/pdf/RR-05-14.pdf
Lin, C.-K. (2014). Treating either ratings or raters as a random facet in performance-based
language assessments: Does it matter? CaMLA Working Papers 2014-01. Cambridge
Michigan Language Assessments. Retrieved from
http://www.cambridgemichigan.org/sites/default/files/resources/workingpapers/CWP-2014-01.pdf
Linacre, J. M. (2013). A user’s guide to FACETS: Rasch-Model computer programs (Program
manual 3.71.0). Retrieved from http://www.winsteps.com/a/facets-manual.pdf
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Lumley, T., & O’Sullivan, B. (2005). The effect of test-taker gender, audience and topic on
task performance in tape-mediated assessment of speaking. Language Testing, 22, 415–
437.
Luoma, S. (2004). Assessing speaking. Cambridge, UK: Cambridge University Press.
*Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and Many-facet Rasch
measurement in the development of performance assessments of the ESL speaking
skills of immigrants. Language Testing, 15, 158–180.
*Mizumoto, A. (2008). Jiyu eisakubun ni okeru hyoteisha hyoka no shurui to shinraisei
[Types of evaluation by raters and reliability in an English essay]. The Institute of
Statistical Mathematics cooperative research report, 215, 43–49.
*Molloy, H., & Shimura, M. (2005). An examination of situational sensitivity in medium-
scale interlanguage pragmatics research. In T. Newfields, Y. Ishida, M. Chapman, & M.
Fujioka (Eds.), Proceedings of the May 22–23, 2004 JALT Pan-SIG Conference (pp.
16–32). Retrieved from http://www.jalt.org/pansig/2004/HTML/ShimMoll.htm
*Nekoda, H. (2006). Developing a standard of language proficiency required for English
teachers: Using generalizability theory and item response theory. Annual Review of
English Language Education in Japan, 17, 191–200.
Norris, J. M., & Ortega, L. (2006). The value and practice of research synthesis for language
learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on
language learning and teaching (pp. 3–50). Philadelphia, PA: John Benjamins.
*Ohkubo, N. (2006). Shido to hyoka no ittaika wo mezashita shinraisei no takai eisakubun
hyoka kijun hyo no sakusei: Tahenryo ippanka kanousei riron wo mochiite
[Development of a reliable and valid scale of English writing assessment using
multivariate generalizability theory]. STEP Bulletin, 18, 14–29.
Ortega, L. (2008). Understanding second language acquisition. New York: Routledge.
Oswald, F. L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and
challenges. Annual Review of Applied Linguistics, 30, 85–110.
Robinson, P. (Ed.). (2012). The Routledge encyclopedia of second language acquisition. New
York: Routledge.
*Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment:
Reporting a score profile and a composite. Language Testing, 24, 355–390.
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation
modeling. Language Testing, 22, 1–30.
Schoonen, R. (2012). The generalisability of scores from language tests. In G. Fulcher & F.
Davidson (Eds.), The Routledge handbook of language testing (pp. 363–377). New
York: Routledge.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Thousand Oaks,
CA: Sage.
Shohamy, E., & Hornberger, N. H. (Eds.). (2010). Encyclopedia of language and education:
Volume 7: Language testing and assessment. New York: Springer.
Skehan, P. (1998). A cognitive approach to language learning. Oxford, UK: Oxford
University Press.
*Stansfield, C. W., & Kenyon, D. M. (1992a). Research on the comparability of the oral
proficiency interview and the simulated oral proficiency interview. System, 20, 347–364.
*Stansfield, C. W., & Kenyon, D. M. (1992b). The development and validation of a
Simulated Oral Proficiency Interview. Modern Language Journal, 76, 129–141.
*Tang, X. (2006). Investigating the score reliability of the English as a foreign language
performance test (Doctoral dissertation). Retrieved from ProQuest. (MR18823)
The University of York & Georgetown University. (n.d.). IRIS: A digital repository of data
collection instruments for research into second language learning and teaching.
Retrieved from http://www.iris-database.org/iris/app/home/index
Upshur, J., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT
Journal, 49, 3–12.
van Moere, A. (2006). Validity evidence in a university group oral test. Language Testing, 23,
411–440.
*Wang, H. (2010). Investigating the justifiability of an additional test use: An application of
assessment use argument to an English as a foreign language test (Doctoral
dissertation). Retrieved from ProQuest. (AAT 3441468)
*Wang, L., Eignor, D., & Enright, M. K. (2008). A final analysis. In C. A. Chapelle, M. K.
Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as
a Foreign Language (pp. 259–318). New York: Routledge.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15,
263–287.
*Xi, X. (2003). Investigating language performance on the graph description task in a semi-
direct oral test (Doctoral dissertation). Retrieved from ProQuest. (3100694)
Xi, X. (2007). Evaluating analytic scoring for the TOEFL® Academic Speaking Test (TAST)
for operational use. Language Testing, 24, 251–286.
Xi, X. (2015, March). Language constructs revisited for practical test design, development
and validation. Paper presented at the 37th Language Testing Research Colloquium,
Toronto, Ontario, Canada.
*Xi, X., & Mollaun, P. (2006). Investigating the utility of analytic scoring for the TOEFL
Academic Speaking Test (TAST). (TOEFL iBT Research Report, RR-06-07). Retrieved
from http://www.ets.org/Media/Research/pdf/RR-06-07.pdf
*Xi, X., & Mollaun, P. (2011). Using raters from India to score a large-scale speaking test.
Language Learning, 61, 1222–1255.
*Yamanishi, H. (2005). Ippanka kanousei riron wo mochiita kokosei no jiyueisakubun hyoka
no kento [Using Generalizability Theory in the evaluation of L2 writing]. JALT Journal,
27, 169–185.
*Zhou, Y. (2012). Generalizability of scores on structured and constructed-response tasks in
computer-delivered speaking assessment. Manuscript submitted for publication.
Appendix
The Moderator Variables in the Person-by-Task-by-Rater (Rating) Designs in Speaking and Writing

| Author(s), N | p x t (Range for all datasets)ᵃ | Study context; examinees’ L1/L2 | Publication status; G-study type | Task type; context; scoring type (Analytic scoring criteria) | Rater type; training |
| Speaking p x t x r | | | | | |
| Grabowski (2009), 102 | 17.33 (17.33–63.30) | Second; varied & E | Dissertation; multivariate | Independent; general; analytic (Grammatical accuracy and other 4 criteria) | Native; yes |
| Xi & Mollaun (2006), 140 | 13.75 (13.73–21.27) | Second; varied & E | Published; multivariate | Both; both; analytic (Delivery and other 2 criteria) | Both; yes |
| H.-J. Kim (2009), 162 | 7.49 (4.44–38.16) | Second; varied & E | Dissertation; multivariate | Independent, integrated; both; analytic (Grammatical competence and other 5 criteria) | Unknown; yes |
| Stansfield & Kenyon (1992a), 40 | 6.80 (0.00–9.00) | Varied; varied & Hebrew | Published; univariate | Independent; general; holistic | Unknown; maybe yes |
| Xi & Mollaun (2011), 100 | 6.68 (6.68–12.55) | Second; varied & E | Published; univariate | Both; both; holistic | Nonnative; yes |
| Stansfield & Kenyon (1992b), 16 | 4.51 (3.85–5.16) | Foreign; unknown & Indonesian | Published; univariate | Independent; general; holistic | Both; maybe yes |
| Xi (2003), 20 | 3.91 | Varied; varied & E | Dissertation; univariate | Independent; both; holistic | Unknown; yes |
| Banno (2008), 6 | 3.64 for holistic; 4.73 for analytic | Second; Chinese & Japanese | Dissertation; univariate | Independent; general; holistic and analytic (5 criteria) | Native; no |
| Nekoda (2006), 26 | 2.48 | Foreign; Japanese & E | Published; univariate | Independent; academic; holistic | Nonnative; no |
| Y.-H. Kim (2009), 10 | 0.06 (0.06–0.28) | Second; Korean & E | Published; univariate | Independent; academic; holistic | Both; no |
| Speaking p x t x r’ | | | | | |
| Tang (2006), 1099 | 21.97 (11.00–23.50) | Foreign; Chinese & E | Dissertation; univariate | Integrated; both; analytic | Unknown; yes |
| Lee (2005), 261 | 17.27 | Both; varied & E | Published; univariate | Both; both; holistic | Unknown; maybe yes |
| Lee et al. (2001), 1,766 | 11.45 (11.45–12.74) | Second; varied & E | Unpublished; univariate | Independent; general; holistic | Native; maybe yes |
| Wang et al. (2008), 373 | 4.00 | Second; varied & E | Published; univariate | Both; both; holistic | Native; yes |
| Sawaki (2007), 214 | 3.45 (0.00–3.45) | Foreign; English & Spanish | Published; multivariate | Integrated; academic; analytic (Pronunciation and other 4 criteria) | Both; yes |
| Bachman et al. (1995), 218 | 1.04 | Foreign; varied & Spanish | Published; univariate | Integrated; academic; analytic (Grammar) | Both; yes |
| Writing p x t x r | | | | | |
| Lee & Kantor (2005), 162 | 17.93 | Both; varied & E | Published; univariate | Both; both; holistic | Unknown; yes |
| Bae & Bachman (2010), 317 | 13.27 (6.41–24.69) | Foreign; Korean & E | Published; univariate | Independent; general; analytic (Content and grammar) | Unknown; unknown |
| Barkaoui (2007), 16 | 5.01 for holisticᵇ | Foreign; unknown & E | Published; univariate | Independent; general; holistic and analytic (5 criteria) | Both; no |
| Writing p x t x r’ | | | | | |
| J. Huang (2008), 323 | 32.97 (26.55–39.25) | Second; varied & E | Published; univariate | Both; academic; holistic | Unknown; maybe yes |
| Lee & Kantor (2005), 488 | 20.00 | Both; varied & E | Published; univariate | Both; both; holistic | Unknown; yes |
| Wang et al. (2008), 2677 | 18.32 | Second; varied & E | Published; univariate | Both; both; holistic | Native; maybe yes |
| J. Huang (2012), 154 | 4.84 (3.57–10.11) | Second; varied & E | Published; univariate | Integrated; academic; analytic | Unknown; yes |

Note. Underlined = randomly selected and used for the synthesis in Tables 3 or 5. E = English. ᵃThis is reported to show that our random selection of one dataset among several within a single study did not considerably affect the results. ᵇ37.58 for analytic.
Table 1
The Characteristics of the Datasets and Moderator Variables

| | Speaking (k = 28) | Writing (k = 22) |
| Number of examinees | | |
| Mean (SD) | 409.61 (658.63) | 308.14 (594.03) |
| Minimum–maximum | 6–2305 | 16–2677 |
| Number of tasks | (k = 20) | (k = 10) |
| Mean (SD) | 6.00 (4.12) | 2.16 (2.46) |
| Minimum–maximum | 2–12 | 1–12 |
| Number of raters | (k = 15) | (k = 15) |
| Mean (SD) | 14.29 (33.59)ᵃ | 3.22 (1.96) |
| Minimum–maximum | 2–130ᵃ | 2–10 |
| Number of ratings | (k = 9) | (k = 4) |
| Mean (SD) | 3.33 (3.62) | 2.50 (1.00) |
| Minimum–maximum | 2–12 | 2–4 |
| Number of scoring criteria | (k = 2) | (k = 8) |
| Mean (SD) | 13.50 (14.85) | 6.13 (3.72) |
| Minimum–maximum | 3–24 | 3–15 |
| Number of studies | | |
| G-study design: | | |
| Person x task | 4 (see Table 2) | 3ᵇ |
| Person x rater | 3ᶜ | 4ᵈ |
| Person x rating | 3ᵉ | 0 |
| Person x task x rater | 10 (see Appendix) | 3 (see Appendix) |
| Person x task x rating | 6 (see Appendix) | 4 (see Appendix) |
| Person x rater x criterion | 2ᶠ | 8ᵍ |
| L2 research context: Second/foreign language/both | 16, 8, 4 | 11, 8, 3 |
| Examinees’ L1: Varied/Japanese/Chinese/Arabic/others | 18, 3, 2, 0, 2 | 8, 5, 1, 2, 1 |
| Examinees’ L2: English/Spanish/others | 22, 2, 4 | 21, 0, 1 |
| Scoring method: Holistic vs. analytic | 18, 10 | 6, 16 |
| Publication status: Published/Dissertation/Unpublished | 16, 8, 4 | 14, 8, 0 |
| Type of G study: Uni- vs. multivariate | 21, 7 | 20, 2 |
| Task: Independent/integrated/both | (k = 20) 11, 4, 5 | (k = 10) 4, 2, 4 |
| Task: General/academic/both/occupational | 6, 4, 9, 1 | 4, 2, 3, 1 |
| Rater: Native/nonnative/both | (k = 24) 9, 4, 4 | (k = 19) 5, 2, 1 |
| Training: Yes/Maybe yesʰ/No | 13, 8, 3 | 10, 3, 4 |

Note. The totals do not always equal the total number of datasets since some did not report such information. k = the number of datasets. ᵃStatistics excluding Banno (2008) are as follows: M = 5.38, SD = 4.56, range = 2–14. ᵇ(Abdul Kadir, 2008; Abeywickrama, 2008; Molloy & Shimura, 2005). ᶜ(Abdul Kadir, 2008; Grabowski, 2009; Stansfield & Kenyon, 1992a). ᵈ(Abdul Kadir, 2008; Abeywickrama, 2008; Alharby, 2006; Wang, 2010). ᵉ(Lee et al., 2001; Wang et al., 2008; Zhou, 2012). ᶠ(Kondo, 2010; Lynch & McNamara, 1998). ᵍ(Abdul Kadir, 2008; Alharby, 2006; Brown & Bailey, 1984; Eckes, 2011; Kinshi et al., 2011; Mizumoto, 2008; Ohkubo, 2006; Yamanishi, 2005). ʰStudies that did not mention rater training but probably used it, considering the standard rating procedures of the test (e.g., rating the TOEFL iBT tasks).
Table 2
The Datasets for the Person-by-Task Design in Speaking

| | person (p) | task (t) | p x t |
| Lee et al. (2001, N = 1766) | 72.76 | 0.69 | 26.55 |
| Lee (2005, N = 261) | 73.83 | 0.52 | 25.65 |
| Hirai & Koizumi (2008, N = 49) | 70.90 | 1.20 | 27.90 |
| Abdul Kadir (2008, N = 160) | 88.86 | 0.00 | 11.14 |
| Sample-size-weighted mean | 74.00 | 0.63 | 25.38 |
| SD | 8.27 | 0.50 | 7.84 |
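
The sample-size-weighted mean row of Table 2 can be illustrated with a minimal sketch. It assumes that each dataset's percentage is weighted by its number of examinees (N) and that the four rows above are the only inputs; the names `TABLE_2`, `weighted_mean`, and `synthesize` are hypothetical and are not taken from the original analysis.

```python
# Table 2 rows: (study, N, person %, task %, person-by-task %).
TABLE_2 = [
    ("Lee et al. (2001)", 1766, 72.76, 0.69, 26.55),
    ("Lee (2005)", 261, 73.83, 0.52, 25.65),
    ("Hirai & Koizumi (2008)", 49, 70.90, 1.20, 27.90),
    ("Abdul Kadir (2008)", 160, 88.86, 0.00, 11.14),
]


def weighted_mean(values, weights):
    """Return the mean of `values`, weighting each value by its sample size."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def synthesize(rows):
    """Hypothetical helper: weighted mean of each variance component (%)."""
    sizes = [row[1] for row in rows]  # assumption: weight = number of examinees
    labels = ("p", "t", "p x t")
    return {
        label: weighted_mean([row[2 + i] for row in rows], sizes)
        for i, label in enumerate(labels)
    }


nm = synthesize(TABLE_2)
for label, value in nm.items():
    print(f"{label}: {value:.2f}")
```

Because the inputs are the rounded percentages printed in the table, the result should agree with the reported means only to within rounding in the last decimal place.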
Table 3
The Synthesized Percentages of the Variance Components for Speaking

Person (p) x task (t) (k = 4)
| | p | t | p x t |
| NM | 74.00 | 0.63 | 25.38 |
| SD | 8.27 | 0.50 | 7.84 |
| Min | 70.90 | 0.00 | 11.14 |
| Max | 88.86 | 1.20 | 27.90 |

Person (p) x rater (r) (k = 3)
| | p | r | p x r |
| NM | 86.16 | 3.32 | 10.51 |
| SD | 6.63 | 2.79 | 3.88 |
| Min | 80.15 | 0.10 | 6.61 |
| Max | 93.30 | 5.68 | 14.17 |

Person (p) x rating (r’) (k = 3)
| | p | r’ | p x r’ |
| NM | 85.68 | 1.15 | 13.17 |
| SD | 0.56 | 0.83 | 0.323 |
| Min | 85.48 | 0.00 | 12.86 |
| Max | 86.56 | 1.66 | 13.44 |

Person (p) x task (t) x rater (r) (k = 10)
| | p | t | r | p x t | p x r | t x r | p x t x r |
| NM | 67.74 | 2.06 | 1.91 | 9.78 | 1.85 | 1.05 | 15.61 |
| SD | 16.16 | 6.66 | 2.52 | 5.24 | 4.36 | 3.65 | 8.65 |
| Min | 33.35 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 4.40 |
| Max | 92.45 | 21.47 | 6.46 | 17.33 | 11.38 | 11.70 | 27.88 |

Person (p) x task (t) x rating (r’) (k = 6)
| | p | t | r’ | p x t | p x r’ | t x r’ | p x t x r’ |
| NM | 57.82 | 0.70 | 0.05 | 13.07 | 6.38 | 0.26 | 21.72 |
| SD | 19.91 | 0.72 | 0.05 | 8.49 | 4.27 | 0.38 | 11.03 |
| Min | 51.26 | 0.00 | 0.00 | 1.04 | 0.00 | 0.00 | 4.00 |
| Max | 92.00 | 1.84 | 0.12 | 21.97 | 10.35 | 0.92 | 31.72 |

Person (p) x rater (r) x scoring criterion (c) (k = 2)
| | p | r | c | p x r | p x c | r x c | p x r x c |
| NM | 40.78 | 12.13 | 4.09 | 10.57 | 4.99 | 2.52 | 24.90 |
| SD | 3.45 | 7.26 | 1.01 | 6.09 | 0.19 | 3.86 | 0.068 |
| Min | 38.18 | 6.67 | 3.42 | 6.53 | 4.85 | 0.00 | 24.85 |
| Max | 43.06 | 16.93 | 4.85 | 15.15 | 5.12 | 5.45 | 24.95 |

Note. k = Number of datasets; NM = Sample-size-weighted mean; Min = Minimum; Max = Maximum. Boldfaced and italicized figures are those related to research questions from our study. This note also applies to Tables 4 and 5.
Table 4
The Characteristics of Task Types, Contexts, and Scoring and Synthesized Percentages of the Variance Components of the Person-by-Task Interactions

| Design | Statistic | Task type: Independent | Task type: Integrated | Task type: Both | Context: General | Context: Academic | Context: Both | Scoring type: Holistic | Scoring type: Analytic |
| Speaking: Person (p) x task (t) x rater (r) | k | 8 | 0 | 2 | 4 | 2 | 4 | 7 | 3 |
| | NM | 6.41 | | 10.80 | 13.01 | 1.80 | 9.20 | 6.37 | 12.14 |
| | SD | 5.22 | | 5.00 | 6.32 | 1.71 | 4.15 | 2.36 | 4.98 |
| | Min | 0.06 | | 6.68 | 3.64 | 0.06 | 3.91 | 0.06 | 7.49 |
| | Max | 17.33 | | 13.75 | 17.33 | 2.48 | 13.75 | 6.80 | 17.33 |
| Speaking: Person (p) x task (t) x rating (r’) | k | (1) | 3 | 2 | (1) | 2 | 3 | 3 | 3 |
| | NM | (11.45) | 16.40 | 9.55 | (11.45) | 2.23 | 17.43 | 10.95 | 16.40 |
| | SD | NA | 11.5 | 9.53 | NA | 1.71 | 9.35 | 6.75 | 11.5 |
| | Min | (11.45) | 1.04 | 4.00 | (11.45) | 1.04 | 4.00 | 4.00 | 1.04 |
| | Max | (11.45) | 21.97 | 17.27 | (11.45) | 3.45 | 21.97 | 17.27 | 21.97 |
| Writing: Person (p) x task (t) x rater (r) | k | 2 | 0 | (1) | 2 | 0 | (1) | 2 | (1) |
| | NM | 12.87 | | (17.93) | 12.87 | | (17.93) | 16.77 | (13.27) |
| | SD | 5.36 | | NA | 5.36 | | NA | 7.82 | NA |
| | Min | 5.01 | | (17.93) | 5.01 | | (17.93) | 5.01 | (13.27) |
| | Max | 13.27 | | (17.93) | 13.27 | | (17.93) | 17.93 | (13.27) |
| Writing: Person (p) x task (t) x rating (r’) | k | 0 | (1) | 3 | 0 | 2 | 2 | 3 | (1) |
| | NM | | (4.84) | 19.91 | | 23.89 | 18.58 | 19.91 | (4.84) |
| | SD | | NA | 8.02 | | 19.89 | 1.19 | 8.02 | NA |
| | Min | | (4.84) | 18.32 | | 4.84 | 18.32 | 18.32 | (4.84) |
| | Max | | (4.84) | 32.97 | | 32.97 | 20.00 | 32.97 | (4.84) |

Note. When only one dataset was available, we bracketed it to show that we did not interpret it.
Table 5
The Synthesized Percentages of the Variance Components for Writing

Person (p) x task (t) (k = 3)
| | p | t | p x t |
| NM | 44.83 | 19.97 | 35.20 |
| SD | 39.55 | 26.05 | 21.48 |
| Min | 14.24 | 0.45 | 8.93 |
| Max | 90.61 | 50.75 | 51.54 |

Person (p) x rater (r) (k = 4)
| | p | r | p x r |
| NM | 82.62 | 4.99 | 12.40 |
| SD | 28.40 | 14.01 | 15.66 |
| Min | 28.42 | 0.05 | 3.60 |
| Max | 96.00 | 29.74 | 41.84 |

Person (p) x task (t) x rater (r) (k = 3)
| | p | t | r | p x t | p x r | t x r | p x t x r |
| NM | 64.09 | 5.49 | 1.13 | 14.53 | 1.27 | 0.50 | 13.01 |
| SD | 23.75 | 4.14 | 2.71 | 6.54 | 0.77 | 0.81 | 28.55 |
| Min | 26.23 | 0.00 | 0.77 | 5.01 | 0.00 | 0.00 | 6.63 |
| Max | 73.72 | 4.34 | 5.73 | 17.93 | 1.38 | 1.43 | 61.60 |

Person (p) x task (t) x rating (r’) (k = 4)
| | p | t | r’ | p x t | p x r’ | t x r’ | p x t x r’ |
| NM | 60.85 | 9.46 | 0.36 | 19.28 | 1.06 | 0.10 | 8.90 |
| SD | 13.83 | 6.42 | 1.14 | 9.53 | 15.97 | 0.00 | 12.67 |
| Min | 40.66 | 1.10 | 0.00 | 4.84 | 0.00 | 0.00 | 3.05 |
| Max | 67.94 | 10.77 | 3.30 | 32.97 | 22.58 | 1.10 | 29.23 |

Person (p) x rater (r) x scoring criterion (c) (k = 8)
| | p | r | c | p x r | p x c | r x c | p x r x c |
| NM | 16.24 | 21.15 | 3.77 | 10.75 | 4.83 | 3.46 | 39.80 |
| SD | 19.78 | 22.50 | 3.80 | 3.15 | 4.85 | 7.90 | 23.61 |
| Min | 0.15 | 0.00 | 0.19 | 7.16 | 0.09 | 0.00 | 9.14 |
| Max | 60.00 | 61.10 | 11.91 | 17.41 | 13.29 | 22.15 | 77.74 |