
Greta Gorsuch
Dale Griffee
Introduction and Description of a Rater Training Design and Evaluation Model
In our previous article, “Evaluating
and Improving Rater Training for ITA Performance Tests”
(Gorsuch, Florence, & Griffee, 2016), we recounted empirical
evidence that led us to believe our rater training for our teaching
simulation test (ITA Performance Test V.9) needed to be improved. We
believed raters were uncertain about the rating criteria. We also
believed that rater background factors (native-speaker vs
nonnative-speaker status, level of education, and level of experience
with ITA learner populations), over which we had little control, had
undue negative effects on the reliability of our ratings. This seemed a
critical issue, as the ITA Performance Test is a high-stakes assessment.
Using the literature from mainstream evaluation (e.g., Cizek &
Bunch, 2007; Hambleton, 2008) and from our own field of L2 testing and
evaluation, we created a proto-model with the following components,
listed in roughly chronological order:
- Demonstrable clarity of the test constructs, the test
task, and performance standards. Demonstrable = easy to teach to
others.
- Raters have appropriate backgrounds.
- Rater training procedure has essential features, including
repeated and iterative experience with real performance data,
quantitative and qualitative feedback, and discussion.
- Raters are queried on the rater training and responses are compiled and used to make decision points.
- Ratings are analyzed quantitatively for interrater and intrarater consistency.
We created new training materials that we hoped had
“demonstrable clarity,” including a new scoring system that had input from
the raters (ITA Performance Test V. 10.1 and ITA Performance V. 10.1
Training Descriptors). We conducted the training using essential
features including repeated and iterative experience with real
performance data, various forms of feedback, and rater-directed
discussion (for details see Gorsuch et al., 2016). In this article, we
report on the final two components of the model, which compose the
evaluation component of the rater training. In other words, how did we
do?
Results of Rater Training Questionnaire
We sent an electronic questionnaire out in the middle of our
intensive 3-week workshop in July and August 2015. The ITA Performance Test is
given twice during the workshop, once after 5 days as a practice and
feedback opportunity, and once again after 10 days as one of the three
final measures used to make decisions about whether an ITA candidate is
approved to teach or not. Thus, the questionnaire was sent out 2 or 3
days after raters had rated 70 candidates on the practice ITA
Performance Test. We had multiple purposes. First, we wanted to identify
any significant questions raters continued to have about the test
criteria. Second, we wanted to encourage raters’ cognitive review of the
ITA Performance Test criteria (see component #3 of the model). And
third, we wanted to evaluate the rater training we had conducted at the
beginning of the workshop. Table 1 shows what we asked and what we learned.
Table 1. Results of the rater training questionnaire
We had the practice ITA Performance Test early this week. Do you have any initial comments about the test? (5 answered, 2 skipped)
- I'm glad the presentations were 10 minutes long. It was much less stressful. I had more time to pay attention to the different criteria. I'd like some clarification on how to rate criteria #8 [audience noncomprehension awareness] and #10 [handling questions].
- I felt that some students needed more time for feedback from the instructors. Overall, administrating a practice performance test is an excellent procedure to help them succeed in this workshop and teaching career in general :)
- I think that the practice performance test was very effective and useful to the ITA candidates.
- The test is comprehensive and clear. It provides a rater with reliable assessment measures.
- Need to work on getting other students to ask questions. I ended up asking most of the questions.

To what extent do you believe the rater training prepared you to rate ITA candidates on the practice ITA Performance Test?
7 respondents = Very well
0 = Moderately well
0 = Not very well

This rating scale is about how clear you feel about rating specific criteria of the ITA Performance Test.

I feel clear rating learners on the word-level pronunciation criterion.
7 respondents = Yes
0 = Sort of
0 = No

I feel clear rating learners on the word stress pronunciation criterion.
7 respondents = Yes
0 = Sort of
0 = No

I feel clear rating learners on the thought groups criterion.
7 respondents = Yes
0 = Sort of
0 = No

I feel clear rating learners on the grammatical structures criterion.
5 respondents = Yes
2 = Sort of
0 = No

I feel clear rating learners on the transitional phrases criterion.
5 respondents = Yes
2 = Sort of
0 = No

I feel clear rating learners on the definitions and examples criterion.
5 respondents = Yes
2 = Sort of
0 = No

I feel clear rating learners on the prominence criterion.
6 respondents = Yes
1 = Sort of
0 = No

I feel clear rating learners on the audience noncomprehension awareness criterion.
5 respondents = Yes
2 = Sort of
0 = No

I feel clear rating learners on the tone choices criterion.
6 respondents = Yes
1 = Sort of
0 = No

I feel clear rating learners on the handling questions criterion.
4 respondents = Yes
3 = Sort of
0 = No
On the one hand, raters reported they felt the rater training had
prepared them for their work. Further, previously problematic criteria
such as word stress, prominence, and thought groups were treated with
more certainty by the raters, with six or more raters saying they could
rate ITA candidates on these criteria with clarity. These criteria
focus on discourse intonation, which is an array of features of
continuous speech that can be difficult to learn how to rate in
real-time conditions. On the other hand, the criteria of transitional
phrases, and definitions and examples, also shown to be problematic in a
previous study, continued to be treated with less certainty by raters,
with five raters saying they felt clear rating students, but two raters
saying they felt only “sort of” clear rating students on the criteria.
Further, the handling questions criterion emerged as an area of concern,
with only four raters saying they felt clear rating ITA candidates, and
three saying they only felt “sort of” clear rating ITAs.
Results From Actual Ratings
In accordance with component #5 of our model, four types of
analysis were done on raters’ actual ratings from the final presentation
test during the workshop. These were
- interrater reliability analysis,
- third rating analysis,
- Mann-Whitney U tests on comparisons of total scores between passing and failing students, and
- MANOVA analysis to estimate the effect sizes of each of the
10 criteria in discerning passing and failing students.
Interrater Reliability and Third Rating Analysis
To do the correlations, we entered the ITA candidates’ scores
on all 10 criteria by rater team (three teams of two raters apiece) and
created a total score column for each rater within a team. Then the
total scores of the two raters for each team were correlated (see Table
2). Correlation is a standard, must-do, and easy statistical procedure
for estimating how close two raters on the same team are to each other.
Table 2 shows a gradual increase in reliability from Team One to Team
Three. The need for third ratings was based on real-world consequences;
in other words, failure on this test meant failing the workshop. The
third rating results show that Raters 1 and 2 were too often too far apart.
Table 2. Results of three rater teams

Raters 1 and 2 (Team 1)
Correlation: .72
Pass/Fail Data: Rater 1: Pass 15, fail 8; Rater 2: Pass 21, fail 2
Third Ratings Required*: 4
Determination of Third Rater: Pass 3, fail 1
Conclusion: Interrater reliability is inadequate. Rater 1 may be scoring too harshly, while Rater 2 may be scoring too leniently.

Raters 3 and 4 (Team 2)
Correlation: .75
Pass/Fail Data: Rater 3: Pass 17, fail 7; Rater 4: Pass 20, fail 4
Third Ratings Required*: 1
Determination of Third Rater: Pass 1, fail 0
Conclusion: Interrater reliability is adequate.

Raters 5 and 6 (Team 3)
Correlation: .89
Pass/Fail Data: Rater 5: Pass 21, fail 2; Rater 6: Pass 21, fail 2
Third Ratings Required*: 0
Determination of Third Rater: n/a
Conclusion: Interrater reliability is good.
Note. In the event the two total scores from the two raters are four or
more points apart, and the average of the two scores is 37 or above
(putting the test candidate within striking distance of passing; the cut
score is 38), a third rating is required. A third rater, who does not see
the scores of the original two raters, is asked to score the candidate's
video-recorded performance.
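For readers who want to run a similar check on their own rating data, here is a minimal sketch in Python (not the code we used, and with hypothetical rater totals) of the two procedures described above: a Pearson correlation between two raters' total scores as the interrater reliability estimate, and the third-rating rule from the note to Table 2.

```python
from statistics import correlation, mean  # correlation requires Python 3.10+

def needs_third_rating(total_a: float, total_b: float) -> bool:
    """Third-rating rule: the two totals are four or more points apart
    and their average is 37 or above (the cut score is 38)."""
    return abs(total_a - total_b) >= 4 and mean([total_a, total_b]) >= 37

# Hypothetical total scores (0-40) for one rater team, one value per candidate.
rater_a_totals = [39, 36, 40, 33, 38, 37]
rater_b_totals = [35, 38, 40, 36, 39, 33]

# Interrater reliability: Pearson correlation of the two raters' totals.
r = correlation(rater_a_totals, rater_b_totals)
print(f"Interrater reliability (Pearson r) = {r:.2f}")

# Flag candidates whose pair of totals triggers a third rating.
flagged = [i for i, (a, b) in enumerate(zip(rater_a_totals, rater_b_totals))
           if needs_third_rating(a, b)]
print(f"Candidates needing a third rating: {flagged}")
```

The 4-point gap, the 37-point average, and the cut score of 38 are the values from our procedure; the totals themselves are invented for illustration.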
We think an interrater reliability of .75 is an acceptable cut
point to determine adequacy of rater agreement. Thus, we conclude that
Raters 1 and 2, with an interrater reliability of .72, may be outliers.
This is not the result we hoped for. We had thought that higher
interrater reliability across all three rater teams might result from
the revised training, which included an additional training session
after the first performance test, greater clarity of descriptors, home
rating work, and rater-led discussion based on the consequential
feedback visuals we displayed.
Authors who specialize in rater training research say little
about dealing constructively with raters who are less responsive to
feedback on their ratings, other than to note that this is an enduring
source of concern for educators who use high-stakes performance tests.
Wang (2010) is an accessible example. We also note, however, that our
policy for third ratings operated well to ensure that ITA candidates who
performed right around the cut point of 38, and who were rated by Raters
1 and 2, were not negatively affected. We further note that, based on
the Mann-Whitney U test analysis we report below, all raters as a group
were internally consistent, meaning that they did not score high or low
capriciously, without a basis in the 10 test criteria.
Mann-Whitney U Test and MANOVA Analyses
In order to learn whether all six raters clearly differentiated
between passing and failing students on the 10 criteria, descriptive
statistics were calculated, along with Mann-Whitney U tests. If raters
clearly differentiated between passing and failing students on all
criteria, this would suggest that the raters not only had clarity on the
criteria but also used the criteria consistently to score ITA candidates. The ratings
from the six raters for all 70 ITA candidates were added together to
create 10 columns, one for each variable criterion (each column had 70
cases, or data points). An 11th column was added, which had the total
score for each candidate along with the tag of “pass” or “fail.” The cut
score for the test had been set at 38 (10 criteria with 4 possible
points for each would add up to 40 as the highest possible score).
According to this cut score, 109 ITA candidates passed and 34 ITA
candidates failed the test. The Mann-Whitney U test compares the scores
of two independent groups to determine whether they differ
statistically; it is a nonparametric alternative to comparing two means. For
instance, on the word-level pronunciation criterion, failing students
received a mean of 3.32 while passing students received a mean of 3.88
(4 being the highest possible score for one criterion). (See Table 3.)
Table 3. Results of Mann-Whitney U test analysis and examination of MANOVA effect sizes

Word-Level Pronunciation: fail (n = 34) M = 3.32, SD = .53; pass (n = 109) M = 3.88, SD = .31; Mann-Whitney U significance = .000*; MANOVA effect size = .29
Word Stress: fail (n = 34) M = 3.70, SD = .46; pass (n = 109) M = 3.92, SD = .26; Mann-Whitney U significance = .001*; MANOVA effect size = .08
Thought Groups: fail (n = 34) M = 3.55, SD = .50; pass (n = 109) M = 3.85, SD = .36; Mann-Whitney U significance = .000*; MANOVA effect size = .09
Grammatical Structures: fail (n = 34) M = 3.74, SD = .45; pass (n = 109) M = 3.89, SD = .31; Mann-Whitney U significance = .027*; MANOVA effect size = .04
Transitional Phrases: fail (n = 34) M = 3.82, SD = .39; pass (n = 109) M = 3.99, SD = .10; Mann-Whitney U significance = .000*; MANOVA effect size = .11
Definitions and Examples: fail (n = 34) M = 3.85, SD = .36; pass (n = 109) M = 4.00, SD = 0; Mann-Whitney U significance = .000*; MANOVA effect size = .12
Prominence: fail (n = 34) M = 3.65, SD = .49; pass (n = 109) M = 3.95, SD = .23; Mann-Whitney U significance = .000*; MANOVA effect size = .15
Audience Noncomprehension Awareness: fail (n = 34) M = 3.68, SD = .47; pass (n = 109) M = 3.96, SD = .19; Mann-Whitney U significance = .000*; MANOVA effect size = .16
Tone Choices: fail (n = 34) M = 3.41, SD = .50; pass (n = 109) M = 3.93, SD = .26; Mann-Whitney U significance = .000*; MANOVA effect size = .31
Handling Questions: fail (n = 34) M = 3.62, SD = .50; pass (n = 109) M = 3.95, SD = .21; Mann-Whitney U significance = .000*; MANOVA effect size = .19
* Reject the null hypothesis (raters’ ratings were
significantly different between fail and pass decisions on the
criterion).
All comparisons were statistically significant. We
interpret this to mean that raters used each criterion systematically
and meaningfully to arrive at pass/fail decisions. The effect sizes from
the MANOVA, also in Table 3, show the meaningfulness of raters’ use of
criteria to pass and fail students. In other words, we can see how much
weight raters as a group gave to a criterion to determine a pass/fail
decision. In the social sciences, anything above a .10 is an effect size
to pay attention to. All but three of the criteria had effect sizes of
.11 or more.
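For illustration, here is a minimal sketch in Python (with invented criterion scores, not our ratings) of the per-criterion comparison summarized in Table 3: a Mann-Whitney U test contrasting failing and passing candidates on one criterion, followed by a simple eta-squared effect size. Our reported effect sizes came from a MANOVA across all 10 criteria, so the univariate eta-squared here is only a stand-in to show how an effect size of this kind is computed.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Invented criterion scores (0-4 scale) for failing and passing candidates.
fail_scores = np.array([3, 3, 4, 3, 2, 4, 3, 3])
pass_scores = np.array([4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 4])

# Mann-Whitney U test: do the two groups' scores on this criterion differ?
u_stat, p_value = mannwhitneyu(fail_scores, pass_scores, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")

# Eta-squared: proportion of variance in the criterion scores explained
# by the pass/fail grouping (between-group SS / total SS).
scores = np.concatenate([fail_scores, pass_scores])
grand_mean = scores.mean()
ss_between = (len(fail_scores) * (fail_scores.mean() - grand_mean) ** 2
              + len(pass_scores) * (pass_scores.mean() - grand_mean) ** 2)
ss_total = ((scores - grand_mean) ** 2).sum()
print(f"Eta-squared = {ss_between / ss_total:.2f}")
```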
Decision Points
Because this is evaluation research, the end result of our work
is to make decision points, which are policy recommendations. They
are:
- Conduct future rater training using the model, continuing
the practice of sending video files and training materials to raters
before the training, using the consequential feedback visual (see our previous
article for the visual and discussion technique), and
rerating video files after further discussion and experience rating a
practice test with new ITA candidates.
- Revise the training descriptors to include specific examples
of observable characteristics for criteria where raters feel uncertain,
such as audience noncomprehension awareness and grammatical
structures.
- Revise the rating procedure to ensure at least two questions are asked of ITAs during the presentation.
- Continue to conduct analyses both during and after the workshop, using
qualitative and quantitative data, and using conventional interrater
reliability estimates in addition to Mann-Whitney U
tests and MANOVA.
- Continue our practice with third ratings. Evidence from the
Mann-Whitney U test and the MANOVA suggests that all raters could
differentiate between passing and failing students on all criteria. This
suggests that our third rating procedure, which focuses on test
candidates who are right at the point of passing and failing, is well
placed.
- Consider providing interrater reliability estimates and
consequential feedback to raters sooner after the test, and in such a
way as to focus on specific criteria on which test candidates are being
assessed too harshly or too leniently.
- Consider predetermining a nonhire decision point for cases in which an
individual rater continues to be an outlier (rating too leniently or too
harshly) and yet seems no longer sensitive to training. This is an
admission that even though rater training helps with clarity about the
criteria, the same rater training may not cause raters to have insights
about what performance deserves a higher or lower grade.
- Decision point #7 suggests a need for a standard-setting
process, by which specific levels of performance are agreed upon and can
be demonstrated with multiple video files.
References
Cizek, G., & Bunch, M. (2007). Standard
setting: A guide to establishing and evaluating performance standards on
tests. Thousand Oaks, CA: Sage.
Gorsuch, G., Florence, R. D., & Griffee, D. T. (2016,
February). Evaluating and improving rater training for ITA performance
tests. ITAIS Newsletter. Retrieved from http://newsmanager.commpartners.com/tesolitais/issues/2016-02-02/3.html
Hambleton, R. (2008). Setting performance standards on
educational assessments and criteria for evaluating the process. In G.
Cizek (Ed.), Setting performance standards (pp.
89–116). New York, NY: Routledge.
Wang, B. (2010). On rater agreement and rater training. English Language Teaching, 3(1), 108–112.
Greta Gorsuch recently reached the milestone of 30
years as an ESL teacher. She has taught in Japan, the United States, and
Vietnam, and thinks she has at least one more overseas sojourn in her.
She is interested in evaluation, materials, testing, audio-supported
reading, listening, and speaking fluency, and has published many
articles and books on these topics. (gretagorsuch.wix.com/mysite)
Dale Griffee is the author of An Introduction to
Second Language Research Methods: Design and Data. He teaches
ESL and edits an in-house journal for ELS Language Centers, which
prevents teacher burnout and promotes curriculum renewal.