
Greta Gorsuch
Texas Tech University, Lubbock, Texas, USA

R. Dustin Florence
Texas Tech University, Lubbock, Texas, USA

Dale Griffee
Texas Tech University ELS, Lubbock, Texas, USA
Teaching simulation tests are a staple of evaluating international teaching assistant (ITA) readiness to teach in U.S. classrooms. This article focuses on the evaluation of teaching simulation tests, hereafter called performance tests. A performance test is:
a test in which learners demonstrate their ability to engage in a task that closely resembles a likely future assignment and is judged by one or more raters on certain criteria (e.g., language ability and possibly field specific knowledge) each of which has a scale of accomplishment. (Griffee & Gorsuch, 2015, p. 123)
We focus on the raters of performance tests and present a
rationale and model for improving and evaluating rater training.
Rationales for Evaluating Our Rater Training
Since 2000, we have written and revised nine successive versions of an ITA performance test. The revisions were based on our changing understanding of the theoretical construct and on the needs of raters assessing 25 or more ITA teaching simulations at a time. We used Version 9 the longest, and in keeping with good testing practice, we periodically evaluated the validity of the decisions we made with the tests. In one study we found that raters with three different background
factors differentially understood three test criteria: word stress,
prominence (sentence-level stress), and use of transitional phrases
(Gevara, Gorsuch, Almekdash, & Jiang, 2015). The rater factors
were L1 background, education, and experience teaching ITAs (see Figure
1).
In a second study, Florence (2015) found low correlation
coefficients within three teams of raters (two raters per team) on the
prominence (sentence stress) criterion, and moderately low correlation
coefficients within two rater teams on the criteria of transitional
phrases, thought groups, definitions and examples, and word stress (see ITA Performance Test V.9 and V.10.1 for definitions of the criteria). The raters did not agree with each other. In sum, we had reason to believe that our raters understood the constructs of the test differently.
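For readers who want to run a similar check, a within-team agreement screen is simple to script. The sketch below is purely illustrative: the scores are invented, and Pearson's r (via Python's statistics module) is used only as a convenient example of a correlation coefficient; it is not necessarily the statistic reported in Florence (2015).

```python
# Hypothetical within-team agreement check for one criterion (e.g., prominence).
# The scores below are invented for illustration; they are not data from Florence (2015).
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Each list holds one rater's scores for the same set of ITA candidates,
# on test V.9's five-point scale.
rater_a = [5, 3, 2, 4, 3, 2, 1, 4]
rater_b = [2, 4, 3, 2, 2, 5, 3, 1]

r = correlation(rater_a, rater_b)
print(f"Within-team correlation on this criterion: r = {r:.2f}")
# A low or negative r (as with these invented scores) suggests the two raters
# are not interpreting the criterion in the same way.
```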
Our “Rater Realities”
It is difficult to hire the six instructors needed for our
summer workshop (our instructors are also raters for the workshop on
three different assessments). There is no source of year-round
employment for ESL instructors with the expertise we need. While our
raters/instructors have MAs, they still have limited experience with
learners who need their L2s for professional purposes. At the same time,
we have been pressured into hiring native-speaker instructors who have
no experience with our learner population. Given our situation, our
apparent “default” model of rater training is likely inadequate (see
Figure 1).
Figure 1. Rater factors, rater training, and raters’ assessments model
Even if we could adequately model the differential effects of our rater factors (the left-hand element in Figure 1), this does not address the middle element, rater training. It is an opaque blank: as the arrow pointing to the right indicates, the training is simply assumed to change the scores raters award on ITAs’ performances. Further, the model does not suggest how to evaluate rater training. Given this, we felt we needed to revise, improve, and evaluate our rater training. Up to this point, we had used what is likely a standard procedure for rater training in our field (Florence, 2015):
- Showing raters recordings of previous ITA performances on the presentation test.
- Having raters give ratings and discuss them.
- Having raters explain their rationales for their ratings.
- Having ITA directors or coordinators give advice and feedback on how to better understand the constructs measured by the test (the criteria).
That said, we spent about two days on rater training, and that time covered three separate tests. The question facing us, then, was: How do we move to a new rater training procedure and training evaluation model?
A New Rater Training Procedure and Rater Training Evaluation Model
We turned to the L2 testing and mainstream assessment fields (Cizek & Bunch, 2007; Hambleton, 2008) and arrived at a new proto-model (Figure 2), with the cells in rough left-to-right order for planning rater training. The model also serves as a framework for evaluating rater training: at each step we can pose the questions “Did we do this?” and “How successfully did we do this?” This article focuses on the first and third cells.
Figure 2. Rater training procedure and rater training evaluation model
Demonstratable Clarity of the Test Constructs
This is probably the most overlooked, yet most essential, component of establishing validity for a test. To us, “demonstratable” meant “easy to teach to others.” We realized our training materials were likely inadequate and that better materials would strengthen our rater training. We maintained the 10 evaluation criteria of test V.9, as we remained satisfied with the constructs they capture (ITA Performance Test V.9). At the same time, we reduced test V.9’s five-point Likert scale for each of the 10 criteria to four categories and wrote detailed descriptors for each category. In the meantime, we had hired the six instructors/raters for the workshop, and we wanted to involve them in the training early on. We also wanted to create names for the four categories of each criterion that would be strongly related to the standard we wished to set for a minimally acceptable performance at a given level. Using an online survey tool, we queried instructors and TAs on several configurations of category names and arrived at:
- Sustainably fluent and communicative in the classroom.
- Intermediate in classroom communication.
- Beginner in classroom communication.
- Pre-functional in classroom communication.
We wrote long descriptors for training purposes and extracted
shorter forms of the descriptors for the test instrument that raters
would use while ITAs gave their teaching simulations. We also wrote
training notes for seven of the 10 criteria to address two needs: (1) to probe the reasons for specific test candidates’ performance characteristics, and (2) to address the scoring implications of
compensatory techniques candidates might use (ITA Performance Test
V.10.1 and ITA Performance Test V.10.1 Criteria and Category
Descriptors). We wrote a scoring procedure
for when raters are actually scoring a teaching simulation (ITA
Performance Test V.10 Rating Procedures). Finally,
we created a training DVD for raters. The DVD included three video files with teaching simulations of three ITAs (an East Asian female speaker, an East Asian male speaker, and an Indian subcontinent female speaker).
Rater Training Has Essential Features
The DVD, along with a rater-training packet containing the materials described above, was given to the raters before the training session. We assigned each rater a numerical code (e.g., “Rater 1”) to use throughout the workshop. We intended the packets to serve as a means of rater self-regulation and self-discovery throughout the workshop. We asked the raters to read through the materials and watch the three video files before the training. On day one, we asked each participant to read the training descriptors aloud and opened the floor to discussion and rater paraphrasing. The test instrument was presented, and the rater procedures were read aloud by participants. The raters were then asked to view the three video files on the rater training DVD overnight and to send in their scores for each of the 10 criteria, plus an overall judgment of whether the candidate should be approved to teach. This information was compiled for day two and presented to the training participants in a grid, which appears in partial form here (there were vertical spaces for all six raters and four TAs; Figure 3):
Figure 3 shows the grid we used to provide “consequential feedback” (Cizek & Bunch, 2007) to the raters for the East Asian female speaker. The rater codes let each rater identify their own scores and see, without self-consciousness, how those scores compared to the ratings awarded by others on all criteria.
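Compiling the overnight scores into such a grid is straightforward to script. The sketch below is a hypothetical illustration only: the rater codes follow the “Rater 1” convention and the criterion names are among those mentioned earlier (word stress, prominence, transitional phrases), but the score values, the pandas-based layout, and the sketch itself are assumptions and do not reproduce Figure 3.

```python
# Hypothetical sketch: compiling raters' overnight scores into a
# rater-by-criterion grid for consequential feedback (cf. Figure 3).
# Scores and the selection of criteria are placeholders, not actual training data.
import pandas as pd

submissions = [
    # (rater code, criterion, score on the four-category scale)
    ("Rater 1", "Word stress", 3),
    ("Rater 1", "Prominence", 2),
    ("Rater 1", "Transitional phrases", 4),
    ("Rater 2", "Word stress", 4),
    ("Rater 2", "Prominence", 3),
    ("Rater 2", "Transitional phrases", 4),
    ("Rater 3", "Word stress", 2),
    ("Rater 3", "Prominence", 2),
    ("Rater 3", "Transitional phrases", 3),
]

df = pd.DataFrame(submissions, columns=["rater", "criterion", "score"])

# One row per rater code, one column per criterion, so each rater can see
# at a glance how their scores compare with everyone else's.
grid = df.pivot(index="rater", columns="criterion", values="score")
print(grid)
```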
The second day was spent looking at the grid, letting the raters mull over the data at length, and waiting for raters’ questions. We used those questions to focus on specifics of the long training descriptors and allowed more experienced raters to offer elaborations and paraphrases of the descriptors. We also took notes on wording changes to the descriptors. Finally, we reviewed the scoring procedure and rated one of the video files again, both to reinforce the scoring procedures and to encourage re-viewing of the video files.
Conclusion
In sum, we worked up a rater training and evaluation model based on the L2 testing and mainstream assessment literature to address apparent weaknesses in our rater training. We found that having a model for planning and evaluating rater training greatly facilitated our work. The model helped us determine the content and scope of what we needed to do. We do not have space here to discuss an important part of this model: evaluating the training both during and after the workshop. Please come to our presentation at TESOL 2016 in Baltimore for procedural details (Wednesday, 6 April, 11:30 am–12:15 pm, Peale Room, Hilton Baltimore).
References
Cizek, G., & Bunch, M. (2007). Standard
setting: A guide to establishing and evaluating performance standards on
tests. Thousand Oaks, CA: Sage.
Florence, D. (2015). ITA performance test inter-rater reliability. Unpublished
manuscript.
Gevara, J. R., Gorsuch, G., Almekdash, H., & Jiang, W.
(2015). Native and non-native English speaking ITA performance test
raters: Do they rate ITA candidates differently? In G. Gorsuch (Ed.), Talking matters: Research on talk and communication of
international teaching assistants (pp. 313–346). Stillwater, OK: New Forums Press.
Griffee, D. T., & Gorsuch, G. (2015). Second
language testing for student evaluation and classroom
research. Manuscript submitted for publication. [Available
from authors: dale.griffee@att.net; greta.gorsuch@ttu.edu]
Hambleton, R. (2008). Setting performance standards on
educational assessments and criteria for evaluating the process. In G.
Cizek (Ed.), Setting performance standards (pp.
89–116). New York, NY: Routledge.
Greta Gorsuch has been teaching ESL and EFL for 30
years. She is interested in materials, testing, reading, listening,
fluency development for speaking and reading, and the professional
development of teachers in all fields at all career stages.
Dustin Florence has been teaching EFL for 15 years. He
is interested in literacy development in L2 learners, development of
fluency in reading and speaking, and teacher
training.
Dale Griffee is interested in classroom testing and
evaluation, and especially in how research methods can be applied to
classroom research.