
Greta Gorsuch
Texas Tech University, Lubbock, Texas, USA

R. Dustin Florence
Texas Tech University, Lubbock, Texas, USA

Dale Griffee
Texas Tech University ELS, Lubbock, Texas, USA
Teaching simulation tests are a staple of evaluating international teaching assistant (ITA) readiness to teach in U.S. classrooms. This article focuses on the evaluation of teaching simulation tests, hereafter called performance tests. A performance test is:
a test in which learners demonstrate their ability to engage in a task that closely resembles a likely future assignment and is judged by one or more raters on certain criteria (e.g., language ability and possibly field specific knowledge) each of which has a scale of accomplishment. (Griffee & Gorsuch, 2015, p. 123)
We focus on the raters of performance tests and present a
rationale and model for improving and evaluating rater training.
Rationales for Evaluating Our Rater Training
Since 2000, we have written and revised nine successive versions of an ITA performance test. The revisions were based on our changing understanding of the theoretical construct and on the needs of raters assessing 25 or more ITA teaching simulations at a time. We used Version 9 the longest, and in keeping with good testing practice, we periodically evaluated the validity of the decisions we made with the tests. In one study we found that raters with three different background
factors differentially understood three test criteria: word stress,
prominence (sentence-level stress), and use of transitional phrases
(Gevara, Gorsuch, Almekdash, & Jiang, 2015). The rater factors
were L1 background, education, and experience teaching ITAs (see Figure
1).
In a second study, Florence (2015) found low correlation
coefficients within three teams of raters (two raters per team) on the
prominence (sentence stress) criterion, and moderately low correlation
coefficients within two rater teams on the criteria of transitional
phrases, thought groups, definitions and examples, and word stress (see ITA Performance Test V.9 and V.10.1 for definitions of the criteria). The raters did not agree with each other. In sum, we had reason to believe that our raters understood the constructs of the test differently.
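For readers who want to run a similar check, a within-team agreement screen is simple to script. The sketch below is purely illustrative: the scores are invented, and Pearson's r (via Python's statistics module) is used only as a convenient example of a correlation coefficient; it is not necessarily the statistic reported in Florence (2015).

```python
# Hypothetical within-team agreement check for one criterion (e.g., prominence).
# The scores below are invented for illustration; they are not data from Florence (2015).
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Each list holds one rater's scores for the same set of ITA candidates,
# on test V.9's five-point scale.
rater_a = [5, 3, 2, 4, 3, 2, 1, 4]
rater_b = [2, 4, 3, 2, 2, 5, 3, 1]

r = correlation(rater_a, rater_b)
print(f"Within-team correlation on this criterion: r = {r:.2f}")
# A low or negative r (as with these invented scores) suggests the two raters
# are not interpreting the criterion in the same way.
```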
Our “Rater Realities”
It is difficult to hire the six instructors needed for our
summer workshop (our instructors are also raters for the workshop on
three different assessments). There is no source of year-round
employment for ESL instructors with the expertise we need. While our
raters/instructors have MAs, they still have limited experience with
learners who need their L2s for professional purposes. At the same time,
we have been pressured into hiring native-speaker instructors who have
no experience with our learner population. Given our situation, our
apparent “default” model of rater training is likely inadequate (see
Figure 1).
Figure 1. Rater factors, rater training, and raters’ assessments model
Even if we could adequately model the differential effects of our rater factors (the left-hand element in Figure 1), this does not address the middle element, rater training. It is an opaque blank: as the arrow pointing to the right indicates, the training is simply assumed to change the scores raters award on ITAs’ performances. Further, the model does not suggest how to evaluate rater training. Given this, we felt we needed to revise, improve, and evaluate our rater training. Up to this point, we had used what is likely a standard procedure for rater training in our field (Florence, 2015):
- Showing raters recordings of previous ITA performances on the presentation test.
- Having raters give ratings and discuss them.
- Having raters explain their rationales for their ratings.
- Having ITA directors or coordinators give advice and feedback on how to better understand the constructs measured by the test (the criteria).
That said, we spent about two days on rater training, and that time covered three separate tests. The question facing us, then, was: How do we move to a new rater training procedure and training evaluation model?
A New Rater Training Procedure and Rater Training Evaluation Model
We turned to the L2 testing and mainstream assessment fields (Cizek & Bunch, 2007; Hambleton, 2008) and arrived at a new proto-model (Figure 2), with the cells in rough left-to-right order for planning rater training. The model also serves as a framework for evaluating rater training: at each step we can pose the questions “Did we do this?” and “How successfully did we do this?” This article focuses on the first and third cells.
Figure 2. Rater training procedure and rater training evaluation model
Demonstratable Clarity of the Test Constructs
This is probably the most overlooked, yet most essential, component of establishing validity for a test. To us, “demonstratable” meant “easy to teach to others.” We realized our training materials were likely inadequate and that better materials would strengthen our rater training. We maintained the 10 evaluation criteria of test V.9, as we remained satisfied with the constructs they capture (ITA Performance Test V.9). At the same time, we reduced test V.9’s five-point Likert scale for each of the 10 criteria to four categories and wrote detailed descriptors for each category. In the meantime, we had hired the six instructors/raters for the workshop, and we wanted to involve them in the training early on. We also wanted to create names for the four categories of each criterion that would be strongly related to the standard we wished to set for a minimally acceptable performance at a given level. Using an online survey tool, we queried instructors and TAs on several configurations of category names and arrived at:
- Sustainably fluent and communicative in the classroom.
- Intermediate in classroom communication.
- Beginner in classroom communication.
- Pre-functional in classroom communication.
We wrote long descriptors for training purposes and extracted
shorter forms of the descriptors for the test instrument that raters
would use while ITAs gave their teaching simulations. We also wrote
training notes for seven of the 10 criteria to address two needs: (1) to probe the reasons for specific test candidates’ performance characteristics, and (2) to address the scoring implications of
compensatory techniques candidates might use (ITA Performance Test
V.10.1 and ITA Performance Test V.10.1 Criteria and Category
Descriptors). We wrote a scoring procedure
for when raters are actually scoring a teaching simulation (ITA
Performance Test V.10 Rating Procedures). Finally,
we created a training DVD for raters. The DVD included three video files with teaching simulations of three ITAs (an East Asian female speaker, an East Asian male speaker, and an Indian subcontinent female speaker).
Rater Training Has Essential Features
The DVD, along with a rater-training packet containing the materials described above, was given to the raters before the training session. We assigned each rater a numerical code (e.g., “Rater 1”) to use throughout the workshop. We intended the packets to serve as a means of rater self-regulation and self-discovery throughout the workshop. We asked the raters to read through the materials and watch the three video files before the training. On day one, we asked each participant to read the training descriptors aloud and opened the floor to discussion and rater paraphrasing. The test instrument was presented, and the rater procedures were read aloud by participants. The raters were then asked to view the three video files on the rater training DVD overnight and to send in their scores for each of the 10 criteria, plus an overall judgment of whether the candidate should be approved to teach. This information was compiled for day two and presented to the training participants in a grid, which appears in partial form here (there were vertical spaces for all six raters and four TAs; Figure 3):
Figure 3 shows the grid we used to provide “consequential feedback” (Cizek & Bunch, 2007) to the raters for the East Asian female speaker. The rater codes let each rater identify their own scores and see, without self-consciousness, how those scores compared to the ratings awarded by others on all criteria.
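Compiling the overnight scores into such a grid is straightforward to script. The sketch below is a hypothetical illustration only: the rater codes follow the “Rater 1” convention and the criterion names are among those mentioned earlier (word stress, prominence, transitional phrases), but the score values, the pandas-based layout, and the sketch itself are assumptions and do not reproduce Figure 3.

```python
# Hypothetical sketch: compiling raters' overnight scores into a
# rater-by-criterion grid for consequential feedback (cf. Figure 3).
# Scores and the selection of criteria are placeholders, not actual training data.
import pandas as pd

submissions = [
    # (rater code, criterion, score on the four-category scale)
    ("Rater 1", "Word stress", 3),
    ("Rater 1", "Prominence", 2),
    ("Rater 1", "Transitional phrases", 4),
    ("Rater 2", "Word stress", 4),
    ("Rater 2", "Prominence", 3),
    ("Rater 2", "Transitional phrases", 4),
    ("Rater 3", "Word stress", 2),
    ("Rater 3", "Prominence", 2),
    ("Rater 3", "Transitional phrases", 3),
]

df = pd.DataFrame(submissions, columns=["rater", "criterion", "score"])

# One row per rater code, one column per criterion, so each rater can see
# at a glance how their scores compare with everyone else's.
grid = df.pivot(index="rater", columns="criterion", values="score")
print(grid)
```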
The second day was spent looking at the grid, letting the raters mull over the data at length, and waiting for raters’ questions. We used those questions to focus on specifics of the long training descriptors and allowed more experienced raters to offer elaborations and paraphrases of the descriptors. We also took notes on wording changes to the descriptors. Finally, we reviewed the scoring procedure and rated one of the video files again, both to reinforce the scoring procedures and to encourage re-viewing of the video files.
Conclusion
In sum, we worked up a rater training and evaluation model based on the L2 testing and mainstream assessment literature to address apparent weaknesses in our rater training. We found that having a model for planning and evaluating rater training greatly facilitated our work. The model helped us determine the content and scope of what we needed to do. We do not have space here to discuss an important part of this model: evaluating the training both during and after the workshop. Please come to our presentation at TESOL 2016 in Baltimore for procedural details (Wednesday, 6 April, 11:30 am–12:15 pm, Peale Room, Hilton Baltimore).
References
Cizek, G., & Bunch, M. (2007). Standard
setting: A guide to establishing and evaluating performance standards on
tests. Thousand Oaks, CA: Sage.
Florence, D. (2015). ITA performance test inter-rater reliability. Unpublished
manuscript.
Gevara, J. R., Gorsuch, G., Almekdash, H., & Jiang, W.
(2015). Native and non-native English speaking ITA performance test
raters: Do they rate ITA candidates differently? In G. Gorsuch (Ed.), Talking matters: Research on talk and communication of
international teaching assistants (pp. 313–346). Stillwater, OK: New Forums Press.
Griffee, D. T., & Gorsuch, G. (2015). Second
language testing for student evaluation and classroom
research. Manuscript submitted for publication. [Available
from authors: dale.griffee@att.net; greta.gorsuch@ttu.edu]
Hambleton, R. (2008). Setting performance standards on
educational assessments and criteria for evaluating the process. In G.
Cizek (Ed.), Setting performance standards (pp.
89–116). New York, NY: Routledge.
Greta Gorsuch has been teaching ESL and EFL for 30
years. She is interested in materials, testing, reading, listening,
fluency development for speaking and reading, and the professional
development of teachers in all fields at all career stages.
Dustin Florence has been teaching EFL for 15 years. He
is interested in literacy development in L2 learners, development of
fluency in reading and speaking, and teacher
training.
Dale Griffee is interested in classroom testing and
evaluation, and especially in how research methods can be applied to
classroom research.