ITAIS Newsletter - July 2016 (Plain Text Version)
WAYS TO EVALUATE RATER TRAINING FOR ITA PERFORMANCE TESTS
Introduction and Description of a Rater Training Design and Evaluation Model

In our previous article, “Evaluating and Improving Rater Training for ITA Performance Tests” (Gorsuch, Florence, & Griffee, 2016), we recounted empirical evidence that led us to believe our rater training for our teaching simulation test (ITA Performance Test V.9) needed to be improved. We believed raters were uncertain about the rating criteria. We also believed that rater background factors (native-speaker vs. nonnative-speaker status, level of education, and level of experience with ITA learner populations), over which we had little control, had undue negative effects on the reliability of our ratings. This seemed a critical issue, as the ITA Performance Test is a high-stakes assessment. Using the literature from mainstream evaluation (e.g., Cizek & Bunch, 2007; Hambleton, 2008) and from our own field of L2 testing and evaluation, we created a proto-model with the following components, listed in roughly chronological order:
We created new training materials that we hoped had “demonstrable clarity,” including a new scoring system that had input from the raters (ITA Performance Test V. 10.1 and ITA Performance V. 10.1 Training Descriptors). We conducted the training using essential features, including repeated and iterative experience with real performance data, various forms of feedback, and rater-directed discussion (for details, see Gorsuch et al., 2016). In this article, we report on the final two components of the model, which together form the evaluation portion of the rater training. In other words, how did we do?

Results of Rater Training Questionnaire

We sent an electronic questionnaire out in the middle of our intensive 3-week workshop in July and August 2015. The ITA Performance Test is given twice during the workshop, once after 5 days as a practice and feedback opportunity, and once again after 10 days as one of the three final measures used to decide whether an ITA candidate is approved to teach. Thus, the questionnaire was sent out 2 or 3 days after raters had rated 70 candidates on the practice ITA Performance Test. We had multiple purposes. First, we wanted to identify any significant questions raters continued to have about the test criteria. Second, we wanted to encourage raters’ cognitive review of the ITA Performance Test criteria (see component #3 of the model). And third, we wanted to evaluate the rater training we had conducted at the beginning of the workshop. Table 1 shows what we asked and what we learned.

Table 1. Results of the rater training questionnaire
Results From Actual Ratings

In accordance with component #5 of our model, four types of analysis were done on raters’ actual ratings from the final presentation test during the workshop. These were (a) interrater reliability correlations within each rater team, (b) a third rating analysis, (c) Mann-Whitney U tests comparing passing and failing candidates on each criterion, and (d) an examination of MANOVA effect sizes.
Interrater Reliability and Third Rating Analysis

To do the correlations, we entered the ITA candidates’ scores on all 10 criteria by rater team (three teams of two raters apiece) and created a total score column for each rater within a team. Then the total scores of the two raters for each team were correlated (see Table 2). Correlation is a standard, must-do, and easy statistical procedure for estimating how close two raters on the same team are to each other. Table 2 shows a gradual increase in reliability from Team One to Team Three. The need for third ratings was based on real-world consequences; in other words, failure on this test meant failing the workshop. The results from a third rater show that Raters 1 and 2 are too far apart.

Table 2. Results of three rater teams
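To make this procedure concrete, the short Python sketch below shows how one team’s interrater reliability might be computed from two raters’ total scores. The candidate totals are hypothetical, and a Pearson correlation is assumed because the article does not name the specific coefficient used.

# A minimal sketch of the interrater reliability check described above.
# The totals are hypothetical, and Pearson correlation is an assumption;
# the original analysis may have used a different coefficient.
from scipy.stats import pearsonr

# Each rater scores a candidate on 10 criteria worth up to 4 points each;
# the criterion scores are summed into a total per rater (maximum 40).
rater1_totals = [38, 35, 40, 32, 39]  # hypothetical totals for five candidates
rater2_totals = [37, 36, 39, 30, 40]  # hypothetical totals for five candidates

r, p = pearsonr(rater1_totals, rater2_totals)
print(f"Interrater reliability for this team: r = {r:.2f} (p = {p:.3f})")

# A correlation below an agreed cut point (e.g., .75) would prompt third
# ratings for candidates whose scores fall near the pass/fail cut score.
if r < 0.75:
    print("Agreement below the cut point; consider third ratings.")

The same computation would be repeated for each of the three rater teams.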
We think an interrater reliability of .75 is an acceptable cut point to determine adequacy of rater agreement. Thus, we conclude that Raters 1 and 2, with an interrater reliability of .72, may be outliers. This is not the result we hoped for. We thought the revised initial training might result in higher interrater reliability across all three rater teams: an additional training session after the first performance test, greater clarity of descriptors, home rating work, and rater-led discussion based on the consequential feedback visuals we displayed. Authors who specialize in rater training research say little about dealing constructively with raters who are less responsive to feedback on their ratings, other than to note that this is an enduring source of concern for educators who use high-stakes performance tests. Wang (2010) is an accessible example. We also note, however, that our policy for third ratings operated well to ensure that ITA candidates who performed right around the cut point of 38, and who were rated by Raters 1 and 2, were not negatively affected. And we also note that, based on the Mann-Whitney U test analysis we report below, all raters as a group were internally consistent, meaning that they did not score high or low out of caprice, without a basis in the 10 test criteria.

Mann-Whitney U Test and MANOVA Analyses

In order to learn whether all six raters clearly differentiated between passing and failing students on the 10 criteria, descriptive statistics were calculated, along with Mann-Whitney U tests. If raters clearly differentiated between passing and failing students on all criteria, this suggests the raters not only had clarity on the criteria but also used them consistently to score ITA candidates. The ratings from the six raters for all 70 ITA candidates were added together to create 10 columns, one for each criterion (each column had 70 cases, or data points). An 11th column was added, which had the total score for each candidate along with the tag of “pass” or “fail.” The cut score for the test had been set at 38 (10 criteria with 4 possible points each add up to 40 as the highest possible score). According to this cut score, 109 ITA candidates passed and 34 ITA candidates failed the test. The Mann-Whitney U test compares the scores of two independent groups to determine whether they differ statistically. For instance, on the word-level pronunciation criterion, failing students got a mean of 3.32 while passing students got a mean of 3.88 (4 being the highest possible score for one criterion). (See Table 3.)

Table 3. Results of Mann-Whitney U Test analysis and examination of MANOVA effect sizes
* Reject the null hypothesis (raters’ ratings were significantly different between fail and pass decisions on the criterion).
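As an illustration of this kind of comparison, the Python sketch below runs a Mann-Whitney U test on hypothetical word-level pronunciation scores for passing and failing candidates, and then computes a rough univariate eta-squared effect size. The scores are invented, and the eta-squared calculation is only an assumed stand-in for the MANOVA-derived effect sizes reported in Table 3, not the authors’ exact procedure.

# A minimal sketch of the per-criterion pass/fail comparison described above.
# The scores are hypothetical, and the univariate eta-squared below is an
# assumed approximation of the MANOVA-based effect sizes in Table 3.
import numpy as np
from scipy.stats import mannwhitneyu

# Word-level pronunciation scores (up to 4 points) for candidates tagged
# "pass" or "fail" by the cut score on the total test.
pass_scores = np.array([4, 4, 3, 4, 4, 3, 4])  # hypothetical
fail_scores = np.array([3, 3, 4, 3, 2, 3])     # hypothetical

u, p = mannwhitneyu(pass_scores, fail_scores, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p:.3f}")
print(f"Pass mean = {pass_scores.mean():.2f}, fail mean = {fail_scores.mean():.2f}")

# Rough univariate effect size: eta squared = SS_between / SS_total.
all_scores = np.concatenate([pass_scores, fail_scores])
grand_mean = all_scores.mean()
ss_between = (len(pass_scores) * (pass_scores.mean() - grand_mean) ** 2
              + len(fail_scores) * (fail_scores.mean() - grand_mean) ** 2)
ss_total = ((all_scores - grand_mean) ** 2).sum()
print(f"Eta squared for this criterion: {ss_between / ss_total:.2f}")

In practice, this comparison would be run separately for each of the 10 criteria.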
All comparisons were statistically significantly different. We interpret this to mean that raters used each criterion systematically and meaningfully to arrive at pass/fail decisions. The effect sizes from the MANOVA, also in Table 3, show the meaningfulness of raters’ use of the criteria to pass and fail students. In other words, we can see how much weight raters as a group gave to a criterion in determining a pass/fail decision. In the social sciences, anything above .10 is an effect size to pay attention to. All but three of the criteria had effect sizes of .11 or more.

Decision Points

Because this is evaluation research, the end result of our work is to make decision points, which are policy recommendations. They are:
References

Cizek, G., & Bunch, M. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage.
Gorsuch, G., Florence, R. D., & Griffee, D. T. (2016, February). Evaluating and improving rater training for ITA performance tests. ITAIS Newsletter. Retrieved from http://newsmanager.commpartners.com/tesolitais/issues/2016-02-02/3.html
Hambleton, R. (2008). Setting performance standards on educational assessments and criteria for evaluating the process. In G. Cizek (Ed.), Setting performance standards (pp. 89–116). New York, NY: Routledge.
Wang, B. (2010). On rater agreement and rater training. English Language Teaching, 3(1), 108–112.

Greta Gorsuch recently reached the milestone of 30 years as an ESL teacher. She has taught in Japan, the United States, and Vietnam, and thinks she has at least one more overseas sojourn in her. She is interested in evaluation, materials, testing, audio-supported reading, listening, and speaking fluency, and has published many articles and books on these topics. (gretagorsuch.wix.com/mysite)

Dale Griffee is the author of An Introduction to Second Language Research Methods: Design and Data. He teaches ESL and edits an in-house journal for ELS Language Centers, which helps prevent teacher burnout and promotes curriculum renewal.