ITAIS Newsletter - July 2016 (Plain Text Version)
WAYS TO EVALUATE RATER TRAINING FOR ITA PERFORMANCE TESTS
Introduction and Description of a Rater Training Design and Evaluation Model

In our previous article, “Evaluating and Improving Rater Training for ITA Performance Tests” (Gorsuch, Florence, & Griffee, 2016), we recounted empirical evidence that led us to believe our rater training for our teaching simulation test (ITA Performance Test V.9) needed to be improved. We believed raters were uncertain about the rating criteria. We also believed that rater background factors (native-speaker vs. nonnative-speaker status, level of education, and level of experience with ITA learner populations), over which we had little control, had undue negative effects on the reliability of our ratings. This seemed a critical issue, as the ITA Performance Test is a high-stakes assessment. Using the literature from mainstream evaluation (e.g., Cizek & Bunch, 2007; Hambleton, 2008) and from our own field of L2 testing and evaluation, we created a proto-model with the following components, listed in roughly chronological order:
We created new training materials that we hoped had “demonstrable clarity,” including a new scoring system that had input from the raters (ITA Performance Test V. 10.1 and ITA Performance V. 10.1 Training Descriptors). We conducted the training using essential features, including repeated and iterative experience with real performance data, various forms of feedback, and rater-directed discussion (for details, see Gorsuch et al., 2016). In this article, we report on the final two components of the model, which together form the evaluation portion of the rater training. In other words, how did we do?

Results of Rater Training Questionnaire

We sent an electronic questionnaire out in the middle of our intensive 3-week workshop in July and August 2015. The ITA Performance Test is given twice during the workshop, once after 5 days as a practice and feedback opportunity, and once again after 10 days as one of the three final measures used to decide whether an ITA candidate is approved to teach. Thus, the questionnaire was sent out 2 or 3 days after raters had rated 70 candidates on the practice ITA Performance Test. We had multiple purposes. First, we wanted to identify any significant questions raters continued to have about the test criteria. Second, we wanted to encourage raters’ cognitive review of the ITA Performance Test criteria (see component #3 of the model). And third, we wanted to evaluate the rater training we had conducted at the beginning of the workshop. Table 1 shows what we asked and what we learned.

Table 1. Results of the rater training questionnaire
Results From Actual Ratings

In accordance with component #5 of our model, four types of analysis were done on raters’ actual ratings from the final presentation test during the workshop. These were (a) interrater reliability correlations within each rater team, (b) a third rating analysis, (c) Mann-Whitney U tests comparing passing and failing candidates on each criterion, and (d) an examination of MANOVA effect sizes.
Interrater Reliability and Third Rating Analysis

To do the correlations, we entered the ITA candidates’ scores on all 10 criteria by rater team (three teams of two raters apiece) and created a total score column for each rater within a team. Then the total scores of the two raters for each team were correlated (see Table 2). Correlation is a standard, must-do, and easy statistical procedure for estimating how close two raters on the same team are to each other. Table 2 shows a gradual increase in reliability from Team One to Team Three. The need for third ratings was based on real-world consequences; in other words, failure on this test meant failing the workshop. The results from a third rater show that Raters 1 and 2 are too far apart.

Table 2. Results of three rater teams
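To make this procedure concrete, the short Python sketch below shows how one team’s interrater reliability might be computed from two raters’ total scores. The candidate totals are hypothetical, and a Pearson correlation is assumed because the article does not name the specific coefficient used.

# A minimal sketch of the interrater reliability check described above.
# The totals are hypothetical, and Pearson correlation is an assumption;
# the original analysis may have used a different coefficient.
from scipy.stats import pearsonr

# Each rater scores a candidate on 10 criteria worth up to 4 points each;
# the criterion scores are summed into a total per rater (maximum 40).
rater1_totals = [38, 35, 40, 32, 39]  # hypothetical totals for five candidates
rater2_totals = [37, 36, 39, 30, 40]  # hypothetical totals for five candidates

r, p = pearsonr(rater1_totals, rater2_totals)
print(f"Interrater reliability for this team: r = {r:.2f} (p = {p:.3f})")

# A correlation below an agreed cut point (e.g., .75) would prompt third
# ratings for candidates whose scores fall near the pass/fail cut score.
if r < 0.75:
    print("Agreement below the cut point; consider third ratings.")

The same computation would be repeated for each of the three rater teams.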
We think an interrater reliability of .75 is an acceptable cut point to determine adequacy of rater agreement. Thus, we conclude that Raters 1 and 2, with an interrater reliability of .72, may be outliers. This is not the result we hoped for. We thought the revised initial training might result in higher interrater reliability across all three rater teams: an additional training session after the first performance test, greater clarity of descriptors, home rating work, and rater-led discussion based on the consequential feedback visuals we displayed. Authors who specialize in rater training research say little about dealing constructively with raters who are less responsive to feedback on their ratings, other than to note that this is an enduring source of concern for educators who use high-stakes performance tests. Wang (2010) is an accessible example. We also note, however, that our policy for third ratings operated well to ensure that ITA candidates who performed right around the cut point of 38, and who were rated by Raters 1 and 2, were not negatively affected. And we also note that, based on the Mann-Whitney U test analysis we report below, all raters as a group were internally consistent, meaning that they did not score high or low out of caprice, without a basis in the 10 test criteria.

Mann-Whitney U Test and MANOVA Analyses

In order to learn whether all six raters clearly differentiated between passing and failing students on the 10 criteria, descriptive statistics were calculated, along with Mann-Whitney U tests. If raters clearly differentiated between passing and failing students on all criteria, this suggests the raters not only had clarity on the criteria but also used them consistently to score ITA candidates. The ratings from the six raters for all 70 ITA candidates were added together to create 10 columns, one for each criterion (each column had 70 cases, or data points). An 11th column was added, which had the total score for each candidate along with the tag of “pass” or “fail.” The cut score for the test had been set at 38 (10 criteria with 4 possible points each add up to 40 as the highest possible score). According to this cut score, 109 ITA candidates passed and 34 ITA candidates failed the test. The Mann-Whitney U test compares the scores of two independent groups to determine whether they differ statistically. For instance, on the word-level pronunciation criterion, failing students got a mean of 3.32 while passing students got a mean of 3.88 (4 being the highest possible score for one criterion). (See Table 3.)

Table 3. Results of Mann-Whitney U Test analysis and examination of MANOVA effect sizes
* Reject the null hypothesis (raters’ ratings were significantly different between fail and pass decisions on the criterion).
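As an illustration of this kind of comparison, the Python sketch below runs a Mann-Whitney U test on hypothetical word-level pronunciation scores for passing and failing candidates, and then computes a rough univariate eta-squared effect size. The scores are invented, and the eta-squared calculation is only an assumed stand-in for the MANOVA-derived effect sizes reported in Table 3, not the authors’ exact procedure.

# A minimal sketch of the per-criterion pass/fail comparison described above.
# The scores are hypothetical, and the univariate eta-squared below is an
# assumed approximation of the MANOVA-based effect sizes in Table 3.
import numpy as np
from scipy.stats import mannwhitneyu

# Word-level pronunciation scores (up to 4 points) for candidates tagged
# "pass" or "fail" by the cut score on the total test.
pass_scores = np.array([4, 4, 3, 4, 4, 3, 4])  # hypothetical
fail_scores = np.array([3, 3, 4, 3, 2, 3])     # hypothetical

u, p = mannwhitneyu(pass_scores, fail_scores, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p:.3f}")
print(f"Pass mean = {pass_scores.mean():.2f}, fail mean = {fail_scores.mean():.2f}")

# Rough univariate effect size: eta squared = SS_between / SS_total.
all_scores = np.concatenate([pass_scores, fail_scores])
grand_mean = all_scores.mean()
ss_between = (len(pass_scores) * (pass_scores.mean() - grand_mean) ** 2
              + len(fail_scores) * (fail_scores.mean() - grand_mean) ** 2)
ss_total = ((all_scores - grand_mean) ** 2).sum()
print(f"Eta squared for this criterion: {ss_between / ss_total:.2f}")

In practice, this comparison would be run separately for each of the 10 criteria.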
All comparisons were statistically significantly different. We interpret this to mean that raters used each criterion systematically and meaningfully to arrive at pass/fail decisions. The effect sizes from the MANOVA, also in Table 3, show the meaningfulness of raters’ use of the criteria to pass and fail students. In other words, we can see how much weight raters as a group gave to a criterion in determining a pass/fail decision. In the social sciences, anything above .10 is an effect size to pay attention to. All but three of the criteria had effect sizes of .11 or more.

Decision Points

Because this is evaluation research, the end result of our work is to make decision points, which are policy recommendations. They are:
References

Cizek, G., & Bunch, M. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage.
Gorsuch, G., Florence, R. D., & Griffee, D. T. (2016, February). Evaluating and improving rater training for ITA performance tests. ITAIS Newsletter. Retrieved from http://newsmanager.commpartners.com/tesolitais/issues/2016-02-02/3.html
Hambleton, R. (2008). Setting performance standards on educational assessments and criteria for evaluating the process. In G. Cizek (Ed.), Setting performance standards (pp. 89–116). New York, NY: Routledge.
Wang, B. (2010). On rater agreement and rater training. English Language Teaching, 3(1), 108–112.

Greta Gorsuch recently reached the milestone of 30 years as an ESL teacher. She has taught in Japan, the United States, and Vietnam, and thinks she has at least one more overseas sojourn in her. She is interested in evaluation, materials, testing, audio-supported reading, listening, and speaking fluency, and has published many articles and books on these topics. (gretagorsuch.wix.com/mysite)

Dale Griffee is the author of An Introduction to Second Language Research Methods: Design and Data. He teaches ESL and edits an in-house journal for ELS Language Centers, which helps prevent teacher burnout and promotes curriculum renewal.