Anchor-Based Methods for Judgmentally Estimating Item Difficulty Parameters (CT-98-05)
by Ronald K. Hambleton, Stephen G. Sireci, H. Swaminathan, Dehui Xing, and Saba Rizavi, University of Massachusetts at Amherst

Executive Summary

The Law School Admission Council (LSAC) is currently investigating the feasibility of computerized adaptive testing. One of the important lessons learned from the recent controversy between the Graduate Record Examination Board and Stanley H. Kaplan Educational Centers Ltd. is that large item banks will be needed to support computerized adaptive testing.

One dilemma for the LSAC, if computerized adaptive testing is adopted, is as follows: large samples of test takers are highly desirable when field testing new items, because large samples lead to precise estimates of the item statistics that are central to implementing a computerized adaptive test; at the same time, large samples increase item exposure, which can lead to a loss of item security.

One promising idea for reducing the required size of test taker samples in item statistics estimation is to combine the judgments of test specialists about the item statistics with actual field-test data in a Bayesian item parameter estimation procedure, where the information provided by the test specialists serves as a prior distribution. Test taker sample sizes can be reduced because the information lost through smaller samples is replaced by information about the item statistics supplied by test specialists or other persons with knowledge of the test items. A Bayesian item parameter estimation procedure is simply a formal statistical way of combining the information provided by the test takers with the information provided by the test specialists to arrive at the item statistics.
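
To make the Bayesian idea concrete, the short sketch below (in Python) shows one simple way judgmental and field-test information could be combined, assuming item difficulty is expressed as a classical proportion-correct (p) value and the specialists' judgment is encoded as a Beta prior. It is only an illustration of the general logic under those assumptions, not the estimation procedure developed in this report; the function name and the prior-weight constant are hypothetical.

    # Minimal sketch: combine a judged p-value (prior) with field-test data.
    # prior_n is a hypothetical constant expressing how many examinees' worth
    # of information the specialists' judgment is treated as carrying.

    def posterior_difficulty(judged_p, prior_n, n_correct, n_tried):
        """Posterior mean of an item's proportion-correct value."""
        alpha = judged_p * prior_n            # Beta prior "successes"
        beta = (1.0 - judged_p) * prior_n     # Beta prior "failures"
        # Conjugate Beta-binomial update with the observed responses
        return (alpha + n_correct) / (alpha + beta + n_tried)

    # Example: judges rate the item as moderately hard (p about 0.55), their
    # judgment is weighted like 50 examinees, and 100 field-test examinees
    # attempt the item, 48 of them answering correctly.
    print(posterior_difficulty(0.55, 50, 48, 100))   # about 0.50

The point of the illustration is simply that the judged value stands in for some of the examinee data, so fewer field-test responses are needed to reach a given level of precision.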

The purposes of this research study were to develop and field test anchor-based judgmental methods for enabling test specialists to estimate item difficulty statistics. The basic idea of anchor-based methods is that test specialists are provided with a frame of reference for making their judgments: descriptions of items at three levels of item difficulty, and/or many previously calibrated test items along with their item statistics. The task of judging the difficulties of new items amounts to matching the new items to those already calibrated and determining where the new items seem to fit in terms of their difficulty.

The study consisted of three related field tests. In each field test we worked with six Law School Admission Test (LSAT) test specialists and one or more of the LSAT subtests. In the first, we worked with both reading comprehension and logical reasoning; in the second, analytical reasoning; and in the third, analytical reasoning and reading comprehension. In the typical field test, panelists were trained in one of the variations of an anchor-based method for judging item difficulty statistics and were then given test items to judge. In addition, panelists were informed during training about the factors that typically influence the difficulty of test items: the amount of negation, sentence and paragraph length, the location of relevant text, the presence of effective distractors in the item, the novelty of the problem, and so on. Because we believed strongly that discussion among test specialists was valuable in the judgmental process, discussions of the test specialists' estimates always took place after panelists provided their independent ratings, and panelists were then given the opportunity to revise their ratings if they felt they could improve the estimates.

The three field tests were simply that: an opportunity to try out methods, receive feedback from test specialists on their likes and dislikes about the process, and collect ratings data that could be thoroughly analyzed. Because the items being judged by the test specialists had been used in previous administrations of the LSAT, their item statistics were known to the researchers and could be used to evaluate both the individual test specialist ratings and the first and second group estimates of the item statistics.

The three field tests produced a number of conclusions. A considerable amount was learned about the process of extracting test specialists' estimates of item difficulty. The ratings took considerably longer to obtain than had been expected; training, initial ratings, and discussion all required substantial time. For example, six hours might be needed to train test specialists and to obtain ratings on 20 test items.

Test specialists felt they could be trained to estimate item difficulty accurately, and, to some extent, they demonstrated this. The average error in the estimates of item difficulty varied from about 11% to 13%. Also, the discussions were popular with the panelists and almost always resulted in improved item difficulty estimates.
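
As a point of reference, an average-error figure of this kind can be read as the mean absolute difference between the judged and the actual difficulty values, expressed in percentage points on the proportion-correct scale. The short sketch below illustrates that calculation with invented numbers; the report does not state its exact formula, so this is an assumption about how such a summary would typically be computed.

    # Hypothetical judged and actual p-values for four items (invented data),
    # used only to show how a mean absolute error in percent could be computed.
    judged = [0.62, 0.48, 0.75, 0.55]   # specialists' group estimates
    actual = [0.70, 0.42, 0.80, 0.60]   # known values from prior LSAT use

    mean_abs_error = sum(abs(j - a) for j, a in zip(judged, actual)) / len(judged)
    print(f"average error: {100 * mean_abs_error:.1f}%")   # 6.0% for these numbers
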
We began the study thinking there were at least two frameworks we might provide to test specialists to improve their item difficulty estimates. By the end of the study, the two methods had merged into one. Test specialists seemed to benefit both from the descriptions of items located at three levels of difficulty and from information about the item statistics of many previously calibrated items. Any future work in this area should probably combine both methods.

We completed the project with the feeling that the results were encouraging but would be better with improved training. What we learned was that the test specialists had developed considerable insight of their own into what makes items hard or easy; therefore, they could be used more effectively than they were in this study to develop the descriptions of items at the three levels of item difficulty and to develop rules for judging test items. With more research and development, we could see a training program for test specialists emerge that would prepare them for judging item statistics as a regular part of their work.

This program would build not only on some of the research carried out in this study and the relevant literature, but also on the insights and experiences of the test specialists themselves. We also share the test specialists' view that there would be some general principles for judging item difficulty as well as specific principles linked to the three major areas covered by the test. At least some of the principles for judging item difficulty are specific to the reading comprehension, analytical reasoning, and logical reasoning subtests.
 
