A Bayesian Method for the Detection of Item Preknowledge in CAT (CT-98-07)
by Lori D. McLeod, Law School Admission Council; Charles Lewis, Educational Testing Service; and David Thissen, University of North Carolina at Chapel Hill
With the increased use of computerized adaptive testing, which allows for continuous testing, new concerns about test security have emerged, one being the assurance that items in an item pool are safeguarded from theft. As the Law School Admission Council (LSAC) investigates implementing a computerized version of the Law School Admission Test (LSAT), the risks to test security and the tools for protecting test items should be explored. The goals of this study include examining how successfully test takers can inflate their test scores by using item preknowledge and evaluating the feasibility of using an odds ratio index as a tool for test security.
This project used simulations based on results from an operational computerized adaptive test (CAT). The design applied a real-world approach to simulate the "item preknowledge" process by incorporating a two-stage process. First, for each condition, the design sent in n sources to memorize test items from a 28-item test. These n test takers memorized their items perfectly and then combined their lists. (Some overlap was observed among the lists.) The complete list was memorized by another group of test takers, the beneficiaries. Then, the beneficiaries were administered a 28-item test, and if they were administered any of the memorized items, they answered them correctly. (Although we acknowledge that memorizers may not have perfect recall of the item list, the simulation was designed to produce a worst-case scenario for the testing program.)
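The two-stage process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's simulation code: the pool size of 300 items, the random (non-adaptive) item selection, and the fixed probability of a correct answer on unmemorized items are all hypothetical stand-ins. In the actual design, items were chosen by the operational CAT item selection algorithm under content constraints, which affects how much the source lists and the beneficiary's test overlap.

```python
import random

def combine_source_lists(source_tests):
    """Union of the items memorized by all sources (overlap is removed)."""
    memorized = set()
    for items in source_tests:
        memorized |= set(items)
    return memorized

def beneficiary_responses(test_items, memorized, p_correct, rng):
    """Score a beneficiary: memorized items are always answered correctly.

    p_correct is a hypothetical fixed chance of answering an unmemorized
    item correctly; a real simulation would use an IRT model instead.
    """
    responses = []
    for item in test_items:
        if item in memorized:
            responses.append(1)  # perfect recall (worst-case scenario)
        else:
            responses.append(1 if rng.random() < p_correct else 0)
    return responses

rng = random.Random(0)
pool = range(300)  # hypothetical pool size, not from the report
# Stage 1: four sources each see a 28-item test (random selection is an
# assumption standing in for adaptive item selection).
source_tests = [rng.sample(pool, 28) for _ in range(4)]
memorized = combine_source_lists(source_tests)
# Stage 2: a beneficiary takes a 28-item test after memorizing the list.
test_items = rng.sample(pool, 28)
responses = beneficiary_responses(test_items, memorized, 0.5, rng)
```

Because the combined list is a set, items seen by more than one source are counted once, which mirrors the overlap noted in the design.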
Simulated test takers were generated with true scores drawn from a discrete uniform distribution over 11 ability (θ) values. The θ values were translated into number-correct true scores on a linear 60-item reference test and correspond to the operational test's score range (10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 59). Along with varying the proficiency of the sources (50, 55, 59), groups of two, four, and eight sources were simulated for the various memorizing conditions. A control or null condition in which test takers did not have item preknowledge was included. One advantage of using this design is that it depicts a possible reality, especially if recording equipment is used for item theft. Another advantage is that the design maintains the roles of content constraints and the item selection algorithm in the success of using item preknowledge in the CAT environment.
For the memorizing simulees, the mean test score was inflated at every true score. Therefore, we conclude that by using the source-beneficiary preknowledge strategy, the test takers were successful in attaining higher test scores. The estimates were, of course, more inflated when the test takers had memorized the longer lists gathered by eight sources. Even the lower ability test takers in the eight-source condition had an average estimated test score above 40 (out of a possible 60 score points). Also as expected, information from four sources did not deliver as much test score gain as information from eight sources. Similarly, item information from two sources did not aid a beneficiary as much as information from four sources. The estimates were more variable at the lower true scores, where the test takers had more room to benefit from the preknowledge, depending on the peculiarities of item selection. The higher ability test takers did not benefit as much from the memorization because they were already the higher scorers.
An odds ratio procedure to detect test takers using item preknowledge was developed and then evaluated. Specifically, three classes of models were introduced for the probability that an item had been memorized. Based on these models, seven Bayesian indices (FLOR1 through FLOR7) were developed. Results from the simulated CAT data indicated that these indices had the power to detect item preknowledge. Overall, the best performing index of those studied was FLOR7, because it had the most power to detect those test takers who had preknowledge of more than half of the items on their test. FLOR3 was selected as the second best performing index for these successful test takers. This index has the extra appeal of being simple to compute without a previous simulation.
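To make the general idea of a Bayesian odds ratio index concrete, the sketch below updates the log odds that a test taker has preknowledge after each item response. It is a generic illustration of Bayes' rule, not the definition of any of the FLOR indices, which this summary does not spell out. The assumptions are ours: the test taker's probability of a correct answer without preknowledge (`p_honest`) is treated as known for each item, each item has a hypothetical probability `p_memorized` of having been memorized, and a memorized item is always answered correctly.

```python
import math

def final_log_odds(responses, p_honest, p_memorized, prior):
    """Posterior log odds of preknowledge after a sequence of responses.

    responses   : list of 0/1 item scores
    p_honest    : per-item P(correct) without preknowledge (assumed known)
    p_memorized : per-item probability the item was memorized (hypothetical,
                  each value must be below 1)
    prior       : prior probability that the test taker has preknowledge
    """
    log_odds = math.log(prior / (1.0 - prior))
    for u, p, m in zip(responses, p_honest, p_memorized):
        # Under preknowledge: correct if memorized, otherwise the usual chance.
        p_pre = m + (1.0 - m) * p
        if u == 1:
            log_odds += math.log(p_pre / p)
        else:
            log_odds += math.log((1.0 - p_pre) / (1.0 - p))
    return log_odds

# Example call with made-up values: two correct answers on hard items
# and one incorrect answer.
index = final_log_odds([1, 1, 0], [0.2, 0.3, 0.6], [0.5, 0.5, 0.5], 0.1)
```

Under these assumptions, correct answers to items that are hard for the test taker and likely to have been memorized push the log odds up, while incorrect answers push them down. A test taker would be flagged when the final value exceeds a chosen threshold.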