The Use of Person-Fit Statistics in Computerized Adaptive Testing (CT-97-14)
Rob R. Meijer and Edith M.L.A. van Krimpen-Stoop, University of Twente, Enschede, The Netherlands
In computerized adaptive testing (CAT), it is often assumed that an item response theory (IRT) model adequately describes the response behavior of the test takers. IRT is a mathematical theory that models the probability that a test taker at a given ability level will answer a particular test item correctly. In reality, test takers may not always respond to test items in ways that are consistent with the IRT model. Some possible reasons for this "nonfitting" behavior include test anxiety, guessing, cheating (achievement tests), or faking responses (personality inventories).
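The report does not state which IRT model was used; as a minimal illustration of the idea, the widely used two-parameter logistic (2PL) model expresses the probability of a correct response as a function of the test taker's ability and two item parameters. The function name and parameter values below are illustrative, not taken from the report.

```python
import math

def prob_correct(theta, a, b):
    """2PL IRT model: probability that a test taker with ability theta
    answers correctly an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A higher-ability test taker has a larger probability of answering
# a given item correctly than a lower-ability one.
p_high = prob_correct(2.0, 1.0, 1.0)   # ability above item difficulty
p_low = prob_correct(-1.0, 1.0, 1.0)   # ability below item difficulty
```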
Several person-fit statistics have been proposed to detect test takers with nonfitting response behavior. To evaluate these statistics, their distributions under the null assumption of perfect model fit must be known. For conventional standardized tests, the distributions of several fit statistics have been investigated, and the results are well documented. However, less is known about these distributions for adaptive tests.
In this paper, three simulation studies are reported, each examining a frequently used person-fit statistic: the standardized log-likelihood statistic, lz. In the first study, the null distribution of the lz statistic (i.e., the distribution one would expect if test takers respond in the manner predicted by the mathematical model) was estimated both for the true ability level and for an estimate of it. In the second study, the null distributions for a fixed-format test and a CAT were compared. In the final study, the power of the statistic to detect several types of aberrant response behavior in CAT was investigated.
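For concreteness, the lz statistic standardizes the log-likelihood of an observed response pattern by its model-implied mean and variance, so that large negative values signal response patterns that are unlikely under the model. The sketch below assumes the item response probabilities have already been computed from the IRT model; the function name is illustrative.

```python
import math

def lz_statistic(responses, probs):
    """Standardized log-likelihood person-fit statistic lz.

    responses: list of 0/1 item scores.
    probs: model-implied probabilities of a correct response,
           one per item, each strictly between 0 and 1.
    """
    # Log-likelihood of the observed response pattern.
    l0 = sum(u * math.log(p) + (1 - u) * math.log(1 - p)
             for u, p in zip(responses, probs))
    # Model-implied expectation and variance of the log-likelihood.
    expected = sum(p * math.log(p) + (1 - p) * math.log(1 - p)
                   for p in probs)
    variance = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2
                   for p in probs)
    return (l0 - expected) / math.sqrt(variance)

# Fitting pattern: correct answers where the model expects them.
probs = [0.9, 0.8, 0.7, 0.6]
fit = lz_statistic([1, 1, 1, 1], probs)      # positive lz
misfit = lz_statistic([0, 0, 0, 0], probs)   # negative lz
```

Under perfect model fit and known ability, lz is intended to be approximately standard normal; the studies in this report examine how far the actual null distribution departs from that approximation in CAT.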
The results indicated that the empirical null distribution of lz for CATs differed from that for fixed-format tests, and that the differences were larger when estimates of the ability parameters were used rather than their true values. As a result, the detection rates for different types of aberrant behavior were generally low. However, the results also indicated that the estimation errors in the ability estimates due to aberrant behavior on some items were not dramatically different from those for test takers who behaved according to the model on all items. Thus, in practice, ability can be estimated reasonably well even if the response behavior of a test taker does not fit the model for some of the items.