Quality Control of Online Calibration in Computerized Assessment (CT-97-15)
by C. A. W. Glas, University of Twente, Enschede, The Netherlands
The introduction of computerized adaptive testing (CAT) has made it necessary to build large pools of test items with the item statistics (commonly called parameters) needed to describe the characteristics of the items. The process of obtaining item parameters usually consists of the following two stages:
1. Pretest stage. In a series of sessions, sets of items are administered to groups of test takers, and a mathematical model called item response theory (IRT) is used to obtain estimates of item parameters representing such features as item difficulty, discriminating power (the ability of the item to distinguish between more and less able test takers), or susceptibility to guessing.
2. Online stage. The test is operational and administered online, but the responses are also used for parameter estimation, for example, to keep improving the precision of previous estimates or to obtain estimates for new items added to the pool.
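The three item features named in the pretest stage (difficulty, discriminating power, and susceptibility to guessing) correspond to the parameters of the three-parameter logistic (3PL) model, a standard IRT model. The abstract does not name the model, so the choice of 3PL here is an assumption, and the function name and argument names are hypothetical. A minimal sketch of the response probability:

```python
import math


def p_correct(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: ability of the test taker
    a: discrimination parameter of the item
    b: difficulty parameter of the item
    c: guessing parameter (lower asymptote) of the item

    Assumption: the 3PL form is used for illustration only; the report
    itself may parameterize the model differently.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # An item of average difficulty answered by a test taker of average ability.
    print(p_correct(theta=0.0, a=1.0, b=0.0, c=0.2))
```

In this form, a test taker whose ability equals the item difficulty answers correctly with probability halfway between the guessing parameter and 1, and a test taker of very low ability answers correctly with probability close to the guessing parameter.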
This paper proposes that methods of quality control be used in the calibration process, for example, to check whether the values of the item parameters have drifted between the pretest and the online stage. If parameter drift is found, the response data cannot be pooled to increase the precision of the parameter estimates. Methods of quality control can also be used to detect security breaches in an online stage. Three different statistics for quality control are proposed: (1) a Lagrange multiplier (LM) statistic; (2) a Wald statistic; and (3) a cumulative sum (CUSUM) statistic. The power of the tests based on these statistics, that is, their ability to detect shifts in the parameter values, was evaluated.
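To illustrate the general idea behind a CUSUM statistic, the sketch below implements a generic one-sided CUSUM chart of the kind used in statistical quality control. It is not the specific statistic defined in the report. The inputs are assumed to be standardized differences between an item parameter estimate from each successive batch of online responses and its pretest value; the function name, the reference value, and the threshold are hypothetical choices made for illustration.

```python
def cusum_upper(z_values, reference=0.5, threshold=3.0):
    """Generic one-sided (upper) CUSUM chart.

    z_values: standardized deviations, one per batch of online responses
              (assumed input, not the report's definition)
    reference: allowance subtracted from each deviation before accumulation
    threshold: value the cumulative sum must exceed to signal a shift

    Returns the path of the cumulative sum and the index of the first
    batch at which the threshold is exceeded (None if it never is).
    """
    s = 0.0
    path = []
    alarm_at = None
    for k, z in enumerate(z_values):
        # Accumulate only evidence of an upward shift; reset at zero.
        s = max(0.0, s + z - reference)
        path.append(s)
        if alarm_at is None and s > threshold:
            alarm_at = k
    return path, alarm_at


if __name__ == "__main__":
    # Three in-control batches followed by a sustained upward shift.
    path, alarm_at = cusum_upper([0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0])
    print(path, alarm_at)
```

Because the sum accumulates small deviations over successive batches, a chart of this kind can signal a gradual drift that no single batch would reveal on its own.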
The tests had moderate to good power to detect shifts in the values of the guessing and difficulty parameters. In addition, all tests were equally sensitive to shifts in the values of all parameters, even if the null hypothesis of no shift was formulated for only one of them. This result is not surprising because estimates of the parameters in the model evaluated are usually highly correlated. The practical conclusion from the study is that all of these statistics are well suited to detecting that the item parameters have changed, but that it may be difficult to attribute the change to specific parameters.