The Impact of Item Location Effects on Ability Estimation in CAT: A Simulation Study (CT-01-06)
by Mei Liu, Renbang Zhu, and Fanmin Guo, Educational Testing Service
The past decade has seen an increasing number of testing institutions implement large-scale high-stakes computerized adaptive testing (CAT) programs. In these CAT programs, test takers are presented with items, selected from an item pool, that are tailored to their ability levels while at the same time satisfying content and other psychometric specifications. Each collection of items administered to a test taker is, in effect, a form or edition of the computerized adaptive test. Most CAT programs rely on a mathematical model called item response theory (IRT), which makes it possible to compare the proficiency levels of test takers even though they have responded to different collections of items.
One of the fundamental assumptions of IRT—local independence—posits that the statistical characteristics of an item are independent of the other items surrounding it in the test. One aspect of this assumption concerns the order of item presentation: the psychometric characteristics of items (such as item difficulty and discrimination power) should remain invariant regardless of when and where the items appear in a test. Meeting this assumption adequately is vital in CAT, since items do not appear in fixed positions as they do in paper-and-pencil tests. Moreover, because items are selected dynamically for each test taker during an adaptive test session, there is no easy way of controlling when and where an item appears. It is therefore imperative that the initial estimates of item characteristics obtained from pretesting continue to hold in an operational CAT environment.
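The invariance at issue can be made concrete with an item response function. The sketch below is illustrative only and is not taken from the report: it uses a two-parameter logistic (2PL) IRT model to show how a shift in an item's difficulty parameter b changes the probability of a correct response for a test taker of fixed ability theta. All parameter values are hypothetical.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability that a test taker of
    ability theta answers correctly an item with discrimination a and
    difficulty b (all on the same logit scale)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical values: an item pretested at difficulty b = 0.0,
# answered by a test taker of matching ability theta = 0.0.
p_pretest = p_correct(0.0, 1.0, 0.0)   # exactly 0.5 at theta == b
# If the item's operational difficulty drifts to b = -0.5 (the item
# becomes easier), the same test taker's success probability rises,
# even though scoring still uses the pretest value of b.
p_shifted = p_correct(0.0, 1.0, -0.5)  # greater than 0.5
```

Ability estimation based on the pretest b then systematically misreads such responses, which is the mechanism the study examines.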
Most testing organizations obtain estimates of item characteristics for their CAT programs by pretesting (or trying out) items in pre-assembled sections either as an external test section or as item groups seeded throughout an operational test section. This means that the items are administered in fixed positions. When these items become operational, their delivery positions can be very different from their pretest delivery positions. The successful implementation of these CAT programs thus rests on the assumption of invariance within reasonable bounds of item characteristics across the pretesting and operational testing environments.
If the assumption of item characteristic invariance over item position is not adequately met, the item statistics (such as item difficulty estimates) from pretest data where item position is fixed cannot be generalized to the operational testing environment where item delivery position is unpredictable. Adaptive testing is especially vulnerable to such shifts in item characteristics, because there is no feasible mechanism to neutralize the context effects (e.g., to keep the effects relatively constant for all test takers). Furthermore, the very nature of the adaptive testing process also means that, although all test takers may be affected, they may not be affected in the same way or to the same degree (differential impact).
The purpose of this study was to evaluate the impact of item difficulty shift on ability estimation in CAT as a result of change in item delivery positions from pretest to operational administrations. Three factors were investigated: (1) the direction of the shift (items either became easier or more difficult), (2) the degree of the shift in item difficulty, and (3) the pretest and operational delivery positions of these shifted-difficulty items.
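The report's actual simulation machinery is not reproduced here; the following is a minimal, hypothetical Python sketch of the design just described. Items are selected by maximum Fisher information using their pretest difficulties, responses are generated from each item's "true" operational difficulty (its pretest b plus a shift for the studied items), and ability is still estimated with the pretest values, so the estimate absorbs the mismatch. The grid-search estimator, test length, and all parameter values are illustrative assumptions, not the study's specifications.

```python
import math
import random

def p_correct(theta, a, b):
    # 2PL item response function
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    # Fisher information of a 2PL item at ability theta
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def mle_theta(items, responses):
    # Crude grid-search maximum-likelihood ability estimate
    grid = [g / 10.0 for g in range(-40, 41)]
    def loglik(th):
        ll = 0.0
        for (a, b), u in zip(items, responses):
            p = p_correct(th, a, b)
            ll += math.log(p if u else 1.0 - p)
        return ll
    return max(grid, key=loglik)

def run_cat(true_theta, pool, shifted_ids, shift, rng, test_len=20):
    """One simulated adaptive test: items are chosen by maximum
    information at the current ability estimate using *pretest*
    difficulties; responses are generated from the *operational*
    (possibly shifted) difficulties; scoring uses the pretest values."""
    available = set(range(len(pool)))
    administered, responses, theta_hat = [], [], 0.0
    for _ in range(test_len):
        j = max(available, key=lambda i: info(theta_hat, *pool[i]))
        available.remove(j)
        a, b = pool[j]
        b_true = b + (shift if j in shifted_ids else 0.0)
        responses.append(rng.random() < p_correct(true_theta, a, b_true))
        administered.append((a, b))
        theta_hat = mle_theta(administered, responses)
    return theta_hat
```

Running `run_cat` with `shift = 0.0` gives a simulee's baseline score; rerunning it with, say, `shift = -0.5` applied to some subset of the pool makes those items operationally easier than their pretest statistics claim, and the difference between the two ability estimates is that simulee's score impact. Varying the sign and size of the shift and which positions the shifted items occupy corresponds to the three factors listed above.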
The results indicate that a large shift in item difficulty produced more score impact than a smaller shift. In addition, more impact was observed when the studied items were made easier than when they were made harder. Furthermore, simulated test takers tended to experience more impact as the number of shifted-difficulty items they encountered increased. This was particularly evident when these items appeared early in a test. Finally, early appearance of the shifted-difficulty items led to large routing effects, which in turn seemed to produce more scores that differed from their baseline scores by 5 points or more.
The research presented here relates directly to computerized item-by-item adaptive testing environments. When more constrained adaptive testing procedures are used, context effects such as item position can be mitigated or even kept constant. For example, within the multiple-form structure (MFS) methodology being investigated at the Law School Admission Council (LSAC), intact forms can be calibrated (or tried out) before they are delivered operationally, much as current paper-and-pencil forms are.