An Empirical Bayes Enhancement of Mantel-Haenszel DIF Analysis for Computer-Adaptive Tests (CT-98-15)
by Rebecca Zwick, University of California, Santa Barbara and
Dorothy T. Thayer, Educational Testing Service
The Law School Admission Council (LSAC) is currently investigating the feasibility of implementing a computer-adaptive version of the Law School Admission Test (LSAT). The introduction of computer-adaptive tests (CATs) requires that new approaches be developed for the analysis of item properties, including differential item functioning (DIF). DIF is said to occur when test takers from two demographic groups (say, men and women) perform differently on an item even after they have been matched in terms of overall ability. The presence of DIF may point to unintended sources of difficulty in the item (e.g., a math item may require sports knowledge that is more common in men than in women). A significant technical challenge in assessing DIF in CATs is the need to develop a method that will produce stable* results in small samples: Even if the total number of test takers for a CAT is large, the number of responses to some items may be very small.
Currently within the testing industry, the Mantel-Haenszel (MH) DIF analysis method is the one most commonly used to detect DIF on paper-and-pencil tests; in fact, it is the analysis used for the LSAT. A body of statistical methods referred to as empirical Bayes (EB) methods is known to be capable of producing stable statistical results with fewer test takers. The present study investigated the applicability to CAT data of a DIF analysis method that involves an EB enhancement of the popular MH DIF analysis method.
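The MH analysis compares the odds of a correct response for the two groups within each matched ability level. As a sketch of the standard computation (the function and variable names here are illustrative, not taken from the study), the common odds ratio pools the per-level 2x2 tables, and the result is often rescaled to the ETS delta metric as MH D-DIF:

```python
import math

def mh_ddif(strata):
    """Mantel-Haenszel common odds ratio and MH D-DIF statistic.

    strata: one 2x2 table per matched ability level k, given as
    (R_rk, W_rk, R_fk, W_fk): right/wrong counts for the reference
    and focal groups at that level.
    """
    num = den = 0.0
    for r_ref, w_ref, r_foc, w_foc in strata:
        n = r_ref + w_ref + r_foc + w_foc
        if n == 0:
            continue  # skip empty ability levels
        num += r_ref * w_foc / n
        den += w_ref * r_foc / n
    alpha = num / den  # common odds ratio; 1.0 means no DIF
    # ETS delta scale: negative values indicate DIF against the focal group
    return alpha, -2.35 * math.log(alpha)
```

For example, strata in which both groups have identical odds of success, such as `[(20, 10, 10, 5)]`, yield an odds ratio of 1 and an MH D-DIF of 0.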
The computerized LSAT test design assumed for this study was similar to that currently being evaluated for a potential computerized LSAT. Here, rather than being presented with a single test item at a time, test takers are presented with small groups of items, commonly referred to as testlets. The CAT pool for this research consisted of 10 five-item testlets at each of three difficulty levels. The item parameters, which are statistics that describe various item characteristics such as item difficulty, were specified to resemble those typically observed for the LSAT. Using these item-level statistics, responses to the test questions were generated for simulated test takers. The simulations comprised four conditions that varied in terms of group sample sizes and group ability distributions; both of these factors are known to affect the performance of DIF methods. Sample sizes for the two test taker groups were either 1,000 or 3,000 (before application of the CAT algorithm). The ability distributions for the two groups were either identical or differed by one standard deviation.
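The response-generation step can be illustrated with a short sketch. The specific model and parameter values below are assumptions for illustration (a three-parameter logistic item response model, with made-up parameters for one five-item testlet), not the values used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_correct(theta, a, b, c):
    """3PL item response function: probability of a correct answer
    given ability theta and item discrimination a, difficulty b,
    and guessing parameter c."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

# One condition: 1,000 simulated test takers per group; the focal
# group's ability distribution is shifted down by one standard deviation
theta_ref = rng.normal(0.0, 1.0, 1000)
theta_foc = rng.normal(-1.0, 1.0, 1000)

# Hypothetical item parameters for a single five-item testlet
a = np.array([1.0, 0.8, 1.2, 0.9, 1.1])
b = np.array([-0.5, 0.0, 0.3, 0.8, 1.2])
c = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

# Simulated right/wrong responses: compare a uniform draw to each
# test taker's probability of answering each item correctly
resp_ref = rng.random((1000, 5)) < p_correct(theta_ref[:, None], a, b, c)
resp_foc = rng.random((1000, 5)) < p_correct(theta_foc[:, None], a, b, c)
```

In the full design, an adaptive algorithm would then route each simulated test taker to testlets of appropriate difficulty, which is why per-item sample sizes can end up much smaller than the total group sizes.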
The results showed the performance of the EB DIF approach to be very promising, even in extremely small samples. The EB estimates tended to be closer to their target values than did the ordinary MH statistics; the EB statistics were also more highly correlated with the true DIF values than were the MH statistics.
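The stability gain comes from shrinkage: each item's noisy MH statistic is pulled toward the distribution of DIF values across the pool, with noisier estimates pulled harder. The sketch below assumes a standard normal-normal empirical Bayes model with a method-of-moments prior; the exact prior specification used in the study may differ:

```python
def eb_shrink(mh_stats, mh_vars):
    """Empirical Bayes shrinkage of MH D-DIF statistics (normal-normal
    model sketch). mh_stats: observed MH statistics, one per item;
    mh_vars: their estimated sampling variances."""
    m = len(mh_stats)
    prior_mean = sum(mh_stats) / m
    # Method-of-moments prior variance: spread of the observed statistics
    # minus the average sampling noise, floored at zero
    total_var = sum((x - prior_mean) ** 2 for x in mh_stats) / (m - 1)
    prior_var = max(total_var - sum(mh_vars) / m, 0.0)
    posterior = []
    for x, v in zip(mh_stats, mh_vars):
        w = prior_var / (prior_var + v)  # weight on the observed statistic
        posterior.append(w * x + (1.0 - w) * prior_mean)
    return posterior
```

Items measured with large sampling variance (few responses) get a small weight `w`, so their estimates move strongly toward the prior mean; precisely measured items are left nearly unchanged.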
*An estimation procedure is considered stable if the estimates tend to be close to their target values.