Assessing Subgroup Differences in Item Response Times (CT-97-03)
by Deborah L. Schnipke and Peter J. Pashley

Executive Summary

The potential always exists for subgroup differences in test performance to be attributable to the content, wording, or other irrelevant aspects of particular items, rather than to real differences in the knowledge, skills, or developed abilities of interest. Several methods for exploring differential item functioning, or DIF, have been developed to investigate such subgroup differences statistically. All DIF procedures identify items that are relatively more difficult for some subgroups than would be expected from their performance on other items in the test.

Differences in test performance may also be due, in part, to differential response-time rates between subgroups. Differences in response-time rates of items administered early in the test may not affect item-score performance on those items, but may cause the overall form to be excessively time-consuming for some subgroups. Because of strict time limits, excessively time-consuming forms may lead to lower test scores because test takers simply do not have enough time to work on every item. Statistical measures are needed to evaluate the significance of differential response-time rates that are unrelated to the ability or proficiency levels of interest.

The goal of the present study is to investigate methodology for assessing subgroup differences in response-time rates. Rather than considering just the average response time (mean or median) for various subgroups, we take the more powerful approach of looking at the entire distribution of response times for each item using survival-analysis methodology. Additionally, we investigate whether certain subgroups spend more time on items at the beginning of the test, and whether this is associated with those subgroups either running out of time more often than other subgroups or simply spending relatively less time on the later items. The methodology is applied to data from a national, high-stakes computer-administered test. The subgroups we compare are primary and nonprimary speakers of English because it seems likely that there will be response-time differences between these two subgroups.
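The report itself contains no code, but the core idea of the survival-analysis approach can be sketched briefly. For each item, one estimates the survival function S(t), the proportion of test takers still working on the item after t seconds, separately for each subgroup, and then compares the resulting curves. A minimal illustration (assuming fully observed response times in seconds; an operational analysis would also treat times cut short by the test clock as censored observations) might look like:

```python
from collections import Counter

def survival_curve(times):
    """Empirical survival function for one subgroup on one item.

    Returns a list of (t, S(t)) pairs, where S(t) is the proportion
    of test takers whose response time exceeds t seconds.
    NOTE: this simple sketch assumes every response time was fully
    observed; censoring (time expiring mid-item) is not handled.
    """
    n = len(times)
    counts = Counter(times)
    curve = []
    still_working = n
    for t in sorted(counts):
        still_working -= counts[t]       # these test takers finish at time t
        curve.append((t, still_working / n))
    return curve

# Hypothetical response times (seconds) for one item, two subgroups:
primary = [35, 42, 42, 58, 61]
nonprimary = [48, 55, 70, 84, 90]
curve_a = survival_curve(primary)
curve_b = survival_curve(nonprimary)
```

Plotting the two curves on one set of axes gives the kind of graphical comparison the report describes: if one subgroup's curve lies consistently above the other's, that subgroup is systematically spending more time on the item.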

Based on our results, it appears that survival-analysis methodology is useful for uncovering subgroup differences in response times. The analyses used in this study were easy both to conduct and to interpret. Many statistical packages provide the types of survival analyses used in the present study. While the methodology for determining whether test takers ran out of time on later items is not as readily available, it is at least easily understood. The graphical approach used here should be especially helpful to test developers who are attempting to construct fair and equitable assessments.

For a number of items in this study, significant differences between the response-time distributions of the two subgroups investigated were found. This does not necessarily imply that these should have been identified as DIF items. First, we would naturally expect test takers with fewer English skills to take longer answering items. Second, as with most standard DIF procedures, the effect size should be taken into consideration. Although it was not done explicitly in this study, one could examine the response-time curves produced to determine whether any preset difference levels between the curves have been exceeded. Finally, in operational DIF analyses, items identified by the DIF procedures are always reviewed by test specialists who determine whether the effects observed are truly problematic or could more reasonably be attributed to statistical error. Such a review should be done for any items flagged for differential response-time rates as well.
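One simple way to operationalize the preset-difference check mentioned above is to compute the largest vertical gap between the two subgroups' empirical survival curves and compare it to a chosen threshold, in the spirit of a Kolmogorov-Smirnov statistic. The function and the 0.2 flagging threshold below are illustrative assumptions, not part of the report's procedure:

```python
def max_curve_gap(times_a, times_b):
    """Largest vertical gap between the empirical survival curves of
    two subgroups' response times: a crude effect-size measure.
    Assumes fully observed (uncensored) response times."""
    def surv(times, t):
        # Proportion of the subgroup still working on the item after t seconds.
        return sum(x > t for x in times) / len(times)
    grid = sorted(set(times_a) | set(times_b))
    return max(abs(surv(times_a, t) - surv(times_b, t)) for t in grid)

# Hypothetical use: flag the item for specialist review if the curves
# ever differ by more than 0.2 (an arbitrary preset level).
gap = max_curve_gap([35, 42, 42, 58], [48, 55, 70, 84])
flagged = gap > 0.2
```

As with statistically significant DIF results, exceeding such a threshold would only mark the item for the specialist review described above, not condemn it outright.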
