A Simulation Study of Optimal Online Calibration of Testlets Using Real Data (CT-00-07)
by Douglas H. Jones and Mikhail S. Nediak, Rutgers, The State University of New Jersey

Executive Summary

This research investigates new ways to estimate the characteristics of pretest items, grouped by their relation to a reading passage or other common stimulus. Currently, practitioners use a method called spiraling to obtain a random sample of item response data for estimating item characteristics. However, a random sample of item responses cannot tell how items that are difficult for average-ability test takers would perform for high-ability test takers. In addition, random samples of item responses provide an excess of information for items of average difficulty and discrimination.
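This ability-difficulty matching effect can be illustrated with a small sketch. The sketch assumes a two-parameter logistic (2PL) item response model purely for illustration (the summary does not specify the study's exact model): under 2PL, the Fisher information a single response contributes about an item's difficulty is a^2 * P * (1 - P), which peaks when the test taker's ability equals the item's difficulty. The item parameters below are invented for the example.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def difficulty_information(theta, a, b):
    """Fisher information about difficulty b from one response at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# A hard item (b = 2): a matched high-ability test taker contributes far
# more information about b than an average-ability test taker does.
a, b = 1.2, 2.0
info_avg = difficulty_information(0.0, a, b)   # theta = 0 (average ability)
info_high = difficulty_information(2.0, a, b)  # theta = 2 (matched to the item)
print(info_avg, info_high)
```

The same calculation shows the converse inefficiency: an item of average difficulty draws high information from nearly every test taker in a random sample, more than is needed once its parameters are pinned down.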

These new methods take advantage of the online test-taking environment by selecting the best test takers for the new pretest items. Thus, items receive responses from test takers whose abilities can provide the most information about the items’ characteristics. Estimates of test takers’ abilities are available in the online setting from their performances on operational sections of a test. Response data are collected in batches, and after each batch the item characteristics of the new items are re-estimated. Sampling stops once enough information has been gathered or a limit on the sample size is reached. The new methods build on established statistical methods in optimal design of experiments, originated in the 1920s by Sir R. A. Fisher, and in sequential sampling, originated by A. Wald during World War II.
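The batch-collect, re-estimate, stop-or-continue loop can be sketched as follows. This is a hypothetical illustration, not the report's implementation: it assumes a 2PL item model, a D-optimality criterion (maximizing the determinant of the accumulated information matrix for the item parameters), and a simple greedy per-batch selection rule; all function names, thresholds, and parameter values are invented for the example, and the re-estimation step is left as a comment.

```python
import math
import random

def info_matrix(theta, a, b):
    """Per-response Fisher information matrix for (a, b) under a 2PL model."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    w = p * (1.0 - p)
    d = theta - b
    return [[w * d * d, -w * a * d],
            [-w * a * d, w * a * a]]

def mat_add(m, n):
    return [[m[i][j] + n[i][j] for j in range(2)] for i in range(2)]

def det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def sequential_batch_design(pool, a_hat, b_hat, batch_size=50,
                            info_target=40.0, max_sample=800):
    """Select batches of test-taker abilities that most increase det(M),
    the D-optimality criterion; stop once the information target or the
    sample-size cap is reached."""
    m = [[1e-6, 0.0], [0.0, 1e-6]]  # small ridge so det(M) starts defined
    chosen = []
    while det(m) < info_target and len(chosen) < max_sample and pool:
        # Greedy rule: rank available test takers by the determinant their
        # single response would yield, then take the top batch_size of them.
        ranked = sorted(pool,
                        key=lambda t: det(mat_add(m, info_matrix(t, a_hat, b_hat))),
                        reverse=True)
        batch = ranked[:batch_size]
        for t in batch:
            m = mat_add(m, info_matrix(t, a_hat, b_hat))
            pool.remove(t)
        chosen.extend(batch)
        # In a real calibration session, responses from this batch would be
        # scored here and (a_hat, b_hat) re-estimated before the next batch.
    return chosen, m

random.seed(1)
pool = [random.gauss(0.0, 1.0) for _ in range(2000)]  # operational ability estimates
chosen, m = sequential_batch_design(pool, a_hat=1.2, b_hat=2.0)
```

In the sketch the loop typically stops well before the 800-response cap, because the greedy rule concentrates responses where they are most informative about the current parameter estimates.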

To date, research has focused on the online collection of item-response data for several unrelated items, either in isolation or simultaneously. However, no practical solutions have addressed the online collection of item-response data for items in natural groupings, or item sets, such as testlets or common-stimulus formats. In this paper, we assume a situation in which a calibration session must collect online data about several pretest testlets simultaneously.

We wish to compare the efficacy of two sequential optimal design schemes and a spiraling scheme. This investigation uses 28,184 live responses to 101 LSAT items grouped into testlets by common reading-passage stimuli. Online testing is simulated with the collection of 800 responses per testlet. Twenty-three items, naturally grouped into four testlets, are calibrated; the remaining 78 items provide the basis for test-taker ability estimates.

For most items, our sequential optimal methods performed very well. However, two testlets, representing over half of the items, were not well estimated by the spiraling design, owing to the extreme values of some of the item parameters. In these cases, the sequential optimal designs achieved superior performance by using information about the extreme parameter values to steer the most informative test takers to these testlets.

The results imply that sequential optimal designs perform much better than spiraling designs, even when items must be given in testlet format. The cost of implementing a sequential optimal design should be offset by avoiding the opportunity loss incurred by a spiraling design.

With a sequential optimal design, calibration accuracy could remain acceptable at sample sizes as small as 600. Such a small sample size, relative to that required for accurate spiraling, has many implications; for instance, sequential optimal methods could calibrate many more items than spiraling in the same time period.
