Computerized Adaptive Testing Simulations Using Real Test Taker Responses (CT-96-06)
by Xiang Bo Wang, WeiQin Pan, and Vincent Harris

Executive Summary

Simulations have played an important role in research on computerized adaptive testing (CAT), in studies ranging from estimation accuracy, item exposure, and test security to differential item functioning. However, most researchers would agree that few simulations, however well conducted, can fully capture the psychological and behavioral reality of examinee performance on a test. This is especially true of simulated individual test taker responses.

To preserve the full information embodied in test taker responses, this study used the real responses of 969 Law School Admission Test (LSAT) takers to 127 items to simulate a simple CAT. (A practical constraint of using real data is the limited size of the item and test taker pools.) The objective was to use these test takers' original ability estimates, obtained from their paper-and-pencil (P & P) LSAT, as targets and to see how well simple CAT sessions could recover them. Three basic research questions were investigated: (1) Are the 127 items of one P & P test adequate for a CAT session? (2) How accurately can such a CAT recover individual test taker abilities? (3) How many items do CAT sessions require at three accuracy levels: standard errors of measurement (SEM) of 0.3162, 0.265, and 0.173, which are equivalent to classical reliabilities of 0.90, 0.93, and 0.97, respectively? Note that the reliability of the LSAT is normally about 0.93.
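The stated equivalence between the three SEM targets and the classical reliabilities can be checked with a short calculation. The sketch below assumes the classical relation SEM = SD * sqrt(1 - reliability) and an ability scale with unit standard deviation; the summary does not state the scale, and the function names are illustrative only.

```python
import math

def sem_to_reliability(sem, sd=1.0):
    """Classical reliability implied by an SEM.

    Assumes SEM = sd * sqrt(1 - reliability), with sd the standard
    deviation of the ability scale (unit SD is an assumption here).
    """
    return 1.0 - (sem / sd) ** 2

def reliability_to_sem(reliability, sd=1.0):
    """Inverse of sem_to_reliability under the same assumption."""
    return sd * math.sqrt(1.0 - reliability)

# The three accuracy levels named in the summary.
for sem in (0.3162, 0.265, 0.173):
    print(f"SEM = {sem:<6} -> reliability ~ {sem_to_reliability(sem):.2f}")
```

Under these assumptions the three SEM values round to reliabilities of 0.90, 0.93, and 0.97, matching the figures quoted above.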

Three major findings were obtained. First, the 127 items proved sufficient to conduct simple CAT sessions for virtually all test takers at the three accuracy levels; only two test takers' CAT sessions could not be completed at the highest accuracy level. Second, given the current item pool composition, the highest accuracy level seemed to be necessary to recover the test takers' original P & P ability estimates. Third, the average number of items used in CAT sessions increased from 13 at the lowest accuracy level to 37 at the highest. The overall significance of this study is that it is the first CAT simulation that used real test taker responses.
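To make concrete what it means for a session to be completed, or not, at a given accuracy level, here is a minimal sketch of a CAT session that replays one test taker's recorded responses until a target SEM is reached or the item pool runs out. It is not the authors' procedure: the summary does not name the response model, item selection rule, or ability estimator, so the two-parameter logistic model, maximum-information selection, and EAP estimation with a standard normal prior are all assumptions, and the item parameters and responses are made-up toy data.

```python
import math

# Ability grid for numerical integration (an assumption of this sketch).
GRID = [-4.0 + 0.1 * k for k in range(81)]

def p_correct(theta, a, b):
    """2PL probability of a correct response (model choice is an assumption)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def eap_estimate(administered, responses, items):
    """EAP ability estimate and posterior SD under a standard normal prior."""
    log_w = []
    for theta in GRID:
        lw = -0.5 * theta * theta  # log of the prior, up to a constant
        for idx in administered:
            p = p_correct(theta, *items[idx])
            lw += math.log(p if responses[idx] == 1 else 1.0 - p)
        log_w.append(lw)
    top = max(log_w)  # subtract the maximum to avoid underflow
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    mean = sum(t * wi for t, wi in zip(GRID, w)) / total
    var = sum((t - mean) ** 2 * wi for t, wi in zip(GRID, w)) / total
    return mean, math.sqrt(var)

def run_cat(responses, items, target_sem):
    """Replay recorded responses as a CAT session.

    Returns (theta_hat, sem, n_items, completed). completed is False when
    the pool is exhausted before the target SEM is reached.
    """
    remaining = list(range(len(items)))
    administered = []
    theta, sem = 0.0, float("inf")
    while remaining:
        # Pick the unused item that is most informative at the current estimate.
        nxt = max(remaining, key=lambda i: item_information(theta, *items[i]))
        remaining.remove(nxt)
        administered.append(nxt)
        # Look up the recorded response instead of simulating one.
        theta, sem = eap_estimate(administered, responses, items)
        if sem <= target_sem:
            return theta, sem, len(administered), True
    return theta, sem, len(administered), False

# Toy pool of 40 hypothetical items (a, b) and one made-up response vector.
toy_items = [(1.2, -2.5 + 5.0 * k / 39) for k in range(40)]
toy_responses = [1 if b < 0.5 else 0 for _, b in toy_items]

for target in (0.3162, 0.265, 0.173):
    theta_hat, sem, n_used, done = run_cat(toy_responses, toy_items, target)
    print(f"target={target}: theta={theta_hat:.2f}, sem={sem:.3f}, "
          f"items={n_used}, completed={done}")
```

Because the session follows the same item sequence regardless of the target, a stricter SEM target can never use fewer items than a looser one, and a session that exhausts the pool without reaching the target is reported as not completed.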

