Formal Usability Testing of the Computerized LSAT Prototypes at the 1999 Law School Recruitment Forums (CT-01-01)
Kimberly A. Swygert, Jennifer A. Lawlor, Kira Shteinberg
This research was carried out as part of the second phase of work toward the development of a workable prototype for a computerized Law School Admission Test (LSAT). The first phase of the research consisted of the development of seven preliminary LSAT computerized test (CT) prototypes and the demonstration of these prototypes at the 1995 through 1998 Law School Recruitment Forums. At the outset of this new research phase, a total of three entirely new prototypes were developed: Reading Comprehension (RC), Logical Reasoning (LR), and Analytical Reasoning (AR). Formal usability testing of the new prototypes was carried out during the 1999 Law School Recruitment Forums in place of the informal demonstrations that were used earlier.
The prototypes used in 1999 were designed and developed within that year. The development was guided by a combination of three information sources. The first of these sources was the existing literature on operational CTs and computer adaptive tests (CATs), many of which were given under the same circumstances to the same population, and used for roughly the same purpose that a computerized LSAT would be. The second source was the interface design literature, which gave a method of evaluating the user characteristics and performing a task analysis. Development of the task analysis was made simpler by the fact that the LSAT work environment, user goals, and information needs for both the CAT and pencil-and-paper (P&P) forms of the test are the same: the test in either form is essentially a timed, high-stakes, large-scale admission test for law school. Finally, a summary of the LSAT-CT user characteristics data from the 1995–1998 forums provided the most important guidelines for developing the specifications for the new set of prototypes because the users at earlier forums provided a good range of sex and ethnic backgrounds.
The 1999 prototypes all followed the same format of presenting: (1) a first passage/stimulus/set of conditions along with the first five items, (2) a second set of five items, and (3) a scoring screen. The prototypes allowed the users to use tabs (at the top of the screen) or arrow buttons (at the bottom of the screen) to move through each set of five items. When an item was answered, three indicators were provided for the users, and users could rule out potentially wrong answers by pressing the rule-out button to the right of the option text box. No tutorial, set of directions, or help file was provided for the users.
Two kinds of usability tests were performed on the prototypes during the 1999 forum season: low-level and mixed-level. Low-level usability testing did not require the user to perform complex tasks; the user was instead asked simply to find buttons and use them. In mixed-level testing, the users were first asked to identify tasks and buttons, and then to work through additional parts of the prototypes on their own. Low-level tests were performed on the RC prototype. However, because much of the functionality in the AR and LR prototypes was the same as for the RC prototype, low-level testing was not necessary, and mixed-level testing was carried out on the AR and LR prototypes.
A room was reserved at each of the forum sites to perform the usability tests. The usability test participants were all candidates who had registered for the forum. The users appeared to enjoy the chance to contribute to LSAT research and were excited at the prospect of getting an advance view of the CT. The users also seemed to appreciate the opportunity to interact with Law School Admission Council (LSAC) staff, who were bombarded with questions about the potential computerization of the LSAT.
The results showed that the usability test samples were fairly similar to the LSAT test-taker population in terms of subgroup representation, with the exceptions that the usability test data had a larger proportion of female, African American, and Asian American users, and a smaller proportion of male, Caucasian, and Hispanic users than the LSAT test-taker population. The purpose of the usability tests on these three prototypes was to obtain qualitative data as well as quantitative data in order to get a deeper understanding of how the prototypes should be designed. Most of the sample sizes were small enough that no significance tests were performed on the quantitative data; instead, the percentages of users who performed tasks correctly or in the most efficient manner were tabulated. For the qualitative data, general comments regarding preferences were tallied.
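The tabulation described above, percentages of users who performed each task correctly, broken down by subgroup, can be sketched as follows. This is a minimal illustration only: the function name `percent_correct`, the record layout, and the sample observations are all hypothetical and are not taken from the study's data or analysis procedure.

```python
from collections import defaultdict


def percent_correct(records):
    """Tabulate the percentage of correct performances per (subgroup, task).

    `records` is an iterable of (subgroup, task, correct) tuples, where
    `correct` is a bool. The layout is an assumption made for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # [number correct, number observed]
    for subgroup, task, correct in records:
        cell = counts[(subgroup, task)]
        if correct:
            cell[0] += 1
        cell[1] += 1
    return {key: 100.0 * n_correct / n_total
            for key, (n_correct, n_total) in counts.items()}


# Made-up observations for illustration only; not data from the study.
sample = [
    ("group_a", "rule_out", True),
    ("group_a", "rule_out", False),
    ("group_b", "rule_out", True),
    ("group_b", "rule_out", True),
]
table = percent_correct(sample)
```

With samples this small, a table of this kind supports descriptive comparison only, which is consistent with the decision not to run significance tests.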
The results for the RC prototype showed that at least 90% of the users selected the tabs for navigation, which indicates both an understanding of and a preference for the tab functions. Any set of tasks requiring users to press only one button was fairly intuitive, but when the users had to understand how the toggle functions worked, the usability decreased. Male users were more likely to select RC option letters correctly, but were less likely to use rule-out option buttons correctly. The largest discrepancies in correct performance between African American and Caucasian users were in ruling out options, moving forward to the next set, and identifying the number of items answered. Common comments about RC were that the prototype was easy to use (22 mentions), that users enjoyed using it (13 mentions), that they preferred using the tabs (16 mentions), and that they thought the font was too small or thin (17 mentions). Users liked the timer (25 mentions) and wanted to use its functions.
The LR quantitative data showed that users indicated a strong preference for the tab function on this prototype as well, and again, the tasks that required users to press only one button were overwhelmingly intuitive, while the usability decreased quite a bit when users had to understand the rule-out button toggle function. The discrepancy in selecting option letter buttons that appeared for the RC prototype appeared again, and males were more likely to change letter buttons efficiently, to remove the rule-out from options correctly, to identify the number of items in a set and the number of items answered correctly, and to identify correctly where the number of items in a set was indicated on the screen. The most common comments by far were that users enjoyed using the LR prototype (8 mentions) and that it was easy to use (6 mentions). Users liked the timer (10 mentions) and wanted it to be included on an operational CAT.
As with the other two prototypes, the preference for the tabs on the AR prototype was very high, tasks that required one mouse-click were intuitive, and tasks that required understanding the toggle functions were less intuitive. On this prototype, female users were more likely to change the letter buttons efficiently and were more likely to indicate correctly where on the screen the number of items was provided. African American users were much less likely to change options efficiently, less likely to remove ruled-out options correctly, and less likely to identify the number of items answered on AR correctly.
One of the main problems with the interface was the toggle function, which users were able to operate, but not efficiently. Certain features in the layout enhanced the usability for users, in particular, the timer. The timer may enhance usability because ultimately the LSAT-CT takers will likely be worried about time. The concerns about font size, readability, and optimal layout were most pronounced on the RC prototype, which is to be expected; users know that the RC passages are long and densely worded, and they are understandably concerned about eyestrain and fatigue.
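To make the toggle behavior concrete, the rule-out button can be modeled as a two-state switch: one press rules an option out, and a second press on the same button removes the rule-out. The sketch below is an assumption-based illustration, not the prototypes' actual implementation; the class `AnswerOption` and the method `press_rule_out` are hypothetical names.

```python
class AnswerOption:
    """One answer option with a rule-out toggle (hypothetical model)."""

    def __init__(self, letter):
        self.letter = letter
        self.ruled_out = False  # options start out not ruled out

    def press_rule_out(self):
        """Flip the rule-out state and return the new state."""
        self.ruled_out = not self.ruled_out
        return self.ruled_out


option = AnswerOption("B")
first_press = option.press_rule_out()   # rules the option out
second_press = option.press_rule_out()  # removes the rule-out again
```

Under this model, undoing a rule-out requires the user to realize that the same button reverses its own effect, which is the kind of understanding the toggle tasks tested.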
The users in these datasets were similar in makeup to the forum candidates; compared to the LSAT takers, these users were more likely to be African American and female. When these data were broken down by sex and ethnicity, few differences emerged that were constant across the forum locations and prototypes. There were no consistent ethnic differences across all three prototypes, but the large variance in sample sizes for the three ethnic groups may have prevented real differences from being consistently apparent. These performance results should be incorporated into future development plans and should be assessed by sex and ethnicity when the eventual full-length CAT prototype is examined.