Some Advantages of Using IRT in Clinical Assessment
One major advantage of IRT scaling is that it permits clinicians to administer different subsets of items to different individuals, or to the same individual on different occasions, and obtain scores that are on the same scale and thus directly comparable. Direct comparison of percent correct or number correct scores across different tests or item subsets is not possible because the total number correct and its linear transformations do not take into account the relative difficulty of the items. For example, although a total correct score of 70 on the PNT and a total correct score of 24 on the Boston Naming Test (BNT; Kaplan et al., 2001) both correspond to 40% correct, they do not reflect the same level of anomia severity because the BNT items are, on average, more difficult than the PNT items. The 1-PL IRT model places items and persons on a common scale, and the scores derived from that scale do not depend on the particular items administered, assuming adequate model fit and application to the appropriate clinical population (i.e., stroke survivors with aphasia). As explained below, this facilitates adaptive testing and assessment with alternative test forms that do not repeat specific item content and thus avoid some kinds of test practice effects.
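To make the common scale concrete: the 1-PL model expresses the probability of a correct response as a logistic function of the difference between person ability and item difficulty, both expressed in logits. The following minimal Python sketch illustrates the idea; the ability and difficulty values are hypothetical and chosen only for illustration, not taken from the PNT or BNT calibrations.

```python
import numpy as np

def p_correct(theta, b):
    """1-PL (Rasch) model: probability of a correct naming response
    for a person with ability theta on an item with difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Hypothetical values on the logit scale:
theta = -0.5                        # moderate anomia
easy_item, hard_item = -2.0, 1.5    # e.g., 'cat' vs. 'stethoscope'
print(p_correct(theta, easy_item))  # ~0.82: correct response likely
print(p_correct(theta, hard_item))  # ~0.12: correct response unlikely
```

Because ability and difficulty share one scale, the same theta can be estimated from any calibrated item subset, which is what makes scores from different forms directly comparable.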
A second advantage of IRT modeling is that it provides individualized estimates of the measurement error associated with each score estimate, referred to as the standard error or standard error of measurement. The standard error of measurement can be used to determine the extent to which the difference between two individual score estimates exceeds measurement error and is thus likely to represent a real underlying difference. In the present application, the amount of measurement error associated with a given score estimate is determined primarily by two factors: (1) the number of items administered and (2) the degree to which the difficulty of the administered items is targeted to the ability of the person being tested. All other things being equal, administering more items will decrease measurement error and increase the precision of a score estimate. Regarding item-person targeting, it is intuitively apparent that if a group of persons with severe aphasia is given a test composed entirely of easy-to-name items, such as ‘cat’, ‘ear’, ‘bed’, ‘hat’, and ‘tree’, their scores will vary meaningfully and it will be possible to rank-order them in terms of the number they name correctly. On the other hand, if that same group were given a test composed of difficult-to-name items, such as ‘microscope’, ‘stethoscope’, ‘pyramid’, ‘binoculars’, and ‘volcano’, they would likely all receive the same minimum score and rank-ordering them in terms of number correct would not be possible. Conceptually, when item difficulty and person ability are closely matched, the predicted probability of a correct response is close to 0.5, it is uncertain whether the item will be named correctly, and thus the observed response provides more useful information. When item difficulty and person ability are further apart, the predicted probability approaches 0 or 1, a more confident prediction about whether the item will be named correctly can be made, and the observed response provides correspondingly less new information. For an interactive graphical presentation of the relationship between participant ability, item difficulty, and the predicted probability of a correct response, see https://aswiderski.shinyapps.io/ShinyIRT/.
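The link between item-person targeting and measurement error can be made explicit. Under the 1-PL model, the Fisher information an item contributes is p(1 − p), which peaks when the predicted probability is 0.5, and the standard error of the ability estimate is approximately the inverse square root of the total information across administered items. The sketch below, using hypothetical ability and difficulty values, shows how a well-targeted item set yields a much smaller standard error than an equally long, poorly targeted one.

```python
import numpy as np

def p_correct(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information for a 1-PL item: p * (1 - p),
    maximized when item difficulty matches person ability (p = 0.5)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

# Hypothetical severe-anomia ability and two three-item sets (logits):
theta = -2.0
targeted_items = np.array([-2.5, -2.0, -1.5])   # near the person's ability
mistargeted_items = np.array([1.5, 2.0, 2.5])   # far too difficult

for items in (targeted_items, mistargeted_items):
    info = item_information(theta, items).sum()
    sem = 1.0 / np.sqrt(info)   # standard error of the ability estimate
    print(f"total information = {info:.3f}, SEM = {sem:.2f}")
```

With these invented values, the targeted set yields a standard error roughly a quarter the size of the mistargeted set's, despite both containing three items.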
A third major advantage of IRT modeling is that it can be used to support computer adaptive test (CAT) administration. The primary advantage of adaptive testing is that it minimizes the loss of measurement precision associated with shortening a test, especially for individuals with particularly high or particularly low ability. For the current PNT application, CAT will provide the most benefit relative to a static short form for individuals with especially mild or severe anomia. The adaptive versions of the PNT begin with an item of average difficulty (e.g., ‘pumpkin’) and select each new item to target the current naming score estimate based on the responses collected up to that point.
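The logic of such an adaptive loop can be summarized in a few lines of code. The sketch below is a minimal illustration under the 1-PL model, assuming a hypothetical item bank with invented difficulty values; the item names, helper functions, and selection rule (administer the unused item whose difficulty is closest to the current ability estimate) are illustrative only, and the actual PNT-CAT selection and scoring procedures may differ.

```python
import numpy as np

def p_correct(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def estimate_theta(responses, difficulties, grid=np.linspace(-4, 4, 801)):
    """Maximum-likelihood ability estimate via a simple grid search."""
    r = np.asarray(responses)
    b = np.asarray(difficulties)
    p = p_correct(grid[:, None], b[None, :])
    loglik = (r * np.log(p) + (1 - r) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

def run_cat(item_bank, answer_item, n_items=10, start_theta=0.0):
    """Simple 1-PL CAT: at each step, administer the unused item whose
    difficulty is closest to the current ability estimate."""
    remaining = dict(item_bank)   # item name -> difficulty (logits)
    used_b, responses = [], []
    theta = start_theta
    for _ in range(n_items):
        name = min(remaining, key=lambda k: abs(remaining[k] - theta))
        used_b.append(remaining.pop(name))
        responses.append(answer_item(name))       # 1 = correct, 0 = incorrect
        if 0 < sum(responses) < len(responses):   # ML needs a mixed pattern
            theta = estimate_theta(responses, used_b)
    return theta

# Hypothetical usage: an invented item bank and a simulated respondent.
rng = np.random.default_rng(0)
bank = {'pumpkin': 0.0, 'cat': -2.5, 'bed': -2.0, 'tree': -1.5,
        'hat': -1.0, 'ear': -0.5, 'volcano': 0.5, 'pyramid': 1.0,
        'binoculars': 1.5, 'stethoscope': 2.0, 'microscope': 2.5}
true_theta = -1.0
simulate = lambda name: int(rng.random() < p_correct(true_theta, bank[name]))
print(run_cat(bank, simulate, n_items=8))
```

Note that with a starting estimate of 0, the first item selected is the average-difficulty ‘pumpkin’, mirroring the description above, and each subsequent item is chosen to match the updated score estimate.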