Some Obstacles to Using IRT

One obstacle to using IRT models for clinical assessment is that the relationship between the responses and the resulting score estimate is more complicated than with number-correct or percent-correct scoring. However, many clinical assessments in speech-language pathology appropriately rely on norm-referenced scores that are typically constructed by scaling number-correct scores to the mean and standard deviation of a relatively large sample from the population of interest. That approach is available in an IRT framework as well, and we have chosen to scale the score estimates such that they have a mean of 50 and standard deviation of 10 in the calibration sample. Of note, while norm-referenced scores are often non-linearly transformed to approximate a normal distribution, we have not done so in the current application, even though the distribution of scores in the calibration sample is negatively skewed. We made this choice because such a transformation would have further complicated the relationship between item difficulty, person ability, and the expected score or predicted response. As such, users of this application should be aware that although a T-score of 50 represents the estimated mean score for persons with aphasia and a T-score of 60 indicates performance that is one standard deviation above the mean, these scores do not correspond to the percentile ranks associated with these values in a normal score distribution. However, the differences are relatively small: The T-score at the 50th percentile (median) of the calibration sample was 51.9, and a score of 60 is at the 87th percentile, whereas it would be expected to fall at the 84th percentile if the scores were normally distributed.
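To make the scaling concrete, here is a minimal sketch in Python. The function and variable names (`t_score`, `theta`, `calib_mean`, `calib_sd`) are illustrative, not the application's actual code; the final line simply confirms the percentile comparison above, i.e., that a T-score of 60 would fall at roughly the 84th percentile under a normal distribution.

```python
from scipy.stats import norm

# Linear T-score scaling: an IRT ability estimate (theta) is standardized
# against the calibration sample mean and SD, then rescaled to mean 50, SD 10.
# All names here are illustrative, not taken from the application itself.
def t_score(theta, calib_mean, calib_sd):
    return 50 + 10 * (theta - calib_mean) / calib_sd

# Because the scaling is linear (no normalizing transformation), percentile
# ranks follow the skewed calibration distribution rather than the normal
# curve. Under normality, T = 60 would fall at about the 84th percentile:
print(round(norm.cdf(60, loc=50, scale=10), 3))  # 0.841
```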

A second obstacle to using IRT models in clinical assessment is that they rely on relatively strong assumptions about the data and the nature of the underlying construct being assessed. One such assumption is that the items included in the assessment respond to a single underlying construct or dimension. In the present context, application of the 1-PL IRT model requires the explicit assumption that the ability to name common objects can be represented by a single dimension, trait, or ability. Thus, use of the present application, and for that matter any total-score metric for a naming assessment, ignores the well-supported view that naming ability in aphasia can be impaired along multiple dimensions (e.g., semantics, lexical selection, and phonology). More complicated models are required to measure abilities that are expressed along multiple dimensions (e.g., Walker et al., 2018), and such models are more challenging to apply in adaptive testing. While the use of a unidimensional model for the PNT is an oversimplification of reality, it is a useful simplification that is well established in the field of clinical aphasiology.
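For reference, the 1-PL (Rasch) model expresses the probability of a correct response as a logistic function of the difference between person ability and item difficulty. The short sketch below shows the basic form; the parameter values are illustrative, not the calibrated PNT parameters.

```python
import math

# 1-PL (Rasch) model: the probability of a correct response depends only on
# the difference between person ability (theta) and item difficulty (b),
# both expressed on the same logit scale.
def p_correct(theta, b):
    return 1 / (1 + math.exp(-(theta - b)))

# A person whose ability equals an item's difficulty has a 50% chance of a
# correct response; easier items (lower b) yield higher probabilities.
print(p_correct(0.0, 0.0))   # 0.5
print(p_correct(0.0, -1.0))  # ~0.731
```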

A third obstacle to implementing IRT models in clinical assessment is that they require relatively large amounts of data (i.e., large study samples) to support testing of model assumptions, validation of model predictions, and robust estimation of item parameters. Fortunately, the Moss Aphasia Psycholinguistics Project Database (Mirman et al., 2010) provided an excellent foundation for the line of research that has led to the present PNT-CAT application.

A fourth issue concerns a specific limitation of the current PNT-CAT application. A computer adaptive test requires a calibrated item bank from which items can be selected for administration. In the current context, an item bank is a set of items that have been shown to respond to a single underlying trait or ability (i.e., that is unidimensional) and that have been located on a common metric or scale. In the 1-PL IRT model, locating items on a common scale involves estimating their difficulty relative to one another and to the sample of individuals providing the response data, a process referred to as calibration. One issue for the present PNT-CAT application is that the item bank is limited to the 175 items in the full PNT. To function optimally, a CAT requires a large number of items distributed across the full range of difficulty. The PNT item bank contains a relatively large number of items in the lower-middle range of the difficulty/ability scale (approximately 40 to 55 on the T-score scale), but it has somewhat fewer items below 40 that can be targeted to individuals with severe anomia and very few items at the upper end of the scale that can be administered to individuals with very mild anomia.

Practically, this means that if a person with severe anomia is being given the PNT-CAT30 and has responded incorrectly to the 10 easiest items in the PNT (e.g., ‘cat’, ‘key’, ‘eye’, ‘bed’, ‘ear’, ‘nose’, ‘baby’, ‘hand’, ‘hat’, and ‘tree’), it becomes increasingly unlikely that they will get each subsequent item correct, and each new item provides progressively less information. Likewise, when a person with mild aphasia has responded correctly to the 10 hardest items (‘stethoscope’, ‘microscope’, ‘pyramid’, ‘binoculars’, ‘volcano’, ‘ambulance’, ‘dinosaur’, ‘cheerleaders’, ‘thermometer’, and ‘slippers’), it becomes increasingly likely that they will get each subsequent item correct, and continuing to administer additional items provides progressively less information.

At the same time, it should be noted that persons with aphasia commonly experience substantial moment-to-moment variability in their naming behavior, and unexpected correct and incorrect responses can and do occur at rates that are somewhat greater than predicted by the 1-PL model (Fergadiotis et al., 2015; Huston, 2021). Despite this variability, we have found that the PNT-CAT30 produces scores that are highly correlated with the full PNT (r = 0.95; Fergadiotis, Hula, Swiderski, Lei, & Kellough, 2019) and that the agreement between the PNT-CAT30 and the PNT-CATVL is also high (r = 0.89; Hula, Fergadiotis, Swiderski, Silkes, & Kellough, 2019).
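The diminishing returns described above can be made concrete with the 1-PL item information function, I(θ) = P(θ)(1 − P(θ)), which peaks when an item's difficulty matches the examinee's ability and falls off as the mismatch grows. The sketch below uses illustrative difficulty values on the logit scale, not the calibrated PNT parameters.

```python
import math

def p_correct(theta, b):
    # 1-PL response probability, as in the earlier sketch
    return 1 / (1 + math.exp(-(theta - b)))

def item_information(theta, b):
    # Fisher information of a 1-PL item: I(theta) = P * (1 - P),
    # maximal (0.25) when ability exactly matches item difficulty
    p = p_correct(theta, b)
    return p * (1 - p)

# For a severely impaired examinee (theta = -2), a difficulty-matched item
# is far more informative than a much harder one, which is why continuing
# to administer mismatched items yields progressively less information.
for b in [-2.0, 0.0, 2.0]:
    print(f"b = {b:+.1f}: information = {item_information(-2.0, b):.3f}")
# b = -2.0: 0.250, b = +0.0: 0.105, b = +2.0: 0.018
```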