Item Response Theory Method and Application Gaining Support as Assessment Instrument

Patrick A. Palmieri
University of Illinois at Urbana-Champaign

Most psychological assessment instruments, including those that measure trauma and its consequences, have been developed using what is known as classical test theory (CTT). Many researchers may not be aware, however, that CTT carries with it certain disadvantages. One disadvantage is that characteristics of the items (and the scale as a whole) depend upon the particular sample on which they are calculated. In addition, a person's score on a measure of a particular trait or construct (e.g., PTSD) depends upon the specific items that are selected. Therefore, everyone must be administered the same item set if they are to be placed on the same dimension or metric.

Measurement based on what is known as item response theory (IRT) does not carry these disadvantages. In IRT, item characteristics are not sample-dependent and scores calculated on IRT-based information are not linked to a specific item set. This more contemporary approach has garnered increasing support from psychometricians in recent years.

IRT is a model-based version of test theory that uses a mathematical function to describe the relation between a person's standing on a latent trait and his/her item responses. When an appropriate model is selected, the likelihood that a person will respond to an item in the keyed direction is a function of the person's standing on the underlying construct and the item's difficulty and discriminability. All of these parameters are estimated through maximum likelihood methods and graphically represented in an item characteristic curve (ICC; see Figure 1).
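As an illustration of such a mathematical function, one commonly used form is the two-parameter logistic model. The article does not name a specific model, so this choice and the symbols below are assumptions: \theta_j is person j's standing on the latent trait, b_i is the difficulty of item i, and a_i is its discriminability.

```latex
P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\left[-a_i(\theta_j - b_i)\right]}
```

Under this form, the probability of responding in the keyed direction equals .50 exactly when \theta_j = b_i, and larger values of a_i produce a steeper curve around that point.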

Item difficulty refers to the trait level at which an individual has a 50 percent chance of endorsing the item; typically it is the point at which the curve is steepest. The example item in Figure 1 is rather difficult, as only individuals with moderate to high standing on the latent trait are likely to endorse it (note that a probability of endorsement of .50 corresponds to the value 1.5 on the horizontal axis). Item discriminability refers to how well the item differentiates between individuals of different latent trait levels and is reflected in the slope of the curve at different regions along the horizontal axis. The ICC example portrays an item that is poor at discriminating between individuals with low and moderate levels of the latent trait (note the flat slope between -3 and -1 on the horizontal axis) but good at discriminating between individuals with moderate and high trait levels (note the steeper slope at higher trait values). In IRT practice, every item is administered to a large sample and calibrated so that the ICC and its parameters are known.
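The relation between difficulty, discriminability, and the shape of the curve can be sketched in a few lines of Python. This is a minimal illustration that assumes a two-parameter logistic curve; the difficulty of 1.5 matches the example item described above, while the discrimination value of 1.5 and the function names are hypothetical choices, not values taken from Figure 1.

```python
import math

def icc(theta, a, b):
    """Probability of endorsing an item under an assumed two-parameter
    logistic model with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def icc_slope(theta, a, b):
    """Slope of the logistic ICC at theta, which equals a * P * (1 - P)."""
    p = icc(theta, a, b)
    return a * p * (1.0 - p)

# Hypothetical parameters: difficulty 1.5 as in the example item,
# discrimination 1.5 chosen only for illustration.
A, B = 1.5, 1.5

# Print the endorsement probability and local slope at several trait levels.
for theta in (-3.0, -1.0, 0.0, 1.5, 3.0):
    print(f"theta={theta:+.1f}  P={icc(theta, A, B):.3f}  "
          f"slope={icc_slope(theta, A, B):.3f}")
```

With these assumed values, the slope is close to zero at low trait levels and largest at the difficulty value, which mirrors the pattern of poor discrimination at the low end and good discrimination at the high end.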

IRT is useful for interpreting mean differences across groups (e.g., gender, race, trauma type). If a score difference is observed between two or more racial groups on a PTSD measure, for example, it could be due to a true group difference or to one or more items that function differently in the groups. Such differential item functioning or item bias might indicate that the PTSD construct is not the same across groups. Item bias is exposed when individuals from different groups have the same standing on the latent trait but different ICCs and, hence, different probabilities of endorsing an item. Thus, ruling out item bias or establishing measurement equivalence is important before concluding that true group differences exist. IRT, unlike CTT, is well suited for this task because its item characteristics are sample-independent.
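The idea of item bias can be made concrete with a small sketch. It assumes a two-parameter logistic curve and uses made-up item parameters for two groups labeled "reference" and "focal"; the group labels, parameter values, and function names are all hypothetical. It shows only what biased item parameters imply, not a statistical test for detecting bias.

```python
import math

def icc(theta, a, b):
    """Endorsement probability under an assumed two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical calibrations of the same item in two groups. The item is
# "harder" (higher b) in the focal group, so it functions differently.
item_by_group = {
    "reference": {"a": 1.2, "b": 0.0},
    "focal": {"a": 1.2, "b": 0.8},
}

def endorsement_gap(theta, params=item_by_group):
    """Difference in endorsement probability between two people who have
    the same trait level but belong to different groups."""
    p_ref = icc(theta, **params["reference"])
    p_focal = icc(theta, **params["focal"])
    return p_ref - p_focal

# An unbiased item would show a gap of zero at every trait level.
for theta in (-1.0, 0.0, 1.0):
    print(f"theta={theta:+.1f}  gap={endorsement_gap(theta):.3f}")
```

A nonzero gap at a fixed trait level is what distinguishes item bias from a true group difference, which would instead appear as a difference in the groups' trait distributions.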

IRT also is the basis for computerized adaptive testing (CAT), a real-time procedure for tailoring a test by administering only those items that will provide maximum information about an individual's standing on the latent trait. To accomplish this, the person's latent trait level is estimated after each item response on the test. Then, typically, the computer administers an item whose difficulty parameter is near the current trait estimate and which, thus, has about a 50 percent chance of endorsement. This procedure is repeated until the standard error of measurement for the person's trait level is below some acceptable level. By excluding items that are too easy or too difficult for a person, measurement precision is enhanced even as test length is shortened. With IRT, unlike CTT, individuals do not have to take the same items in order to be scored on a common metric with all other test-takers.
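The adaptive loop described above can be sketched as follows. This is a simplified illustration under several assumptions: a two-parameter logistic model, a hypothetical pre-calibrated item bank, and a grid-based posterior mean with a standard normal prior standing in for the trait estimate (a pure maximum likelihood estimate is undefined while all responses are identical). The stopping threshold of 0.4, the item cap, and all names are illustrative choices only.

```python
import math
import random

def icc(theta, a, b):
    """Endorsement probability under an assumed two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

GRID = [g / 10.0 for g in range(-40, 41)]  # candidate trait levels, -4 to 4

def estimate_theta(responses):
    """Posterior mean and standard deviation of the trait level on GRID,
    given (a, b, endorsed) triples and a standard normal prior."""
    weights = []
    for t in GRID:
        log_w = -0.5 * t * t  # log of the (unnormalized) normal prior
        for a, b, endorsed in responses:
            p = icc(t, a, b)
            log_w += math.log(p if endorsed else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_target=0.4, max_items=20):
    """Administer items until the standard error falls to se_target,
    max_items have been given, or the bank is exhausted."""
    remaining = list(bank)
    responses = []
    theta, se = 0.0, 1.0  # start at the prior mean and prior SD
    while remaining and len(responses) < max_items and se > se_target:
        # Pick the unused item whose difficulty is nearest the current estimate.
        item = min(remaining, key=lambda ab: abs(ab[1] - theta))
        remaining.remove(item)
        a, b = item
        responses.append((a, b, answer(a, b)))
        theta, se = estimate_theta(responses)
    return theta, se, len(responses)

# Hypothetical bank of 25 items with difficulties from -3 to 3, and a
# simulated respondent whose true trait level is 1.0.
random.seed(0)
bank = [(1.5, b / 4.0) for b in range(-12, 13)]
true_theta = 1.0
simulee = lambda a, b: random.random() < icc(true_theta, a, b)

theta_hat, se, n_used = run_cat(bank, simulee)
print(f"estimate={theta_hat:.2f}  se={se:.2f}  items used={n_used}")
```

Operational CAT systems usually select items by maximum information rather than nearest difficulty. The two rules coincide when all items share the same discrimination, as in this hypothetical bank.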

Although IRT analyses require minimum sample sizes in the hundreds, the statistical assumptions are relatively easy to satisfy, and the necessary software is becoming increasingly available and user-friendly. Scientific Software International (www.ssicentral.com) markets BILOG, one of the most commonly used IRT programs.

For more information about IRT and its applications, see Item Response Theory for Psychologists, by Embretson and Reise, published by Lawrence Erlbaum Associates Inc., 2000; and Fundamentals of Item Response Theory, by Hambleton, Swaminathan and Rogers, published by Sage Publications, 1991. Additional information, including tutorials, can be found on the web at http://work.psych.uiuc.edu/irt and http://ericae.net/irt.

This brief report is sponsored by ISTSS's Special Interest Group on Research Methodology. If you are interested in becoming a member of this SIG, contact chairs Daniel and Lynda King at king.daniel@boston.va.gov or lking@world.std.com. Palmieri will be presenting a workshop on IRT at the upcoming annual meeting of ISTSS in Baltimore, Md.