When multiple forms of a test are used — across administrations, years, or testing programs — raw scores on different forms are generally not interchangeable because the forms differ in difficulty. Test score equating is the statistical process of adjusting scores on one form so that they are comparable to scores on another form. The goal is to ensure that examinees are neither advantaged nor disadvantaged by the particular form they happen to receive.
Requirements and Properties
Lord (1980) and later authors specified five requirements for equating: the two forms must measure the same construct, they must be equally reliable, the transformation must be symmetric (the function mapping Form Y to Form X is the inverse of the function mapping Form X to Form Y), it should be population-invariant, and it should satisfy equity, meaning it is a matter of indifference to each examinee which form is administered. These requirements are stringent, and in practice, equating methods satisfy them only approximately. When the forms differ substantially in content, the procedure is more accurately described as linking rather than equating.
Linear Equating
Linear equating maps Form X scores to the Form Y scale by matching means and standard deviations: eqY(x) = μ_Y + (σ_Y/σ_X)(x − μ_X), where μ and σ denote the mean and standard deviation of scores on each form. A score one standard deviation above the Form X mean is thus equated to the score one standard deviation above the Form Y mean.
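A minimal sketch of this mean-and-SD matching, assuming NumPy and hypothetical toy score arrays (the function name `linear_equate` is mine, not a library API):

```python
import numpy as np

def linear_equate(x, scores_x, scores_y):
    """Map a Form X score onto the Form Y scale by matching means and SDs:
    eqY(x) = mu_Y + (sd_Y / sd_X) * (x - mu_X)."""
    mu_x, sd_x = scores_x.mean(), scores_x.std()
    mu_y, sd_y = scores_y.mean(), scores_y.std()
    return mu_y + (sd_y / sd_x) * (x - mu_x)

# Toy data: Form Y runs 10 points higher with the same spread, so a
# Form X score of 55 should equate to 65 on the Form Y scale.
form_x = np.array([40.0, 50.0, 60.0])
form_y = np.array([50.0, 60.0, 70.0])
print(linear_equate(55.0, form_x, form_y))  # -> 65.0
```

Because the toy distributions differ only in location, the transformation reduces to a constant shift; unequal standard deviations would also rescale the score.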
Equipercentile Equating
Equipercentile equating maps scores with equal percentile ranks. A score x on Form X is equated to the score y on Form Y that has the same percentile rank in the population. Formally, eqY(x) = F⁻¹_Y(F_X(x)), where F_X and F_Y are the cumulative distribution functions on the two forms. This method is more flexible than linear equating because it does not assume that the score distributions have the same shape.
where F_X(x) is the proportion of examinees scoring ≤ x on Form X, and F⁻¹_Y is the inverse CDF of Form Y scores.
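The definition can be sketched directly with empirical CDFs. This is illustrative only (the arrays are hypothetical, and operational equating first smooths the discrete score distributions rather than using raw percentile ranks):

```python
import numpy as np

def equipercentile_equate(x, scores_x, scores_y):
    """Equate a Form X score via eqY(x) = F_Y^{-1}(F_X(x)), using the
    empirical CDF of Form X and the empirical quantiles of Form Y."""
    p = np.mean(scores_x <= x)       # F_X(x): percentile rank of x on Form X
    return np.quantile(scores_y, p)  # F_Y^{-1}(p): Form Y score at that rank

# Toy data: every Form Y score is 10 points above the Form X score
# with the same rank, so a Form X score of 50 maps near 60 on Form Y.
form_x = np.arange(1.0, 101.0)   # scores 1..100
form_y = np.arange(11.0, 111.0)  # scores 11..110
print(equipercentile_equate(50.0, form_x, form_y))  # -> 60.5
```

Note that the result (60.5 rather than exactly 60) reflects interpolation in the empirical quantile function; smoothing the score distributions before equating reduces such artifacts.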
Equating requires a design that links the two forms. In a single-group design, the same examinees take both forms. In a random-groups design, equivalent groups are randomly assigned to different forms. In a common-item nonequivalent groups design (CINEG), different groups take different forms but share a set of anchor items. The CINEG design is most common in operational testing and requires special statistical adjustments (e.g., Tucker, Levine, or frequency estimation methods) to account for group differences.
Kernel Equating and IRT Methods
Kernel equating, developed by von Davier, Holland, and Thayer (2004), provides a unified framework that encompasses linear and equipercentile equating as special cases. It uses kernel smoothing of score distributions to produce continuous equating functions, with a bandwidth parameter that controls the trade-off between linear and equipercentile equating. IRT-based equating places examinees and items on a common latent scale, offering theoretical advantages of population invariance but requiring the IRT model to fit the data adequately.
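To make the smoothing idea concrete, here is a simplified sketch of kernel-smoothed equipercentile equating. The function names, toy data, and bandwidth h = 0.6 are my own illustrative choices, and the sketch omits the mean- and variance-preserving adjustment of the full von Davier, Holland, and Thayer framework:

```python
import numpy as np
from math import erf, sqrt

def kernel_cdf(x, points, probs, h):
    """Gaussian-kernel-smoothed CDF of a discrete score distribution:
    a probability-weighted sum of normal CDFs centered at each score point."""
    z = (x - points) / h
    phi = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])
    return float(np.dot(probs, phi))

def kernel_equate(x, pts_x, probs_x, pts_y, probs_y, h=0.6):
    """eqY(x) = F_Y^{-1}(F_X(x)) with continuous, kernel-smoothed CDFs.
    The inverse is found by bisection, since the smoothed CDF is monotone."""
    p = kernel_cdf(x, pts_x, probs_x, h)
    lo, hi = pts_y.min() - 5 * h, pts_y.max() + 5 * h
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, pts_y, probs_y, h) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy check: Form Y's score distribution is Form X's shifted up by 2
# points, so a Form X score of 5 should equate to about 7 on Form Y.
pts_x = np.arange(11.0)          # score points 0..10
pts_y = pts_x + 2.0              # same shape, shifted
probs = np.full(11, 1.0 / 11.0)  # uniform score probabilities
print(round(kernel_equate(5.0, pts_x, probs, pts_y, probs), 3))  # -> 7.0
```

The bandwidth h plays the role described above: as h grows, the smoothed CDFs approach normal distributions and the equating function approaches the linear result; as h shrinks toward zero, it approaches raw equipercentile equating.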
The choice of equating method depends on the data collection design, sample sizes, and the degree to which distributional assumptions hold. In operational testing programs, equating is performed routinely at each administration, and the accuracy of equating directly affects the fairness and comparability of reported scores. Small equating errors can have substantial consequences when scores are used for high-stakes decisions near cut-points.