When a person scores exactly 2 standard deviations below the mean on several tests, it is intuitive that the composite score that summarizes these scores should also be exactly 2 standard deviations below the mean. Our intuitions let us down in this case because the composite score is lower than 2 standard deviations below the mean. I attempt to make this “composite score extremity effect” a little more intuitive in an Assessment Service Bulletin for the Woodcock-Johnson IV.

Schneider, W. J. (2016). *Why Are WJ IV Cluster Scores More Extreme Than the Average of Their Parts? A Gentle Explanation of the Composite Score Extremity Effect* (Woodcock-Johnson IV Assessment Service Bulletin No. 7). Itasca, IL: Houghton Mifflin Harcourt.

I thank Mark Ledbetter for the invitation to write the paper and support in the writing process, Erica LaForte for patiently editing a complex first draft down to a much more readable version, and Kevin McGrew for additional thoughtful comments and suggestions for improvement on the first draft.

The bulk of the paper is not mathematical. However, the first draft had a few bells and whistles like the animated graph below, which shows how the composite score extremity effect grows larger as the average correlation among the tests decreases and the number of tests in the composite increases.
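For readers who want to see the pattern in numbers, here is a minimal sketch. It assumes every test is a z-score, the composite is an equally weighted sum, and all pairs of tests share one correlation (`mean_r`). The function name `composite_z` is illustrative and does not come from the bulletin.

```python
import math

def composite_z(z_scores, mean_r):
    """z-score of an equally weighted composite of standardized tests.

    Assumes every test is a z-score (mean 0, SD 1) and that all pairs
    of tests share the same correlation, mean_r.
    """
    k = len(z_scores)
    # Variance of a sum of k standardized scores: k variances plus
    # k * (k - 1) covariances, each equal to mean_r.
    sum_sd = math.sqrt(k + k * (k - 1) * mean_r)
    return sum(z_scores) / sum_sd

# Every test is 2 SDs below the mean; vary the number of tests and
# their average correlation.
for k in (2, 4, 8):
    for r in (0.2, 0.5, 0.8):
        print(k, r, round(composite_z([-2.0] * k, r), 2))
```

Under these assumptions the composite equals −2 only when the tests correlate perfectly. It moves further below −2 as `mean_r` falls or as more tests are added.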

Another plot that was originally animated shows our best guess of a latent variable *X* when we have two indicators, *X*_{1} and *X*_{2}, that are both exactly 2 standard deviations below the mean. *X*_{1} and *X*_{2} correlate with each other at 0.64 and with *X* at 0.8. If we only know that *X*_{1} = −2, our best guess is that *X* is −1.60. If we know that both *X*_{1} and *X*_{2} are −2, our best guess is that *X* is −1.95. Thus, our estimate is lower with two scores (−1.95) than with one score (−1.60).
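The two estimates can be reproduced with a short sketch. It assumes standardized variables and a linear regression of *X* on indicators that all correlate equally with one another and with *X*, in which case every indicator receives the same weight. The function name `best_guess` and its parameters are illustrative only.

```python
def best_guess(z, r_indicators, r_latent, n_indicators):
    """Regression estimate of latent X from n indicators that all equal z.

    Assumes standardized variables, indicators that all correlate
    r_indicators with one another, and r_latent with X.
    """
    # With equal correlations every indicator gets the same weight:
    # beta = r_latent / (1 + (n - 1) * r_indicators)
    beta = r_latent / (1 + (n_indicators - 1) * r_indicators)
    return n_indicators * beta * z

one_score = best_guess(-2.0, 0.64, 0.8, 1)   # only X1 is known
two_scores = best_guess(-2.0, 0.64, 0.8, 2)  # X1 and X2 are both known
print(round(one_score, 2), round(two_scores, 2))  # -1.6 -1.95
```

Each indicator's weight drops from 0.8 to about 0.49 when the second score is added, but the two weights together sum to about 0.98, so the estimate regresses less toward the mean.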

awesome

The compounding effect of multiple, extreme scores is inversely related to their intercorrelation. A student gifted in both Math and Verbal skills, for example, is obviously more unusual than the average of those percentile scores. However, multiple measures within a single highly intercorrelated domain, such as short-term auditory memory, are presumed to be measuring the same factor. The additional testing will reduce potential regression to the mean and slightly shrink the standard error of measurement. Both of these factors lower the risk of a type I error arising from failing to adjust alpha for multiple measurement, something rarely mentioned when testing multiple cognitive factors under the PSW model.

For these reasons, I routinely average test scores that are subsumed under the same small-G factor. It is not an exact correction for multiple measurement, but it at least runs in the correct direction. Those who artificially, and quite unnecessarily, adjust within-factor multiple scores to a more extreme composite figure are going in the wrong direction. They should not be surprised when they overidentify LD students.