# Why averaging multiple IQ scores is incorrect in death penalty cases

As I have explained elsewhere on this blog, when a person has been given multiple IQ tests, it is common practice to take the mean IQ or median IQ to determine eligibility for the death penalty. As long as all the scores are valid estimates, combining multiple scores results in more accurate measurement.

Unfortunately, taking the mean or median IQ score is one of those solutions that is simple, neat, and wrong. Why? In the graph below, there are two IQ tests that correlate at 0.9. On each test, the population mean is μ = 100 and the standard deviation is σ = 15. On either test alone, about 2.3% of people score 70 or less, the typical threshold at or below which a person is ineligible for the death penalty. What percent of people score 70 or less on the average of the 2 tests? About 2.0%. Why is it 2.0% instead of 2.3%? Because the tests, though highly correlated, are not perfectly correlated. The average of the 2 tests has a population mean of μ = 100, but its standard deviation is smaller than 15: σ = 15 × √((1 + 0.9) / 2) ≈ 14.62. Because the average has a smaller standard deviation, fewer people fall at or below the threshold of 70 than would if just one test had been given.
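For readers who want to check these percentages, here is a minimal sketch using only Python's standard library. It assumes both scores are normally distributed with the mean, standard deviation, and correlation stated above; the variable names are mine and purely illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values stated in the text
MEAN, SD, R = 100.0, 15.0, 0.9
CUTOFF = 70.0

# SD of the simple average of two equally scaled tests that correlate at R
sd_of_average = SD * sqrt((1 + R) / 2)

# Percent of the population at or below the cutoff (normality assumed)
pct_single = 100 * NormalDist(MEAN, SD).cdf(CUTOFF)
pct_average = 100 * NormalDist(MEAN, sd_of_average).cdf(CUTOFF)

print(f"SD of the average: {sd_of_average:.2f}")               # 14.62
print(f"At or below 70 on one test: {pct_single:.1f}%")        # 2.3%
print(f"At or below 70 on the average: {pct_average:.1f}%")    # 2.0%
```

The gap between the two percentages shrinks as the correlation approaches 1 and widens as it falls.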

There is an established procedure for rescaling a composite score so that it has the correct mean and standard deviation. It is the same procedure that was applied to the IQ subtest scores in the calculation of the full scale IQ. This same procedure should be applied when multiple IQ scores have been given.

Assuming that all the IQ scores have a mean of μ = 100 and a standard deviation of σ = 15, the composite IQ of k scores is: $\text{Composite IQ}=\dfrac{\text{Sum of the IQ scores}-100k}{\sqrt{\text{Sum of the correlation matrix}}}+100$
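The denominator may look mysterious, so here is a short sketch of where it comes from. The symbols are mine, introduced only for illustration: $X_i$ is the $i$-th IQ score and $r_{ij}$ is the correlation between scores $i$ and $j$ (with $r_{ii}=1$). The variance of a sum of $k$ scores, each with a standard deviation of 15, is

$$\operatorname{Var}\left(\sum_{i=1}^{k}X_i\right)=\sum_{i=1}^{k}\sum_{j=1}^{k}15^2\,r_{ij}=15^2\sum_{i,j}r_{ij}$$

Converting the sum to a z-score and then placing it back on the IQ metric cancels the 15:

$$z=\dfrac{\sum_i X_i-100k}{15\sqrt{\sum_{i,j}r_{ij}}},\qquad \text{Composite IQ}=15z+100=\dfrac{\sum_i X_i-100k}{\sqrt{\sum_{i,j}r_{ij}}}+100$$

With 2 tests that correlate at 0.9, the sum of the correlation matrix is 1 + 0.9 + 0.9 + 1 = 3.8.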

In the graph above, the diagonal axis represents the composite IQ with the proper scaling so that the composite IQ has a mean of 100 and a standard deviation of 15 (instead of 14.62). As stated previously, if the 2 IQ tests were simply averaged, only about 2.0% score 70 or less. On a properly scaled IQ score, 2.0% corresponds to an IQ of 69.

Does 1 point matter? It does to the person whose average on the 2 IQ tests is 71. With the scores properly rescaled, that person would have a composite IQ of 70 (about 70.2 before rounding) and thus would be deemed ineligible for execution.
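To check the arithmetic, the rescaling formula can be sketched in a few lines of Python. The function name `composite_iq` and its parameters are hypothetical, chosen for illustration; the sketch assumes, as the formula does, that every score has a mean of 100 and a standard deviation of 15 and is weighted equally.

```python
from math import sqrt

def composite_iq(scores, corr, mean=100.0):
    """Rescaled composite of equally weighted IQ scores.

    Assumes every score has the same mean and an SD of 15, so the 15s
    cancel as in the formula in the text. `corr` is the full k-by-k
    correlation matrix among the scores, diagonal included.
    """
    k = len(scores)
    if len(corr) != k or any(len(row) != k for row in corr):
        raise ValueError("corr must be a k-by-k matrix matching scores")
    sum_of_corr = sum(sum(row) for row in corr)
    return (sum(scores) - mean * k) / sqrt(sum_of_corr) + mean

# Two tests correlating at 0.9, with a score of 71 on each
corr = [[1.0, 0.9],
        [0.9, 1.0]]
scores = [71, 71]

simple_average = sum(scores) / len(scores)
rescaled = composite_iq(scores, corr)

print(f"Simple average: {simple_average:.0f}")      # 71
print(f"Rescaled composite: {rescaled:.2f}")        # 70.25
```

The same function accepts more than 2 scores, provided the correlations among all pairs of tests are known or can be reasonably estimated.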

Your intuition might be telling you that something is fishy about all this. Does this mean that whenever someone scores 71 on an IQ test, just missing the threshold, another test should be given, in the hope of another score of 71 so that the composite score is 70? The answer is that your intuition (and mine) is often unreliable when it comes to probability. As I have explained in this video, most people who score 71 on one IQ test score higher than 71 on a second IQ test. As long as all the scores are properly rescaled, the composite IQ is more accurate and nothing fishy is happening.
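The claim about retesting can be illustrated with a rough calculation. This sketch is not the analysis from the video; it assumes the 2 scores follow a bivariate normal distribution with the correlation of 0.9 used in the example above, and it ignores practice effects.

```python
from math import sqrt
from statistics import NormalDist

# Values from the example in the text; bivariate normality is an assumption
MEAN, SD, R = 100.0, 15.0, 0.9
first_score = 71.0

# Conditional distribution of the second score given the first:
# the expected score regresses toward the mean, and the spread shrinks
expected_second = MEAN + R * (first_score - MEAN)
sd_second = SD * sqrt(1 - R**2)

# Probability that the second score is higher than the first
p_higher = 1 - NormalDist(expected_second, sd_second).cdf(first_score)

print(f"Expected second score: {expected_second:.1f}")          # 73.9
print(f"Chance the second score exceeds 71: {p_higher:.0%}")    # 67%
```

Under these assumptions, roughly two out of three people who score 71 on the first test score higher on the second, so a second score of 71 is real evidence and not a trick of the arithmetic.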

This procedure should not be applied mechanically in all situations. The method assumes that each score is equally valid and thus has equal weight. There are reasons to prefer some IQ administrations over others (e.g., a full battery given by a licensed clinician is likely to be more accurate than an abbreviated IQ test given by a first-year graduate student). If there are reasons to dismiss a particular score (e.g., the evaluee intentionally tried to obtain a low score), it should not figure into the composite score. There are further complications not discussed here such as the fact that people tend to score higher when retested with the same test (or one that is very similar).

## 4 thoughts on “Why averaging multiple IQ scores is incorrect in death penalty cases”

2. Gary L. Canivez says:

True indeed. But even more problematic are results from a newly published study (McDermott, Watkins, & Rhoad, 2013, November), which estimated that 12.5% of the variability in Wechsler Intelligence Scale for Children-Fourth Edition (WISC-IV; Wechsler, 2003) scores was due to examiner variability and not child variability! Thus, the commonly reported standard error of measurement, which is derived from estimates of internal consistency, is insufficient, especially near clinical decision boundaries like that of Intellectual Disability. This article is a must-read for anyone providing cognitive assessment, as it is likely that this problem is not unique to the WISC-IV.

McDermott, P. A., Watkins, M. W., & Rhoad, A. M. (2013, November 4). Whose IQ Is It?—Assessor Bias Variance in High-Stakes Psychological Assessment. Psychological Assessment. Advance online publication. doi: 10.1037/a0034832

• W. Joel Schneider says:

I agree. It is a must-read article. Very sobering.