# IQ, the death penalty, and me

Today the Supreme Court of the United States ruled that, in death penalty cases, the state of Florida must take into account the inherent imprecision of IQ tests.

Why are IQ tests used in death penalty cases? It is unconstitutional to execute a person deemed to be intellectually disabled. (Intellectual disability is the current term for what was previously known as mental retardation.) Diagnosing intellectual disability is a complex matter, but the diagnosis hinges to a large degree on the person’s performance on a well-constructed IQ test. Although high-quality IQ tests are more reliable than most psychological measures, even the best IQ tests are imperfectly precise. There is a potentially large risk that a person with an observed score slightly above the threshold set by Florida law may have a “true score” that is below the threshold.

It was an unexpected honor to have my work cited in both the court’s decision (written by Justice Kennedy) and the dissenting opinion (written by Justice Alito). My contribution to the argument (relevant portion reproduced here) is a technical one and played an admittedly small role in the proceedings. My main point was that when multiple IQ tests have been administered to the same individual, we should not average the scores but combine them into a composite score in the same way that we combine psychological scores in any other context. Doing so gives a more accurate estimate of the IQ and a smaller confidence interval around the score. I hope that the application of this procedure results in fewer incorrect decisions and a fairer administration of justice.

I am grateful to Cecil Reynolds for giving me the opportunity to write the paper and to Kevin McGrew for encouraging me to re-write and publish the argument on the web, demonstrating its application to death penalty cases. Although it was the published chapter that was cited and used by the defense, it was the free web version that initially caught the attention of the law firm representing the defendant.

# Why composite scores are more extreme than the average of their parts

Suppose that two tests have a correlation of 0.6. On both tests an individual obtained an index score of 130, which is 2 standard deviations above the mean. If both tests are combined, what is the composite score?

Our intuition is that if both tests are 130, the composite score is also 130. Unfortunately, taking the average is incorrect. In this example, the composite score is actually 134. How is it possible that the composite is higher than both of the scores?

If I measure the length of a board twice or if I take the temperature of a sick child twice, the average of the results is probably the best estimate of the quantity I am measuring. Why can’t I do this with standard scores?

Standard scores do not behave like many of our most familiar units of measurement. Degrees Celsius have meaning in reference to a standard, the temperature at which water freezes at sea level. In contrast, standard scores do not have meaning compared to some absolute standard. Instead, the meaning of a standard score derives from its position in the population distribution. One way to describe the position of a score is its distance from the population mean. The size of this distance is then compared to the standard deviation, which is how far scores typically are from the population mean (more precisely, the standard deviation is the square root of the average squared distance from the mean). Thus, the “standard” to which standard scores are compared is the combination of the population mean and standard deviation.

An index score of 130 is 2 standard deviations above the mean of 100.

The average of two imperfectly correlated index scores is not an index score. Its standard deviation is smaller than 15 and thus our sense of what index scores mean does not apply to the average of two index scores. To make sense of the composite score, we must convert it into an index score that has a standard deviation of 15.
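
To see why the average has a smaller standard deviation, here is a short worked calculation for this example. It assumes only the usual rule for the variance of a sum; the symbols $X_1$, $X_2$, and $r$ (the two index scores and their correlation, here 0.6) are introduced for illustration.

$\sigma_{\text{average}} = \sqrt{\mathrm{Var}\left(\dfrac{X_1+X_2}{2}\right)} = \dfrac{\sqrt{15^2+15^2+2r\cdot 15^2}}{2} = 15\sqrt{\dfrac{1+r}{2}} = 15\sqrt{0.8}\approx 13.4$

Only when $r = 1$ does the average keep the full standard deviation of 15.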

$\dfrac{(130+130-2*100)}{\sqrt{2+2*0.6}}+100\approx 134$

How is this possible? It is unusual for someone to score 130. It is even more unusual for someone to score 130 on two tests that are imperfectly correlated. The less correlated the tests, the more unusual it is to score high on both tests.
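
The two-test formula above can be sketched in a few lines of Python. This is only an illustration: the function name `composite_of_two` is hypothetical, and it assumes both scores are already index scores (mean 100, standard deviation 15).

```python
import math

def composite_of_two(score_a, score_b, r, mean=100.0):
    """Composite index score for two index scores that correlate r."""
    # The sum of two index scores has SD 15 * sqrt(2 + 2r), so dividing the
    # summed deviations by sqrt(2 + 2r) puts the result back on an SD of 15.
    return (score_a + score_b - 2 * mean) / math.sqrt(2 + 2 * r) + mean

high = composite_of_two(130, 130, 0.6)
print(round(high))  # 134
```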

Below is a geometric representation of this phenomenon. Correlated tests can be graphed with oblique axes (as is done in factor analyses with oblique rotations). The correlation is the cosine of the angle between the axes. As seen below, the lower the correlation, the more extreme the composite. As the correlation approaches 1, the composite approaches the average of the scores.

The lower the correlation, the more extreme the composite score.
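
For readers who prefer numbers to pictures, the following Python sketch tabulates the same relationship for two scores of 130. It is an illustration under the assumptions already stated (index scores, a known correlation); the function and variable names are hypothetical. The angle is computed as the arccosine of the correlation.

```python
import math

def composite_of_two(score_a, score_b, r, mean=100.0):
    """Composite index score for two index scores that correlate r."""
    return (score_a + score_b - 2 * mean) / math.sqrt(2 + 2 * r) + mean

correlations = [0.0, 0.3, 0.6, 0.9, 1.0]
composites = [composite_of_two(130, 130, r) for r in correlations]

for r, comp in zip(correlations, composites):
    # The correlation is the cosine of the angle between the oblique axes.
    angle = math.degrees(math.acos(r))
    print(f"r = {r:.1f}   angle = {angle:5.1f} degrees   composite = {comp:.1f}")
```

The composite shrinks toward 130, the simple average, as the correlation rises toward 1.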

If the scores are lower than the population mean, the composite score is lower than the average of the parts. For example, if the two scores are 71, and the correlation between the scores is 0.9, the composite score is 70.

When the subtest scores are below the mean, the composite score is lower than the average of the subtest scores.

In a previous post, I presented this material in greater detail.

# Can’t Decide Which IQ Is Best? Make a Composite Score.

A man with a watch knows what time it is. A man with two watches is never sure.

-Segal’s Law

Suppose you have been asked to settle a matter with important implications for an evaluee.[1] A young girl was diagnosed with mental retardation [now called intellectual disability] three years ago. Along with low adaptive functioning, her Full Scale IQ was a 68, two points under the traditional line used to diagnose [intellectual disability]. Upon re-evaluation two months ago, her IQ, derived from a different test, was now 78. Worried that their daughter would no longer qualify for services, the family paid out of pocket to have their daughter evaluated by another psychologist and the IQ came out as 66. Because of your reputation for being fair-minded and knowledgeable, you have been asked to decide which, if any, is the real IQ. Of course, there is no such thing as a “real IQ” but you understand what the referral question is.

You give a different battery of tests and the girl scores a 76. Now what should be done? It would be tempting to assume that, “Other psychologists are sloppy, whereas my results are free of error.” However, you are fair minded. You know that all scores have measurement error and you plot the scores and their 95% confidence intervals as seen in Figure 1.

Figure 1

Recent IQ Scores and their 95% Confidence Intervals from the Same Individual

It is clear that Test C’s confidence interval does not overlap with those of Tests B and D. Is this kind of variability in scores unusual?[2] There are two tests that indicate an IQ in the high 60’s and two tests that indicate an IQ in the high 70’s. Which pair of tests is correct? Should the poor girl be subjected to yet another test that might act as a tie breaker?

Perhaps the fairest solution is to treat each IQ test as subtests of a much larger “Mega-IQ Test.” That is, perhaps the best that can be done is to combine the four IQ scores into a single score and then construct a confidence interval around it.

Where should the confidence interval be centered? Intuitively, it might seem reasonable to simply average all four IQ results and say that the IQ is 72. However, this is not quite right. Averaging scores gives a rough approximation of a composite score, but it is less accurate for low and high scorers than it is for scorers near the mean. An individual’s composite score is further away from the population mean than the average of the individual’s subtest scores. About 3.1% of people score 72 or lower on a single IQ test (assuming perfect normality). However, if we were to imagine a population of people who took all four IQ tests in question, only about 2.0% of them would have an average score of 72 or lower. That is, it is more unusual to have a mean IQ of 72 across several tests than it is to score 72 on any particular IQ test. Another way to think about this issue is to recognize that the mean score cannot be interpreted as an IQ score because it has a smaller standard deviation than IQ scores have. To make it comparable to IQ, it needs to be rescaled so that it has a “standard” standard deviation of 15.
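
The two percentages can be checked with a short Python sketch. It assumes perfect normality and uses the correlation matrix for the four tests that is given in Step 3 of this post; the variable names are hypothetical.

```python
from statistics import NormalDist

# Correlations among the four tests (the matrix used in Step 3 of this post).
R = [
    [1.00, 0.80, 0.75, 0.85],
    [0.80, 1.00, 0.70, 0.71],
    [0.75, 0.70, 1.00, 0.78],
    [0.85, 0.71, 0.78, 1.00],
]
k = len(R)
sum_r = sum(sum(row) for row in R)

sd_single = 15.0
# The sum of k index scores has SD 15 * sqrt(sum of R); the mean divides by k.
sd_mean = 15.0 * sum_r ** 0.5 / k

p_single = NormalDist(100, sd_single).cdf(72)  # one test at or below 72
p_mean = NormalDist(100, sd_mean).cdf(72)      # four-test average at or below 72
print(f"single test: {p_single:.1%}   average of four: {p_mean:.1%}")
```

Because the average has a standard deviation of roughly 13.6 rather than 15, a mean of 72 sits further out in its own distribution than a single score of 72 does.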

Here is a good method for computing a composite score and its accompanying 95% confidence interval. It is not nearly as complicated as it might seem at first glance. This method assumes that you know the reliability coefficients of all the scores and you know all the correlations between the scores. All scores must be index scores (μ = 100, σ = 15). If they are not, they can be converted using this formula:

$\text{Index Score} = 15(\dfrac{X-\mu}{\sigma})+100$
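
As a quick illustration of the conversion, here is a minimal Python sketch. The function name `to_index_score` is hypothetical, and the two example metrics (scaled scores with mean 10 and standard deviation 3, T scores with mean 50 and standard deviation 10) are included only as familiar cases.

```python
def to_index_score(x, mu, sigma):
    """Rescale a score from a metric with mean mu and SD sigma to mean 100, SD 15."""
    return 15 * (x - mu) / sigma + 100

scaled = to_index_score(7, mu=10, sigma=3)     # scaled score one SD below the mean
t_score = to_index_score(40, mu=50, sigma=10)  # T score one SD below the mean
print(scaled, t_score)  # 85.0 85.0
```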

## Computing a Composite Score

Step 1: Add up all of the scores.

In this case,

$68 + 78 + 66 + 76 = 288$

Step 2: Subtract the number of tests times 100.

In this case there are 4 tests. Thus,

$288-4 * 100 = 288-400 = -112$

Step 3: Divide by the square root of the sum of all the elements in the correlation matrix.

In this case, suppose that the four tests are correlated as such:

| | Test A | Test B | Test C | Test D |
|---|---|---|---|---|
| Test A | 1 | 0.80 | 0.75 | 0.85 |
| Test B | 0.80 | 1 | 0.70 | 0.71 |
| Test C | 0.75 | 0.70 | 1 | 0.78 |
| Test D | 0.85 | 0.71 | 0.78 | 1 |

The sum of all 16 elements, including the ones on the diagonal, is 13.18. The square root of 13.18 is about 3.63. Thus,

$-112 / 3.63 \approx -30.85$

Step 4: Complete the computation of the composite score by adding 100.

In this case,

$-30.85 + 100 = 69.15$

Given the four IQ scores available, assuming that there is no reason to favor one above the others, the best estimate is that her IQ is 69. Most of the time, there is no need for further calculation. However, we might like to know how precise this estimate is by constructing a 95% confidence interval around this score.
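
Steps 1 through 4 can be collected into one small Python function. This is a sketch under the stated assumptions (all scores are index scores and the full correlation matrix is known); the name `composite_score` is hypothetical.

```python
import math

def composite_score(scores, corr, mean=100.0):
    """Composite index score from several index scores and their correlation matrix."""
    k = len(scores)
    summed_deviation = sum(scores) - k * mean  # Steps 1 and 2
    sum_r = sum(sum(row) for row in corr)      # all elements, diagonal included
    return summed_deviation / math.sqrt(sum_r) + mean  # Steps 3 and 4

scores = [68, 78, 66, 76]
corr = [
    [1.00, 0.80, 0.75, 0.85],
    [0.80, 1.00, 0.70, 0.71],
    [0.75, 0.70, 1.00, 0.78],
    [0.85, 0.71, 0.78, 1.00],
]
composite = composite_score(scores, corr)
print(round(composite))  # 69
```

Note that the composite is about three points lower than the simple average of 72.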

## Confidence Intervals of Composite Scores

Calculating a 95% confidence interval is more complicated than the calculations above but not overly so.

Step 1: Calculate the composite reliability.

Step 1a: Subtract the number of tests from the sum of the correlation matrix.

In this case, there are 4 tests. Therefore,

$13.18-4 = 9.18$

Step 1b: Add in all the test reliability coefficients.

In this case, suppose that the four reliability coefficients are 0.97, 0.96, 0.98, and 0.97. Therefore,

$9.18 + 0.97 + 0.96 + 0.98 + 0.97 = 13.06$

Step 1c: Divide by the original sum of the correlation matrix.

In this case,

$13.06 / 13.18 \approx 0.9909$

Therefore, in this case, the reliability coefficient of the composite score is higher than that of any single IQ score. This makes sense: given that we have four scores, we should know her IQ with greater precision than we would if we had only one score.
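
Steps 1a through 1c amount to a one-line formula, sketched here in Python. The function name `composite_reliability` is hypothetical; the inputs are the correlation matrix and reliability coefficients supposed above.

```python
def composite_reliability(corr, reliabilities):
    """Reliability of an equally weighted composite of standardized scores."""
    sum_r = sum(sum(row) for row in corr)  # sum of the whole correlation matrix
    k = len(reliabilities)
    # Replace the 1s on the diagonal with the reliability coefficients,
    # then divide by the original sum.
    return (sum_r - k + sum(reliabilities)) / sum_r

corr = [
    [1.00, 0.80, 0.75, 0.85],
    [0.80, 1.00, 0.70, 0.71],
    [0.75, 0.70, 1.00, 0.78],
    [0.85, 0.71, 0.78, 1.00],
]
reliabilities = [0.97, 0.96, 0.98, 0.97]
rel = composite_reliability(corr, reliabilities)
print(round(rel, 4))  # 0.9909
```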

Step 2: Calculate the standard error of the estimate by subtracting the reliability coefficient squared from the reliability coefficient and taking the square root. Then, multiply by the standard deviation, 15.

In this case,

$15\sqrt{0.9909-0.9909^2}\approx 1.4247$

Step 3: Calculate the 95% margin of error by multiplying the standard error of the estimate by 1.96.

In this case,

$1.96 * 1.4247 \approx 2.79$

The value 1.96 is the approximate z-score associated with the 95% confidence interval. If you want the z-score associated with a different margin of error, then use the following Excel formula. Shown here is the calculation of the z-score for a 99% confidence interval:

$=\mathrm{NORMSINV}(1-(1-0.99)/2)$
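
Outside of Excel, the same z-score can be obtained from Python's standard library. This is a minimal sketch; the function name `z_for_confidence` is hypothetical, and `NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided critical z-score for a confidence level such as 0.95 or 0.99."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

z95 = z_for_confidence(0.95)
z99 = z_for_confidence(0.99)
print(round(z95, 2), round(z99, 2))  # 1.96 2.58
```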

Step 4: Calculate the estimated true score by subtracting 100 from the composite score, multiplying by the reliability coefficient, and adding 100. That is,

$\text{Estimated True Score} =\text{Reliability Coefficient} * (\text{Composite} - 100) + 100$

In this case,

$0.9909*(69.15-100)+100\approx 69.43$

Step 5: Calculate the upper and lower bounds of the 95% confidence interval by starting with the estimated true score and then adding and subtracting the margin of error.

In this case,

$69.43 \pm 2.79 = 66.64 \text{ to } 72.22$

This means that we are 95% sure that her IQ is between about 67 and 72. Assuming that other criteria for mental retardation [intellectual disability] are met, this is in the range to qualify for services in most states. It should be noted that this procedure can be used for any kind of composite score, not just for IQ tests.
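
The whole procedure, from composite score to confidence interval, can be sketched as a single Python function. This is an illustration rather than a definitive implementation: the function name and its return layout are hypothetical, it assumes equally weighted index scores with known reliabilities and correlations, and because it does not round at each step its intermediate values may differ from hand calculations in the second decimal place.

```python
import math
from statistics import NormalDist

def composite_with_ci(scores, corr, reliabilities, level=0.95, mean=100.0, sd=15.0):
    """Return (composite, estimated true score, lower bound, upper bound)."""
    k = len(scores)
    sum_r = sum(sum(row) for row in corr)
    composite = (sum(scores) - k * mean) / math.sqrt(sum_r) + mean
    rel = (sum_r - k + sum(reliabilities)) / sum_r    # composite reliability
    see = sd * math.sqrt(rel - rel ** 2)              # standard error of the estimate
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)     # about 1.96 for 95%
    true_score = rel * (composite - mean) + mean      # estimated true score
    return composite, true_score, true_score - z * see, true_score + z * see

corr = [
    [1.00, 0.80, 0.75, 0.85],
    [0.80, 1.00, 0.70, 0.71],
    [0.75, 0.70, 1.00, 0.78],
    [0.85, 0.71, 0.78, 1.00],
]
composite, true_score, lower, upper = composite_with_ci(
    [68, 78, 66, 76], corr, [0.97, 0.96, 0.98, 0.97]
)
print(round(lower), round(upper))  # 67 72
```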

[2] This degree of profile variability is not at all unusual. In fact, it is quite typical. A statistic called the Mahalanobis Distance (Crawford & Allen, 1994) can be used to estimate how typical an individual profile of scores is compared to a particular population of score profiles. Using the given correlation matrix and assuming multivariate normality, this profile is at the 86th percentile in terms of profile unusualness, and almost all of the reason that it is unusual is that its overall elevation is unusually low (Mean = 72). If we consider only those profiles that have an average score of 72, this profile’s unusualness is at the 54th percentile (Schneider, in preparation). That is, the amount of variability in this profile is typical compared to other profiles with an average score of 72.

This post is an excerpt from:

Schneider, W. J. (2013). Principles of assessment of aptitude and achievement. In D. Saklofske, C. Reynolds, & V. Schwean (Eds.), Oxford handbook of psychological assessment of children and adolescents (pp. 286–330). New York: Oxford.

Figure 1 was updated from the original to show more accurately the precision that IQ scores have.

# Real Composite Scores vs. Averaged Pseudo-Composite Scores

Kevin McGrew and I wrote a position paper about the factors that influence the inaccuracy of averaged pseudo-composite scores compared to real composite scores. Averaged pseudo-composites are simple averages of test scores. For example, if a person scores 6 on WISC-IV Digit Span and 4 on WISC-IV Letter-Number Sequencing, an averaged pseudo-composite score measuring “Working Memory” would be 5, which is 75 in the index score metric. In truth, the real composite score should be lower than 75; how much lower depends on the correlation between the two subtests.
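
The size of the gap can be illustrated with a short Python sketch. The correlation of 0.5 used here is purely hypothetical, chosen for illustration and not taken from the WISC-IV manual, and the function name is likewise an assumption. Scaled scores are treated as having a mean of 10 and a standard deviation of 3.

```python
import math

def pseudo_and_real_composite(scaled_a, scaled_b, r):
    """Compare the averaged pseudo-composite with the real composite (index metric)."""
    z_a = (scaled_a - 10) / 3
    z_b = (scaled_b - 10) / 3
    pseudo = 15 * (z_a + z_b) / 2 + 100                   # simple average, rescaled
    real = 15 * (z_a + z_b) / math.sqrt(2 + 2 * r) + 100  # proper composite
    return pseudo, real

# r = 0.5 is a hypothetical correlation used only for illustration.
pseudo, real = pseudo_and_real_composite(6, 4, r=0.5)
print(f"pseudo-composite = {pseudo:.1f}, real composite = {real:.1f}")
```

With any correlation below 1, the real composite falls below 75; the two agree only when the subtests are perfectly correlated.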