Using the multivariate truncated normal distribution

In a previous post, I imagined that there was a gifted education program that had a strictly enforced selection procedure: everyone with an IQ of 130 or higher is admitted. With the (univariate) truncated normal distribution, we were able to calculate the mean of the selected group (mean IQ = 135.6).

Multivariate Truncated Normal Distributions

Reading comprehension has a strong relationship with IQ $(\rho\approx 0.70)$. What is the average reading comprehension score among students in the gifted education program? If we can assume that reading comprehension is normally distributed $(\mu=100, \sigma=15)$ and the relationship between IQ and reading comprehension is linear $(\rho=0.70)$, then we can answer this question using the multivariate truncated normal distribution, in which portions of the multivariate normal distribution have been truncated (sliced off). In this case, the blue portion of the bivariate normal distribution of IQ and reading comprehension has been sliced off. The portion remaining (in red) is the distribution we are interested in. Here it is in 3D:

Here is the same distribution with simulated data points in 2D:

Expected Values

In the picture above, the expected value (i.e., mean) for the IQ of the students in the gifted education program is 135.6. In the last post, I showed how to calculate this value.

The expected value (i.e., mean) for the reading comprehension score is 124.9. How is this calculated? The general method is fairly complicated and requires specialized software such as the R package tmvtnorm. However, in the bivariate case with a single truncation, we can simply calculate the predicted reading comprehension score when IQ is 135.6:

$\dfrac{\hat{Y}-\mu_Y}{\sigma_Y}=\rho_{XY}\dfrac{X-\mu_X}{\sigma_X}$

$\dfrac{\hat{Y}-100}{15}=0.7\dfrac{135.6-100}{15}$

$\hat{Y}=124.9$

In R, the same answer is obtained via the tmvtnorm package:

library(tmvtnorm)
#Variable names
vNames<-c("IQ","Reading")

#Vector of Means
mu<-c(100,100)
names(mu)<-vNames;mu

#Vector of Standard deviations
sigma<-c(15,15)
names(sigma)<-vNames;sigma

#Correlation between IQ and Reading Comprehension
rho<-0.7

#Correlation matrix
R<-matrix(c(1,rho,rho,1),ncol=2)
rownames(R)<-colnames(R)<-vNames;R

#Covariance matrix
C<-diag(sigma)%*%R%*%diag(sigma)
rownames(C)<-colnames(C)<-vNames;C

#Vector of lower bounds (-Inf means negative infinity)
a<-c(130,-Inf)

#Vector of upper bounds (Inf means positive infinity)
b<-c(Inf,Inf)

#Means and covariance matrix of the truncated distribution
m<-mtmvnorm(mean=mu,sigma=C,lower=a,upper=b)
rownames(m$tvar)<-colnames(m$tvar)<-vNames;m

#Means of the truncated distribution
tmu<-m$tmean;tmu

#Standard deviations of the truncated distribution
tsigma<-sqrt(diag(m$tvar));tsigma

#Correlation matrix of the truncated distribution
tR<-cov2cor(m$tvar);tR

In running the code above, we learn that the standard deviation of reading comprehension has shrunk from 15 in the general population to 11.28 in the truncated population. In addition, the correlation between IQ and reading comprehension has shrunk from 0.70 in the general population to 0.31 in the truncated population.
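If you would rather verify these shrinkage figures than take them on faith, a brute-force simulation gets close. This Python/NumPy sketch (an alternative cross-check, not part of the R workflow above) draws from the untruncated bivariate normal and discards everyone below 130:

```python
import numpy as np

# Simulate the bivariate normal for IQ and reading comprehension
# (mu = 100, sigma = 15, rho = 0.7), then truncate at IQ >= 130.
rng = np.random.default_rng(0)
cov = [[225.0, 0.7 * 225], [0.7 * 225, 225.0]]
iq, rc = rng.multivariate_normal([100, 100], cov, size=1_000_000).T

sel = iq >= 130  # the gifted program
print(iq[sel].mean())                        # ~135.6
print(rc[sel].mean())                        # ~124.9
print(rc[sel].std())                         # ~11.3 (down from 15)
print(np.corrcoef(iq[sel], rc[sel])[0, 1])   # ~0.31 (down from 0.70)
```

The Monte Carlo estimates land within a few hundredths of the tmvtnorm values.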

Marginal cumulative distributions

Among the students in the gifted education program, what proportion have reading comprehension scores of 100 or less? The question can be answered with the marginal cumulative distribution function. That is, what proportion of the red truncated region is less than 100 in reading comprehension? Assuming that the code in the previous section has been run already, this code will yield the answer of about 1.3%:

#Proportion of students in the gifted program with reading comprehension of 100 or less
ptmvnorm(lowerx=c(-Inf,-Inf),upperx=c(Inf,100),mean=mu,sigma=C,lower=a,upper=b)

The mean, sigma, lower, and upper parameters define the truncated normal distribution. The lowerx and the upperx parameters define the lower and upper bounds of the subregion in question. In this case, there are no restrictions except an upper limit of 100 on the second axis (the Y-axis).
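The same ~1.3% can be derived from first principles without tmvtnorm: condition on IQ and integrate the conditional probability that reading comprehension falls at or below 100. A standard-library Python sketch using a simple Riemann sum (approximate by construction):

```python
from math import erf, exp, pi, sqrt

def phi(z):   # standard normal density
    return exp(-z * z / 2) / sqrt(2 * pi)

def Phi(z):   # standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

rho = 0.7
s = sqrt(1 - rho**2)   # conditional SD of the reading z-score given the IQ z-score

# P(z_IQ >= 2 and z_RC <= 0): integrate P(z_RC <= 0 | z_IQ = x) over x >= 2
dx = 0.001
joint = sum(phi(2 + i * dx) * Phi(-rho * (2 + i * dx) / s) * dx
            for i in range(8000))

p = joint / (1 - Phi(2))   # condition on being in the gifted program
print(p)  # ~0.013
```

This is essentially what ptmvnorm computes, just by a cruder quadrature.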

If we plot the cumulative distribution of reading comprehension scores in the gifted population, it is close to (but not the same as) that of the conditional distribution of reading comprehension at IQ = 135.6.

What proportion does the truncated distribution occupy in the untruncated distribution?

Imagine that in order to qualify for services for intellectual disability, a person must score 70 or below on an IQ test. Every three years, the person must undergo a re-evaluation. Suppose that the correlation between the original test and the re-evaluation test is $\rho=0.90$. If the entire population were given both tests, what proportion would score 70 or lower on both tests? What proportion would score 70 or lower on the first test but not on the second? Such questions can be answered with the pmvnorm function from the mvtnorm package (which is a prerequisite of the tmvtnorm package and is thus already loaded if you ran the previous code blocks).

library(mvtnorm)
#Means
IQmu<-c(100,100)

#Standard deviations
IQsigma<-c(15,15)

#Correlation
IQrho<-0.9

#Correlation matrix
IQcor<-matrix(c(1,IQrho,IQrho,1),ncol=2)

#Covariance matrix
IQcov<-diag(IQsigma)%*%IQcor%*%diag(IQsigma)

#Proportion of the general population scoring 70 or less on both tests
pmvnorm(lower=c(-Inf,-Inf),upper=c(70,70),mean=IQmu,sigma=IQcov)

#Proportion of the general population scoring 70 or less on the first test but not on the second test
pmvnorm(lower=c(-Inf,70),upper=c(70,Inf),mean=IQmu,sigma=IQcov)
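As a cross-check on pmvnorm, both rectangle probabilities can be computed by conditioning on the first test score, as in the earlier sketch. A standard-library Python approximation (Riemann sum, assuming $\rho=0.90$):

```python
from math import erf, exp, pi, sqrt

def phi(z):   # standard normal density
    return exp(-z * z / 2) / sqrt(2 * pi)

def Phi(z):   # standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

rho = 0.9
s = sqrt(1 - rho**2)

# P(both tests <= 70), i.e., both z-scores <= -2:
# integrate P(z2 <= -2 | z1 = x) over x <= -2
dx = 0.001
both = sum(phi(-2 - i * dx) * Phi((-2 - rho * (-2 - i * dx)) / s) * dx
           for i in range(8000))

first_only = Phi(-2) - both   # 70 or lower on the first test but not the second
print(both)        # ~0.013
print(first_only)  # ~0.009
```

So with a 0.90 correlation, roughly 1.3% of the population would qualify on both occasions and roughly 0.9% would qualify the first time but not the second.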

What are the means of these truncated distributions?

#Mean scores among people scoring 70 or less on both tests
mtmvnorm(mean=IQmu,sigma=IQcov,lower=c(-Inf,-Inf),upper=c(70,70))

#Mean scores among people scoring 70 or less on the first test but not on the second test
mtmvnorm(mean=IQmu,sigma=IQcov,lower=c(-Inf,70),upper=c(70,Inf))

Combining this information in a plot:

Thus, we can see that the multivariate truncated normal distribution can be used to answer a wide variety of questions. With a little creativity, we can apply it to many more kinds of questions.


Conditional normal distributions provide useful information in psychological assessment

Conditional normal distributions are really useful in psychological assessment. We can use them to answer questions like:

• How unusual is it for someone with a vocabulary score of 120 to have a score of 90 or lower on reading comprehension?
• If that person also has a score of 80 on a test of working memory capacity, how much does the risk of scoring 90 or lower on reading comprehension increase?

What follows might be mathematically daunting. Just let it wash over you if it becomes confusing. At the end, there is a video in which I will show how to use a spreadsheet that will do all the calculations.

Unconditional Normal Distributions

Suppose that variable Y represents reading comprehension test scores. Here is a fancy way of saying Y is normally distributed with a mean of 100 and a standard deviation of 15: $Y\sim N(100,15^2)$

In this notation, “~” means is distributed as, and N means normally distributed with a particular mean (μ) and variance (σ²).

If we know literally nothing about a person from this population, our best guess is that the person’s reading comprehension score is at the population mean. One way to say this is that the person’s expected value on reading comprehension is the population mean: $E(Y)=\mu_Y = 100$

The 95% confidence interval around this guess is:

$95\%\, \text{CI} = \mu_Y \pm z_{95\%} \sigma_Y$

$95\%\, \text{CI} \approx 100 \pm 1.96*15 = 70.6 \text{ to } 129.4$

Conditional Normal Distributions

Simple Linear Regression

Now, suppose that we know one thing about the person: the person’s score on a vocabulary test. We can let X represent the vocabulary score and its distribution is the same as that of Y: $X\sim N(100,15^2)$

If we know that this person scored 120 on vocabulary (X), what is our best guess as to what the person scored on reading comprehension (Y)? This guess is a conditional expected value. It is “conditional” in the sense that the expected value of Y depends on what value X has. The pipe symbol “|” is used to note a condition like so: $E(Y|X=120)$

This means, “What is our best guess for Y if X is 120?”

What if we don’t want to be specific about the value of X but want to refer to any particular value of X? Oddly enough, it is traditional to use the lowercase x for that. So, X refers to the variable as a whole and x refers to any particular value of variable X. So if I know that variable X happens to be a particular value x, the expected value of Y is: $E(Y|X=x)=\sigma_Y \rho_{XY}\dfrac{x-\mu_X}{\sigma_X}+\mu_Y$

where ρXY is the correlation between X and Y.

You might recognize that this is a linear regression formula and that: $E(Y|X=x)=\hat{Y}$

where “Y-hat” (Ŷ) is the predicted value of Y when X is known.

Let’s assume that the relationship between X and Y is bivariate normal like in the image at the top of the post: $\begin{bmatrix}X\\Y\end{bmatrix}\sim N\left(\begin{bmatrix}\mu_X\\ \mu_Y\end{bmatrix},\begin{bmatrix}\sigma_X^2&\rho_{XY}\sigma_X\sigma_Y\\ \rho_{XY}\sigma_X\sigma_Y&\sigma_Y^2\end{bmatrix}\right)$

The first term in the parentheses is the vector of means and the second term (the square matrix in the brackets) is the covariance matrix of X and Y. It is not necessary to understand the notation. The main point is that X and Y are both normal, they have a linear relationship, and the conditional variance of Y at any value of X is the same.

The conditional standard deviation of Y at any particular value of X is: $\sigma_{Y|X=x}=\sigma_Y\sqrt{1-\rho_{XY}^2}$

This is the standard deviation of the conditional normal distribution. In the language of regression, it is the standard error of the estimate (σe). It is the standard deviation of the residuals (errors). Residuals are simply the amount by which your guesses differ from the actual values. $e = y - E(Y|X=x)=y-\hat{Y}$

So, $\sigma_{Y|X=x}=\sigma_e$

So, putting all this together, we can answer our question:

How unusual is it for someone with a vocabulary score of 120 to have a score of 90 or lower on reading comprehension?

The expected value of Y (Ŷ) is: $E(Y|X=120)=15\rho_{XY}\dfrac{120-100}{15}+100$

Suppose that the correlation is 0.5. Therefore, $E(Y|X=120)=15*0.5\dfrac{120-100}{15}+100=110$

This means that among all the people with a vocabulary score of 120, the average is 110 on reading comprehension. Now, how far off from that is 90? $e= y - \hat{Y}=90-110=-20$

What is the standard error of the estimate?

$\sigma_{Y|X=x}=\sigma_e=\sigma_Y\sqrt{1-\rho_{XY}^2}$

$\sigma_{Y|X=x}=\sigma_e=15\sqrt{1-0.5^2}\approx12.99$

Dividing the residual by the standard error of the estimate (the standard deviation of the conditional normal distribution) gives us a z-score. It represents how far from expectations this individual is in standard deviation units. $z=\dfrac{e}{\sigma_e} \approx\dfrac{-20}{12.99}\approx -1.54$

Using the standard normal cumulative distribution function (Φ) gives us the proportion of people scoring 90 or less on reading comprehension (given a vocabulary score of 120). $\Phi(z)\approx\Phi(-1.54)\approx 0.06$

In Microsoft Excel, the standard normal cumulative distribution function is NORMSDIST. Thus, entering this into any cell will give the answer:

=NORMSDIST(-1.54)
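Outside Excel, the same lookup is one line in most languages; in Python, for instance, the standard normal CDF can be built from the standard library’s error function:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

e = 90 - 110                 # residual: observed minus conditional mean
se = 15 * sqrt(1 - 0.5**2)   # standard error of the estimate, ~12.99
print(Phi(e / se))           # ~0.06
```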

Multiple Regression

What proportion of people score 90 or less on reading comprehension if their vocabulary is 120 but their working memory capacity is 80?

Let’s call vocabulary X1 and working memory capacity X2. Let’s suppose they correlated at 0.3. The correlation matrix among the predictors (RX): $\mathbf{R_X}=\begin{bmatrix}1&\rho_{12}\\ \rho_{12}&1\end{bmatrix}=\begin{bmatrix}1&0.3\\ 0.3&1\end{bmatrix}$

The validity coefficients are the correlations of Y with both predictors (RXY): $\mathbf{R}_{XY}=\begin{bmatrix}\rho_{Y1}\\ \rho_{Y2}\end{bmatrix}=\begin{bmatrix}0.5\\ 0.4\end{bmatrix}$

The standardized regression coefficients (β) are: $\pmb{\mathbf{\beta}}=\mathbf{R_{X}}^{-1}\mathbf{R}_{XY}\approx\begin{bmatrix}0.418\\ 0.275\end{bmatrix}$

Unstandardized coefficients can be obtained by multiplying the standardized coefficients by the standard deviation of Y (σY) and dividing by the standard deviation of the predictors (σX): $\mathbf{b}=\sigma_Y\pmb{\mathbf{\beta}}/\pmb{\mathbf{\sigma}}_X$

However, in this case all the variables have the same metric and thus the unstandardized and standardized coefficients are the same.

The vector of predictor means (μX) is used to calculate the intercept (b0): $b_0=\mu_Y-\mathbf{b}' \pmb{\mathbf{\mu}}_X$ $b_0\approx 100-\begin{bmatrix}0.418\\ 0.275\end{bmatrix}^{'} \begin{bmatrix}100\\ 100\end{bmatrix}\approx 30.769$

The predicted score when vocabulary is 120 and working memory capacity is 80 is: $\hat{Y}=b_0 + b_1 X_1 + b_2 X_2$ $\hat{Y}\approx 30.769+0.418*120+0.275*80\approx 102.9$

The error in this case is $90-102.9=-12.9$.

The multiple R2 is calculated with the standardized regression coefficients and the validity coefficients. $R^2 = \pmb{\mathbf{\beta}}'\pmb{\mathbf{R}}_{XY}\approx\begin{bmatrix}0.418\\ 0.275\end{bmatrix}^{'} \begin{bmatrix}0.5\\ 0.4\end{bmatrix}\approx0.319$

The standard error of the estimate is thus: $\sigma_e=\sigma_Y\sqrt{1-R^2}\approx 15\sqrt{1-0.319}\approx 12.38$

The proportion of people with vocabulary = 120 and working memory capacity = 80 who score 90 or less is: $\Phi\left(\dfrac{e}{\sigma_e}\right)\approx\Phi\left(\dfrac{-12.9}{12.38}\right)\approx 0.15$
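The whole multiple-regression calculation fits in a few lines of matrix code. This NumPy sketch mirrors each step of the hand calculation above (values in the comments are rounded):

```python
import numpy as np
from math import erf, sqrt

def Phi(z):   # standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

Rx = np.array([[1.0, 0.3], [0.3, 1.0]])   # correlations among predictors
Rxy = np.array([0.5, 0.4])                # validity coefficients
beta = np.linalg.solve(Rx, Rxy)           # standardized coefficients, ~[0.418, 0.275]

# All variables are index scores (mu = 100, sigma = 15), so b = beta
b0 = 100 - beta @ np.array([100, 100])    # intercept, ~30.77
yhat = b0 + beta @ np.array([120, 80])    # predicted score, ~102.9
R2 = beta @ Rxy                           # ~0.319
se = 15 * sqrt(1 - R2)                    # ~12.38
print(Phi((90 - yhat) / se))              # ~0.15
```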


A Gentle, Non-Technical Introduction to Factor Analysis

When measuring characteristics of physical objects, there may be some disagreement about the best methods to use but there is little disagreement about which dimensions are being measured. We know that we are measuring length when we use a ruler and we know that we are measuring temperature when we use a thermometer. It is true that heating some materials makes them expand but we are virtually never confused about whether heat and length represent distinct dimensions that are independent of each other. That is, they are independent of each other in the sense that things can be cold and long, cold and short, hot and long, or hot and short.

Unfortunately, we are not nearly as clear about what we are measuring when we attempt to measure psychological dimensions such as personality traits, motivations, beliefs, attitudes, and cognitive abilities. Psychologists often disagree not only about what to name these dimensions but also about how many dimensions there are to measure. For example, you might think that there exists a personality trait called niceness. Another person might disagree with you, arguing that niceness is a vague term that lumps together 2 related but distinguishable traits called friendliness and kindness. Another person could claim that kindness is too general and that we must separate kindness with friends from kindness with strangers.

As you might imagine, these kinds of arguments can quickly lead to hypothesizing the existence of as many different traits as our imaginations can generate. The result would be a hopeless confusion among psychological researchers because they would have no way to agree on what to measure so that they can build upon one another’s findings. Fortunately, there are ways to put some limits on the number of psychological dimensions and come to some degree of consensus about what should be measured. One of the most commonly used of such methods is called factor analysis.

Although the mathematics of factor analysis is complicated, the logic behind it is not difficult to understand. The assumption behind factor analysis is that things that co-occur tend to have a common cause (but not always). For example, fevers, sore throats, stuffy noses, coughs, and sneezes tend to occur at roughly the same time in the same person. Often, they are caused by the same thing, namely, the virus that causes the common cold. Note that although the virus is one thing, its manifestations are quite diverse. In psychological assessment research, we measure a diverse set of abilities, behaviors and symptoms and attempt to deduce which underlying dimensions cause or account for the variations in behavior and symptoms we observe in large groups of people. We measure the relations between various behaviors, symptoms, and test scores with correlation coefficients and use factor analysis to discover patterns of correlation coefficients that suggest the existence of underlying psychological dimensions.

All else being equal, a simple theory is better than a complicated theory. Therefore, factor analysis helps us discover the smallest number of psychological dimensions (i.e., factors) that can account for the correlation patterns in the various behaviors, symptoms, and test scores we observe. For example, imagine that we create 4 different tests that would measure people’s knowledge of vocabulary, grammar, arithmetic, and geometry. If the correlations between all of these tests were 0 (i.e., high scorers on one test are no more likely to score high on the other tests than low scorers), then the factor analysis would suggest to us that we have measured 4 distinct abilities. In the picture below, the correlations between all the tests are displayed in the table. Below that, the theoretical model that would be implied is that there are 4 abilities (shown as ellipses) that influence performance on 4 tests (shown as rectangles). The numbers beside the arrows imply that the abilities and the tests have high but imperfect correlations of 0.9.

Of course, you probably recognize that it is very unlikely that the correlations between these tests would be 0. Therefore, imagine that the correlation between the vocabulary and grammar tests is quite high (i.e., high scorers on vocabulary are likely to also score high on grammar and low scorers on vocabulary are likely to score low on grammar). The correlation between arithmetic and geometry is high also. Furthermore, the correlations between the language tests and the mathematics tests are 0. Factor analysis would suggest that we have measured not 4 distinct abilities but rather 2. Researchers interpreting the results of the factor analysis would have to use their best judgment to decide what to call these 2 abilities. In this case, it would seem reasonable to call them language ability and mathematical ability. These 2 abilities (shown below as ellipses) influence performance on 4 tests (shown as rectangles).
Now imagine that the correlations between all 4 tests are equally high. That is, for example, vocabulary is just as strongly correlated with geometry as it is with grammar. In this case, factor analysis would suggest that the simplest explanation for this pattern of correlations is that there is just 1 factor that causes all of these tests to be equally correlated. We might call this factor general academic ability.

In reality, if you were to actually measure these 4 abilities, the results would not be so clear. It is likely that all of the correlations would be positive and substantially above 0. It is also likely that the language subtests would correlate more strongly with each other than with the mathematical subtests. In such a case, factor analysis would suggest that language and mathematical abilities are distinct but not entirely independent of each other. That is, language abilities and mathematics abilities are substantially correlated with each other, suggesting that a general academic (or intellectual) ability influences performance in all academic areas. In this model, abilities are arranged in hierarchies, with general abilities influencing narrow abilities.

Exploratory Factor Analysis

Factor analysis can help researchers decide how best to summarize large amounts of information about people using just a few scores. For example, when we ask parents to complete questionnaires about behavior problems their children might have, the questionnaires can have hundreds of items. It would take too long and would be too confusing to review every item. Factor analysis can simplify the information while minimizing the loss of detail. Here is an example of a short questionnaire that factor analysis can be used to summarize.

On a scale of 1 to 5, compared to other children his or her age, my child:

1. gets in fights frequently at school
2. is defiant to adults
3. is very impulsive
4. has stomachaches frequently
5. is anxious about many things
6. appears sad much of the time

If we give this questionnaire to a large, representative sample of parents, we can calculate the correlations between the items:

                                         1    2    3    4    5
1. gets in fights frequently at school
2. is defiant to adults                 .81
3. is very impulsive                    .79  .75
4. has stomachaches frequently          .42  .38  .36
5. is anxious about many things         .39  .34  .34  .77
6. appears sad much of the time         .37  .34  .32  .77  .74

Using this set of correlation coefficients, factor analysis suggests that there are 2 factors being measured by this behavior rating scale. The logic of factor analysis suggests that the reason items 1-3 have high correlations with each other is that each of them has a high correlation with the first factor. Similarly, items 4-6 have high correlations with each other because they have high correlations with the second factor. The correlations that the items have with the hypothesized factors are called factor loadings. The factor loadings can be seen in the chart below:

                                        Factor 1   Factor 2
1. gets in fights frequently at school     .91        .03
2. is defiant to adults                    .88       -.01
3. is very impulsive                       .86       -.01
4. has stomachaches frequently             .02        .89
5. is anxious about many things            .01        .86
6. appears sad much of the time           -.02        .87

Factor analysis tells us which items “load” on which factors but it cannot interpret the meaning of the factors. Usually researchers look at all of the items that load on a factor and use their intuition or knowledge of theory to identify what the items have in common. In this case, Factor 1 could receive any number of names such as Conduct Problems, Acting Out, or Externalizing Behaviors. Likewise, Factor 2 could be called Mood Problems, Negative Affectivity, or Internalizing Behaviors. Thus, the problems on this behavior rating scale can be summarized fairly efficiently with just 2 scores. In this example, a reduction of 6 scores to 2 scores may not seem terribly useful. In actual behavior rating scales, factor analysis can reduce the overwhelming complexity of hundreds of different behavior problems to a more manageable number of scores that help professionals more easily conceptualize individual cases.

It should be noted that factor analysis also calculates the correlation among factors. If a large number of factors are identified and there are substantial correlations (i.e., significantly larger than 0) among factors, this new correlation matrix can be factor analyzed also to obtain second-order factors. These factors, in turn, can be analyzed to obtain third-order factors. Theoretically, it is possible to have even higher order factors but most researchers rarely find it necessary to go beyond third-order factors. The g-factor from intelligence test data is an example of a third-order factor that emerges because all tests of cognitive abilities are positively correlated. In our example above, the 2 factors have a correlation of .46, suggesting that children who have externalizing problems are also at risk of having internalizing problems. It is therefore reasonable to calculate a second-order factor score that measures the overall level of behavior problems.
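If the loadings and the .46 factor correlation are taken at face value, the model should reproduce the observed item correlations: the model-implied correlation matrix is the loadings times the factor correlation matrix times the transposed loadings. This NumPy sketch (a sanity check, not part of the original analysis) verifies two of them:

```python
import numpy as np

# Factor loadings from the chart above (rows = items 1-6, cols = 2 factors)
L = np.array([[0.91,  0.03],
              [0.88, -0.01],
              [0.86, -0.01],
              [0.02,  0.89],
              [0.01,  0.86],
              [-0.02, 0.87]])
F = np.array([[1.0, 0.46], [0.46, 1.0]])  # factor correlation matrix

implied = L @ F @ L.T   # model-implied item correlations (off-diagonal)
print(round(implied[0, 1], 2))  # ~0.81, matches "fights" vs. "defiant"
print(round(implied[0, 3], 2))  # ~0.42, matches "fights" vs. "stomachaches"
```

The close match between the implied and observed correlations is exactly what “good fit” means.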

This example illustrates the most commonly used type of factor analysis: exploratory factor analysis. Exploratory factor analysis is helpful when we wish to summarize data efficiently, we are not sure how many factors are present in our data, or we are not sure which items load on which factors.

Confirmatory Factor Analysis

Confirmatory factor analysis is a method that researchers can use to test highly specific hypotheses. For example, a researcher might want to know if the 2 different types of items on the WISC-IV Digit Span subtest measure the same ability or 2 different abilities. On the Digits Forward type of item, the child must repeat a string of digits in the same order in which they were heard. On the Digits Backward type of item, the child must repeat the string of digits in reverse order. Some researchers believe that repeating numbers verbatim measures auditory short-term memory storage capacity and that repeating numbers in reverse order measures executive control, the ability to allocate attentional resources efficiently to solve multi-step problems. Typically, clinicians add the raw scores of both types of items to produce a single score. If the 2 item types measure different abilities, adding the raw scores together is like adding apples and orangutans. If, however, they measure the same ability, adding the scores together is valid and will produce a more reliable score than using separate scores.

To test this hypothesis, we can use confirmatory factor analysis to see if the 2 item types measure different abilities. We would need to identify or invent several tests that are likely to measure the 2 separate abilities that we believe are measured by the 2 types of Digit Span items. Usually, using 3 tests per factor is sufficient.

Next, we specify the hypotheses, or models, we wish to test:

1. All of the tests measure the same ability. A graphical representation of a hypothesis in confirmatory factor analysis is called a path diagram. Tests are drawn with rectangles and hypothetical factors are drawn with ovals. The correlations between tests and factors are drawn with arrows. The path diagram for this hypothesis would look like this:

2. Both Digits Forward and Digits Backward measure short-term memory storage capacity and are distinct from executive control. The path diagram would look like this (the curved arrow allows for the possibility that the 2 factors might be correlated):

3. Digits Forward and Digits Backward measure different abilities. The path diagram would look like this:

Confirmatory factor analysis produces a number of statistics, called fit statistics, that tell us which of the models or hypotheses we tested are most in agreement with the data. Studying the results, we can select the best model or perhaps generate a new model if none of them provides a good “fit” with the data. With structural equation modeling, a procedure that is very similar to confirmatory factor analysis, we can test extremely complex hypotheses about the structure of psychological variables.

This post is a revised version of a tutorial I originally prepared for Cohen & Swerdlik’s Psychological Testing and Assessment: An Introduction To Tests and Measurement


Can’t Decide Which IQ Is Best? Make a Composite Score.

A man with a watch knows what time it is. A man with two watches is never sure.

-Segal’s Law

Suppose you have been asked to settle a matter with important implications for an evaluee. A young girl was diagnosed with mental retardation [now called intellectual disability] three years ago. Along with low adaptive functioning, her Full Scale IQ was a 68, two points under the traditional line used to diagnose [intellectual disability]. Upon re-evaluation two months ago, her IQ, derived from a different test, was now 78. Worried that their daughter would no longer qualify for services, the family paid out of pocket to have their daughter evaluated by another psychologist and the IQ came out as 66. Because of your reputation for being fair-minded and knowledgeable, you have been asked to decide which, if any, is the real IQ. Of course, there is no such thing as a “real IQ” but you understand what the referral question is.

You give a different battery of tests and the girl scores a 76. Now what should be done? It would be tempting to assume that other psychologists are sloppy, whereas your own results are free of error. However, you are fair-minded. You know that all scores have measurement error, so you plot the scores and their 95% confidence intervals, as seen in Figure 1.

Figure 1

Recent IQ Scores and their 95% Confidence Intervals from the Same Individual

It is clear that Test C’s confidence interval does not overlap with those of Tests B and D. Is this kind of variability in scores unusual? There are two tests that indicate an IQ in the high 60s and two tests that indicate an IQ in the high 70s. Which pair of tests is correct? Should the poor girl be subjected to yet another test that might act as a tie breaker?

Perhaps the fairest solution is to treat each IQ test as subtests of a much larger “Mega-IQ Test.” That is, perhaps the best that can be done is to combine the four IQ scores into a single score and then construct a confidence interval around it.

Where should the confidence interval be centered? Intuitively, it might seem reasonable to simply average all four IQ results and say that the IQ is 72. However, this is not quite right. Averaging scores gives a rough approximation of a composite score but it is less accurate for low and high scorers than it is for scorers near the mean. An individual’s composite score is further away from the population mean than the average of the individual’s subtest scores. About 3.1% of people score a 72 or lower on a single IQ test (assuming perfect normality). However, if we were to imagine a population of people who took all four IQ tests in question, only 1.9% of them would have an average score of 72 or lower. That is, it is more unusual to have a mean IQ of 72 than it is to score a 72 IQ on any particular IQ test. It is unusual to score 72 on one IQ test but it is even more unusual to score that low on more than one test on average. Another way to think about this issue is to recognize that the mean score cannot be interpreted as an IQ score because it has a smaller standard deviation than IQ scores have. To make it comparable to IQ, it needs to be rescaled so that it has a “standard” standard deviation of 15.

Here is a good method for computing a composite score and its accompanying 95% confidence interval. It is not nearly as complicated as it might seem at first glance. This method assumes that you know the reliability coefficients of all the scores and you know all the correlations between the scores. All scores must be index scores (μ = 100, σ = 15). If they are not, they can be converted using this formula: $\text{Index Score} = 15(\dfrac{X-\mu}{\sigma})+100$

Computing a Composite Score

Step 1: Add up all of the scores.

In this case, $68 + 78 + 66 + 76 = 288$

Step 2: Subtract the number of tests times 100.

In this case there are 4 tests. Thus, $288-4 * 100 = 288-400 = -112$

Step 3: Divide by the square root of the sum of all the elements in the correlation matrix.

In this case, suppose that the four tests are correlated as such:

        Test A  Test B  Test C  Test D
Test A   1.00    0.80    0.75    0.85
Test B   0.80    1.00    0.70    0.71
Test C   0.75    0.70    1.00    0.78
Test D   0.85    0.71    0.78    1.00

The sum of all 16 elements, including the ones on the diagonal, is 13.18. The square root of 13.18 is about 3.63. Thus, $-112 / 3.63 \approx -30.85$

Step 4: Complete the computation of the composite score by adding 100.

In this case, $-30.85 + 100 = 69.15$

Given the four IQ scores available, assuming that there is no reason to favor one above the others, the best estimate is that her IQ is 69. Most of the time, there is no need for further calculation. However, we might like to know how precise this estimate is by constructing a 95% confidence interval around this score.
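Steps 1 through 4 can be scripted so that rounding happens only at the end. A minimal Python sketch of the composite-score calculation (second-decimal differences from the hand calculation are rounding):

```python
from math import sqrt

scores = [68, 78, 66, 76]
R = [[1.00, 0.80, 0.75, 0.85],
     [0.80, 1.00, 0.70, 0.71],
     [0.75, 0.70, 1.00, 0.78],
     [0.85, 0.71, 0.78, 1.00]]

R_sum = sum(sum(row) for row in R)   # sum of all 16 elements: 13.18
# Steps 1-4: sum the scores, subtract 100 per test,
# divide by sqrt of the correlation-matrix sum, add 100
composite = 100 + (sum(scores) - 100 * len(scores)) / sqrt(R_sum)
print(round(composite, 2))  # ~69.15
```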

Confidence Intervals of Composite Scores

Calculating a 95% confidence interval is more complicated than the calculations above but not overly so.

Step 1: Calculate the composite reliability.

Step 1a: Subtract the number of tests from the sum of the correlation matrix.

In this case, there are 4 tests. Therefore, $13.18-4 = 9.18$

Step 1b: Add in all the test reliability coefficients.

In this case, suppose that the four reliability coefficients are 0.97, 0.96, 0.98, and 0.97. Therefore, $9.18 + 0.97 + 0.96 + 0.98 + 0.97 = 13.06$

Step 1c: Divide by the original sum of the correlation matrix.

In this case, $13.06 / 13.18 \approx 0.9909$

Therefore, in this case, the reliability coefficient of the composite score is higher than that of any single IQ score. This makes sense: with four scores, we should know her IQ with greater precision than we would with only one score.

Step 2: Calculate the standard error of the estimate by subtracting the reliability coefficient squared from the reliability coefficient and taking the square root. Then, multiply by the standard deviation, 15.

In this case, $15\sqrt{0.9909-0.9909^2}\approx 1.4247$

Step 3: Calculate the 95% margin of error by multiplying the standard error of the estimate by 1.96.

In this case, $1.96 * 1.4247 \approx 2.79$

The value 1.96 is the approximate z-score associated with the 95% confidence interval. If you want the z-score associated with a different confidence level, use the following Excel formula. Shown here is the calculation of the z-score for a 99% confidence interval: $=\mathrm{NORMSINV}(1-(1-0.99)/2)$

Step 4: Calculate the estimated true score by subtracting 100 from the composite score, multiplying by the reliability coefficient, and adding 100 back. That is, $\text{Estimated True Score} =\text{Reliability Coefficient} * (\text{Composite} - 100) + 100$

In this case, $0.9909*(69.15-100)+100 \approx 69.43$

Step 5: Calculate the upper and lower bounds of the 95% confidence interval by starting with the estimated true score and then adding and subtracting the margin of error.

In this case, $69.43 \pm 2.79 = 66.64 \text{ to } 72.22$

This means that we are 95% sure that her IQ is between about 67 and 72. Assuming that other criteria for mental retardation [intellectual disability] are met, this is in the range to qualify for services in most states. It should be noted that this procedure can be used for any kind of composite score, not just for IQ tests.
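Steps 1 through 5 can be collected into a single Python sketch (again, only the standard library is needed, and small rounding differences from the hand calculations are possible):

```python
from math import sqrt

R_sum = 13.18                            # sum of the correlation matrix (Step 3 earlier)
reliabilities = [0.97, 0.96, 0.98, 0.97]
k = 4
composite = (288 - 100 * k) / sqrt(R_sum) + 100   # the composite from the previous steps

# Step 1: composite reliability
rxx = (R_sum - k + sum(reliabilities)) / R_sum

# Step 2: standard error of the estimate
see = 15 * sqrt(rxx - rxx ** 2)

# Step 3: 95% margin of error
moe = 1.96 * see

# Step 4: estimated true score
true_score = rxx * (composite - 100) + 100

# Step 5: 95% confidence interval
print(round(rxx, 4))                                           # 0.9909
print(round(true_score - moe, 2), round(true_score + moe, 2))  # about 66.64 72.22
```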

This degree of profile variability is not at all unusual. In fact, it is quite typical. A statistic called the Mahalanobis Distance (Crawford & Allen, 1994) can be used to estimate how typical an individual profile of scores is compared to a particular population of score profiles. Using the given correlation matrix and assuming multivariate normality, this profile is at the 86th percentile in terms of profile unusualness…and almost all of the reason that it is unusual is that its overall elevation is unusually low (Mean = 72). If we consider only those profiles that have an average score of 72, this profile’s unusualness is at the 54th percentile (Schneider, in preparation). That is, the amount of variability in this profile is typical compared to other profiles with an average score of 72.
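For readers who want to check the overall-unusualness figure themselves, here is a Python sketch of the Mahalanobis distance using the correlation matrix given above and assuming multivariate normality. With four tests, $D^2$ is compared to a chi-square distribution with 4 degrees of freedom, whose CDF happens to have a closed form:

```python
from math import exp

# The profile and correlation matrix from the post
scores = [68, 78, 66, 76]
R = [[1.00, 0.80, 0.75, 0.85],
     [0.80, 1.00, 0.70, 0.71],
     [0.75, 0.70, 1.00, 0.78],
     [0.85, 0.71, 0.78, 1.00]]

z = [(s - 100) / 15 for s in scores]   # profile in z-score units

def solve(A, b):
    """Solve A y = b by Gaussian elimination (A is small and well-conditioned)."""
    n = len(b)
    A = [row[:] for row in A]
    b = list(b)
    for i in range(n):
        for j in range(i + 1, n):
            f = A[j][i] / A[i][i]
            for c in range(i, n):
                A[j][c] -= f * A[i][c]
            b[j] -= f * b[i]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        y[i] = (b[i] - sum(A[i][c] * y[c] for c in range(i + 1, n))) / A[i][i]
    return y

# Squared Mahalanobis distance: D^2 = z' R^{-1} z
d2 = sum(zi * yi for zi, yi in zip(z, solve(R, z)))

# Under multivariate normality, D^2 is chi-square with df = 4; for df = 4
# the CDF is 1 - exp(-x/2) * (1 + x/2), so no statistics library is needed.
percentile = 1 - exp(-d2 / 2) * (1 + d2 / 2)
print(round(d2, 2))              # about 6.84
print(round(100 * percentile))   # 86 -- the post's 86th percentile
```

(The 54th-percentile figure conditions on profile elevation and is not reproduced here.)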

This post is an excerpt from:

Schneider, W. J. (2013). Principles of assessment of aptitude and achievement. In D. Saklofske, C. Reynolds, & V. Schwean (Eds.), Oxford handbook of psychological assessment of children and adolescents (pp. 286–330). New York: Oxford.

Figure 1 was updated from the original to show more accurately the precision that IQ scores have.


Predicted Achievement Using Simple Linear Regression

There are two ways to make an estimate of a person’s abilities. A point estimate (a single number) is precise but usually wrong, whereas an interval estimate (a range of numbers) is usually right but can be so wide that it is nearly useless. Confidence intervals combine both types of estimates in order to balance the weaknesses of one type of estimate with the strengths of the other. If I say that Suzie’s expected reading comprehension is 85 ± 11, the 85 is the point estimate (also known as the expected score or the predicted score or just Ŷ). The ± 11 is called the margin of error. If the confidence level is left unspecified, by convention we mean the 95% margin of error. If I add 11 and subtract 11 to get a range from 74 to 96, I have the respective lower and upper bounds of the 95% confidence interval.

Calculating the Predicted Achievement Score

I will assume that both the IQ and achievement scores are index scores (μ = 100, σ = 15) to make things simple. The predicted achievement score is a point estimate. It represents the best guess we can make in the absence of other information. The equation below is called a regression equation. $\hat{Y}=\sigma_Y r_{XY} \frac{X-\mu_X}{\sigma_X}+\mu_Y$

If X is IQ, Y is Achievement, and both scores are index scores (μ = 100, σ = 15), the regression equation simplifies to:

Predicted achievement = (Correlation between IQ and Achievement) (IQ – 100) + 100
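In code, the simplified equation is a one-liner; here it is applied to the worked example from the top of the post (ρ = 0.70, IQ = 135.6):

```python
def predicted_achievement(iq, r_xy):
    """Regression point estimate when both scores are index scores (mu=100, sd=15)."""
    return r_xy * (iq - 100) + 100

# The example from the top of the post: rho = 0.70, IQ = 135.6
print(round(predicted_achievement(135.6, 0.70), 1))  # 124.9
```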

Calculating the Confidence Interval for the Predicted Achievement Score

Whenever you make a prediction using regression, your estimate is rarely exactly right. It is expected to differ from the actual achievement score by a certain amount (on average). This amount is called the standard error of the estimate. It is the standard deviation of all the prediction errors. Thus, it is the standard to which all the errors in your estimates are compared. When both scores are index scores, the formula is $\text{Standard error of the estimate}=15\sqrt{1-r^2_{XY}}$

To calculate the margin of error, multiply the standard error of the estimate by the z-score that corresponds to the degree of confidence desired. In Microsoft Excel the formula for the z-score corresponding to the 95% confidence interval is

=NORMSINV(1-(1-0.95)/2)

≈1.96

For the 95% confidence interval, multiply the standard error of the estimate by 1.96. The 95% confidence interval’s formula is

95% Confidence Interval = Predicted achievement ± 1.96 * Standard error of the estimate

This interval estimates the range of achievement scores for 95% of people with the same IQ as the child. About 2.5% will score below the interval and 2.5% will score above it.
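Putting the point estimate, standard error of the estimate, and margin of error together (again with the ρ = 0.70, IQ = 135.6 example from the top of the post; the standard error is expressed in index-score units):

```python
from math import sqrt

r_xy = 0.70     # the post's example IQ-achievement correlation
iq = 135.6

predicted = r_xy * (iq - 100) + 100   # point estimate
see = 15 * sqrt(1 - r_xy ** 2)        # standard error of the estimate (index-score units)
moe = 1.96 * see                      # 95% margin of error
lower, upper = predicted - moe, predicted + moe
print(round(predicted, 1), round(moe, 1))  # 124.9 21.0
print(round(lower, 1), round(upper, 1))    # 103.9 145.9
```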

You can use Excel to estimate how unusual it is for an observed achievement score to differ from a predicted achievement score in a particular direction by using this formula:

=NORMSDIST(-1*ABS(Observed-Predicted)/(Standard error of the estimate))
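A Python equivalent of the Excel formula, using the error function from the standard library (the observed and predicted values below are hypothetical, chosen only for illustration):

```python
from math import erf, sqrt

def p_more_extreme(observed, predicted, see):
    """Python analogue of the Excel formula above: the proportion of people
    with this predicted score who score at least this far away from it in
    the observed direction."""
    z = -abs(observed - predicted) / see
    return 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF at z

# Hypothetical illustration: observed 85, predicted 100, SEE about 10.71
print(round(p_more_extreme(85, 100, 10.71), 3))  # about 0.081
```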

If a child’s observed achievement score is unusually low, it does not automatically mean that the child has a learning disorder. Many other things need to be checked before that diagnosis can be considered valid. However, it does mean that an explanation for the unusually low achievement score should be sought.

This post is an excerpt from:

Schneider, W. J. (2013). Principles of assessment of aptitude and achievement. In D. Saklofske, C. Reynolds, & V. Schwean (Eds.), Oxford handbook of psychological assessment of children and adolescents (pp. 286–330). New York: Oxford.


Video Tutorial: Misunderstanding Regression to the Mean

One of the most widely misunderstood statistical concepts is regression to the mean. In this video tutorial, I address common false beliefs about regression to the mean and answer the following questions:

1. What is regression to the mean?
2. Do variables become less variable each time they are measured?
3. Does regression to the mean happen all the time or just in certain situations?
4. Does repeated testing cause people to come closer and closer to the mean?
5. How is regression to the mean relevant in death penalty cases?


Keeping beautiful R graphs beautiful in PowerPoint

Over the past few years I have fallen in love with graphs made with R. They can be as beautiful as your imagination can make them. When converted to .pdf, thanks to vector graphics, they stay crisp at any size.

Unfortunately, it is not always easy to preserve this beauty when I want to publish these graphs or show them in professional presentations. Yes, all the glories of R graphics are preserved perfectly using knitr + LaTeX + Beamer + Slidify. However, I have many presentations that I have painstakingly prepared in PowerPoint. If I want to add just one new image to such a presentation, it would be very difficult to remake the whole presentation in a different format. Furthermore, almost no psychology journals accept LaTeX documents.

Because I was unaware of a better solution, I have made graphics in R, converted them to .pdf, used Adobe Reader’s snapshot tool to copy the images (zoomed to 200%), and pasted the images as bitmaps in PowerPoint or Word. The images look okay but they are not truly scalable. It has been disappointing to see them lose image quality.

MS Office documents have a scalable vector format for images called enhanced metafile (.emf). R has the ability to save graphics as enhanced metafiles but the results are ugly (no anti-aliasing).

The best (free) solution I have found so far is to save graphics made in R as scalable vector graphics (.svg), open them in Inkscape (a free alternative to Adobe Illustrator), save them as enhanced metafiles (.emf), and import them into PowerPoint or Word. The results are (nearly) perfect! The only flaws I have identified so far are that semi-transparent colors become opaque and some lines with flat endings become lines with rounded endings. In most cases, I am very satisfied with the results.

Also, enhanced metafiles usually have extremely large file sizes when made in PowerPoint. However, when made in R the .emf file sizes have been very reasonable.


Estimating Latent Scores in Individuals

How to estimate latent scores in individuals when there is a known structural model:

I wrote a commentary in a special issue of the Journal of Psychoeducational Assessment. My article proposes a new way to interpret cognitive profiles. The basic idea is to use the best available latent variable model of the tests and then estimate an individual’s latent scores (with confidence intervals around those estimates). I have made two spreadsheets available, one for the WISC-IV and one for the WAIS-IV.

Five-Factor Model of the WISC-IV

Four-Factor Model of the WAIS-IV

I decided not to provide a spreadsheet for the five-factor model of the WAIS-IV because Gf and g were so highly correlated in that model that it would be nearly impossible to distinguish between Gf and g in individuals. You can think of Gf and g as nearly synonymous (at the latent level).

Schneider, W. J. (2013). What if we took our models seriously? Estimating latent scores in individuals. Journal of Psychoeducational Assessment, 31, 186–201.


Psychometrics from the Ground Up 10: Covariance

In this tutorial I discuss and present visual representations of covariance. Although covariance is not directly informative, it is a fundamental ingredient in almost all of the most frequently used statistical procedures.


Psychometrics from the Ground Up 9: Standard Scores and Why We Need Them

In this video tutorial, I explain why we have standard scores, why there are so many different kinds of standard scores, and how to convert between any two types of standard scores.
