
Using the multivariate truncated normal distribution

In a previous post, I imagined that there was a gifted education program that had a strictly enforced selection procedure: everyone with an IQ of 130 or higher is admitted. With the (univariate) truncated normal distribution, we were able to calculate the mean of the selected group (mean IQ = 135.6).

[Figure: Truncated normal distribution]

Multivariate Truncated Normal Distributions

Reading comprehension has a strong relationship with IQ (\rho\approx 0.70). What is the average reading comprehension score among students in the gifted education program? If we can assume that reading comprehension is normally distributed (\mu=100, \sigma=15) and that the relationship between IQ and reading comprehension is linear (\rho=0.70), then we can answer this question using the multivariate truncated normal distribution. A multivariate truncated normal distribution is a multivariate normal distribution from which portions have been truncated (sliced off). In this case, the blue portion of the bivariate normal distribution of IQ and reading comprehension has been sliced off. The remaining portion (in red) is the distribution we are interested in. Here it is in 3D:

Bivariate normal distribution truncated at IQ = 130


Here is the same distribution with simulated data points in 2D:

Truncated bivariate normal distribution with simulated data

Expected values of IQ and reading comprehension when IQ ≥ 130
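Simulated points like those in the figure can be approximated with a few lines of base R. This is only a minimal sketch, not the code used to draw the plot: the seed, the sample size, and the variable names are my own assumptions, and it uses simple rejection sampling (simulate the whole population, then keep the cases with an IQ of 130 or higher).

```r
# Simulate the untruncated bivariate normal population
# (seed and sample size are arbitrary assumptions)
set.seed(1)
n   <- 1e6
rho <- 0.7
iq  <- rnorm(n, mean = 100, sd = 15)

# Reading comprehension = linear prediction from IQ + normal residual,
# which gives mean 100, SD 15, and a correlation of rho with IQ
reading <- 100 + rho * (iq - 100) +
  rnorm(n, mean = 0, sd = 15 * sqrt(1 - rho^2))

# Truncate: keep only the students admitted to the gifted program
admitted  <- iq >= 130
sim_means <- c(IQ = mean(iq[admitted]),
               `Reading Comprehension` = mean(reading[admitted]))
round(sim_means, 1)
```

With a sample this large, the two sample means should land close to the expected values discussed in the next section.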

Expected Values

In the picture above, the expected value (i.e., mean) for the IQ of the students in the gifted education program is 135.6. In the last post, I showed how to calculate this value.

The expected value (i.e., mean) for the reading comprehension score is 124.9. How is this calculated? The general method is fairly complicated and requires specialized software such as the R package tmvtnorm. However, in the bivariate case with a single truncation, we can simply calculate the predicted reading comprehension score when IQ is 135.6. This shortcut works because the regression of reading comprehension on IQ is linear, so the mean of the predicted scores equals the predicted score at the mean IQ:

\dfrac{\hat{Y}-\mu_Y}{\sigma_Y}=\rho_{XY}\dfrac{X-\mu_X}{\sigma_X}

\dfrac{\hat{Y}-100}{15}=0.7\dfrac{135.6-100}{15}

\hat{Y}=124.9
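The same arithmetic can be checked in base R without any packages. This is a sketch under the assumptions stated above (both variables with a mean of 100 and a standard deviation of 15, a correlation of 0.7, and a lower bound of 130 on IQ only); the variable names are mine.

```r
# Population parameters from the text
mu_x <- 100; sigma_x <- 15   # IQ
mu_y <- 100; sigma_y <- 15   # Reading comprehension
rho    <- 0.7
cutoff <- 130

# Mean IQ in the truncated group (lower bound only, no upper bound)
z_a     <- (cutoff - mu_x) / sigma_x
mean_iq <- mu_x + sigma_x * dnorm(z_a) / (1 - pnorm(z_a))

# Predicted reading comprehension at that mean IQ
mean_reading <- mu_y + rho * (sigma_y / sigma_x) * (mean_iq - mu_x)

round(c(IQ = mean_iq, Reading = mean_reading), 1)
```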

In R, the same answer is obtained via the tmvtnorm package:

library(tmvtnorm)
#Variable names
vNames<-c("IQ","Reading Comprehension")

#Vector of Means
mu<-c(100,100)
names(mu)<-vNames;mu

#Vector of Standard deviations
sigma<-c(15,15)
names(sigma)<-vNames;sigma

#Correlation between IQ and Reading Comprehension
rho<-0.7

#Correlation matrix
R<-matrix(c(1,rho,rho,1),ncol=2)
rownames(R)<-colnames(R)<-vNames;R

#Covariance matrix
C<-diag(sigma)%*%R%*%diag(sigma)
rownames(C)<-colnames(C)<-vNames;C

#Vector of lower bounds (-Inf means negative infinity)
a<-c(130,-Inf)

#Vector of upper bounds (Inf means positive infinity)
b<-c(Inf,Inf)

#Means and covariance matrix of the truncated distribution
m<-mtmvnorm(mean=mu,sigma=C,lower=a,upper=b)
rownames(m$tvar)<-colnames(m$tvar)<-vNames;m

#Means of the truncated distribution
tmu<-m$tmean;tmu

#Standard deviations of the truncated distribution
tsigma<-sqrt(diag(m$tvar));tsigma

#Correlation matrix of the truncated distribution
tR<-cov2cor(m$tvar);tR

In running the code above, we learn that the standard deviation of reading comprehension has shrunk from 15 in the general population to 11.28 in the truncated population. In addition, the correlation between IQ and reading comprehension has shrunk from 0.70 in the general population to 0.31 in the truncated population.
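For this particular case (two variables, with truncation on only one of them), the shrunken standard deviation and correlation can also be cross-checked in closed form. The sketch below is not how tmvtnorm computes its results; it relies on the standard variance formula for a one-sided truncated normal and on the assumption that the regression of reading comprehension on IQ is unchanged by selecting on IQ. Variable names are mine.

```r
mu <- 100; sigma_x <- 15; sigma_y <- 15; rho <- 0.7; cutoff <- 130

z_a    <- (cutoff - mu) / sigma_x
lambda <- dnorm(z_a) / (1 - pnorm(z_a))   # inverse Mills ratio

# Variance of IQ after truncation from below
var_x_t <- sigma_x^2 * (1 + z_a * lambda - lambda^2)

# Reading comprehension = slope * IQ + residual; only the IQ part shrinks
slope    <- rho * sigma_y / sigma_x
var_y_t  <- slope^2 * var_x_t + sigma_y^2 * (1 - rho^2)
cov_xy_t <- slope * var_x_t

sd_y_t <- sqrt(var_y_t)
cor_t  <- cov_xy_t / sqrt(var_x_t * var_y_t)

round(c(SD_reading = sd_y_t, correlation = cor_t), 2)
```

The residual variance is untouched by the selection, which is why the standard deviation of reading comprehension shrinks much less than that of IQ.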

Marginal cumulative distributions

Among the students in the gifted education program, what proportion have reading comprehension scores of 100 or less? The question can be answered with the marginal cumulative distribution function. That is, what proportion of the red truncated region is less than 100 in reading comprehension? Assuming that the code in the previous section has been run already, this code will yield the answer of about 1.3%:

#Proportion of students in the gifted program with reading comprehension of 100 or less
ptmvnorm(lowerx=c(-Inf,-Inf),upperx=c(Inf,100),mean=mu,sigma=C,lower=a,upper=b)

The mean, sigma, lower, and upper parameters define the truncated normal distribution. The lowerx and the upperx parameters define the lower and upper bounds of the subregion in question. In this case, there are no restrictions except an upper limit of 100 on the second axis (the Y-axis).

If we plot the cumulative distribution of reading comprehension scores in the gifted population, it is close to (but not the same as) that of the conditional distribution of reading comprehension at IQ = 135.6.
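To make the comparison concrete, here is a base R sketch that evaluates both curves at a reading comprehension score of 100. It does not use tmvtnorm; instead, it integrates the conditional distribution of reading comprehension over the truncated IQ distribution. The function and variable names are hypothetical.

```r
mu_x <- 100; sigma_x <- 15; mu_y <- 100; sigma_y <- 15
rho <- 0.7; cutoff <- 130

# Conditional distribution of reading comprehension at a given IQ
cond_mean <- function(iq) mu_y + rho * (sigma_y / sigma_x) * (iq - mu_x)
cond_sd   <- sigma_y * sqrt(1 - rho^2)

# Marginal cdf of reading comprehension among students with IQ >= cutoff
marginal_cdf <- function(y) {
  integrand <- function(iq) {
    dnorm(iq, mu_x, sigma_x) * pnorm(y, cond_mean(iq), cond_sd)
  }
  integrate(integrand, lower = cutoff, upper = Inf)$value /
    pnorm(cutoff, mu_x, sigma_x, lower.tail = FALSE)
}

# Mean IQ in the gifted group (about 135.6)
z_a     <- (cutoff - mu_x) / sigma_x
mean_iq <- mu_x + sigma_x * dnorm(z_a) / (1 - pnorm(z_a))

p_marginal    <- marginal_cdf(100)
p_conditional <- pnorm(100, cond_mean(mean_iq), cond_sd)
c(marginal = p_marginal, conditional = p_conditional)
```

At this point in the lower tail, the marginal proportion is somewhat larger than the conditional one, because the gifted group includes students whose IQ is between 130 and 135.6.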


Marginal cumulative distribution function of the truncated bivariate normal distribution

What proportion does the truncated distribution occupy in the untruncated distribution?

Imagine that in order to qualify for services for intellectual disability, a person must score 70 or below on an IQ test. Every three years, the person must undergo a re-evaluation. Suppose that the correlation between the original test and the re-evaluation test is \rho=0.90. If the entire population were given both tests, what proportion would score 70 or lower on both tests? What proportion would score 70 or lower on the first test but not on the second test? Such questions can be answered with the pmvnorm function from the mvtnorm package (which is a prerequisite of the tmvtnorm package and is thus already loaded if you ran the previous code blocks).

library(mvtnorm)
#Means
IQmu<-c(100,100)

#Standard deviations
IQsigma<-c(15,15)

#Correlation
IQrho<-0.9

#Correlation matrix
IQcor<-matrix(c(1,IQrho,IQrho,1),ncol=2)

#Covariance matrix
IQcov<-diag(IQsigma)%*%IQcor%*%diag(IQsigma)

#Proportion of the general population scoring 70 or less on both tests
pmvnorm(lower=c(-Inf,-Inf),upper=c(70,70),mean=IQmu,sigma=IQcov)

#Proportion of the general population scoring 70 or less on the first test but not on the second test
pmvnorm(lower=c(-Inf,70),upper=c(70,Inf),mean=IQmu,sigma=IQcov)
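One sanity check on these two proportions is that they must add up to the proportion scoring 70 or lower on the first test alone, which is pnorm(70, 100, 15). The base R sketch below reproduces the two regions by numerical integration instead of pmvnorm; it is a cross-check under the same assumptions, and the variable names are hypothetical.

```r
mu <- 100; sigma <- 15; rho <- 0.9; cutoff <- 70

# Distribution of the second score given the first score
cond_mean <- function(x1) mu + rho * (x1 - mu)
cond_sd   <- sigma * sqrt(1 - rho^2)

# P(first <= 70 and second <= 70)
both_low <- integrate(
  function(x1) dnorm(x1, mu, sigma) * pnorm(cutoff, cond_mean(x1), cond_sd),
  lower = -Inf, upper = cutoff
)$value

# P(first <= 70 and second > 70)
first_only <- integrate(
  function(x1) dnorm(x1, mu, sigma) *
    pnorm(cutoff, cond_mean(x1), cond_sd, lower.tail = FALSE),
  lower = -Inf, upper = cutoff
)$value

c(both_low = both_low,
  first_only = first_only,
  total = both_low + first_only,
  first_test_alone = pnorm(cutoff, mu, sigma))
```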

What are the means of these truncated distributions?

#Mean scores among people scoring 70 or less on both tests
mtmvnorm(mean=IQmu,sigma=IQcov,lower=c(-Inf,-Inf),upper=c(70,70))

#Mean scores among people scoring 70 or less on the first test but not on the second test
mtmvnorm(mean=IQmu,sigma=IQcov,lower=c(-Inf,70),upper=c(70,Inf))

Combining this information in a plot:

[Figure: TwoLowerBoundsTruncated]

Thus, we can see that the multivariate truncated normal distribution can be used to answer a wide variety of questions. With a little creativity, we can apply it to many more kinds of questions.


Using the truncated normal distribution

The term truncated normal distribution may sound highly technical, but the idea is actually fairly simple and has many practical applications. If the math below is daunting, be assured that it is not necessary to understand the notation and the technical details. I have created a user-friendly spreadsheet that performs all the calculations automatically.

The mean of a truncated normal distribution

Imagine that your school district has a gifted education program. All students in the program have an IQ of 130 or higher. What is the average IQ of this group? Assume that in your school district, IQ is normally distributed with a mean of 100 and a standard deviation of 15.

[Figure: Truncated normal distribution]

Questions like this one can be answered by calculating the mean of the truncated normal distribution. The truncated normal distribution is a normal distribution in which one or both ends have been sliced off (i.e., truncated). In this case, everything below 130 has been sliced off (and there is no upper bound).

Four parameters determine the properties of the truncated normal distribution:

μ = mean of the normal distribution (before truncation)
σ = standard deviation of the normal distribution (before truncation)
a = the lower bound of the distribution (can be as low as −∞)
b = the upper bound of the distribution (can be as high as +∞)

The formula for the mean of a truncated distribution is a bit of a mess but can be simplified by finding the z-scores associated with the lower and upper bounds of the distribution:

z_a=\dfrac{a-\mu}{\sigma}

z_b=\dfrac{b-\mu}{\sigma}

The expected value of the truncated distribution (i.e., the mean):
E(X)=\mu+\sigma\dfrac{\phi(z_a)-\phi(z_b)}{\Phi(z_b)-\Phi(z_a)}

Where \phi is the probability density function of the standard normal distribution (NORMDIST(z,0,1,FALSE) in Excel, dnorm(z) in R) and \Phi is the cumulative distribution function of the standard normal distribution (NORMSDIST(z) in Excel, pnorm(z) in R).
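As a worked example, here is the formula applied to the gifted education question (\mu=100, \sigma=15, a=130, and no upper bound, so b=+\infty). The density and cumulative values are rounded to five decimal places.

```latex
z_a=\dfrac{130-100}{15}=2, \qquad z_b=+\infty

\phi(z_a)\approx 0.05399, \quad \phi(z_b)=0, \quad \Phi(z_a)\approx 0.97725, \quad \Phi(z_b)=1

E(X)\approx 100+15\cdot\dfrac{0.05399-0}{1-0.97725}\approx 135.6
```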

This spreadsheet calculates the mean (and standard deviation) of a truncated distribution. See the part below the plot that says “Truncated Normal Distribution.”

In R you could make a function to calculate the mean of a truncated distribution like so:

MeanNormalTruncated<-function(mu=0,sigma=1,a=-Inf,b=Inf){
  mu+sigma*(dnorm((a-mu)/sigma)-dnorm((b-mu)/sigma))/(pnorm((b-mu)/sigma)-pnorm((a-mu)/sigma))
}

#Example: Find the mean of a truncated normal distribution with mu = 100, sigma = 15, and lower bound = 130
MeanNormalTruncated(mu=100,sigma=15,a=130)
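The standard deviation of a truncated normal distribution can be computed in a similar way. The function below is a sketch based on the standard variance formula for the truncated normal distribution; its name and the helper zphi are my own, it is not taken from the spreadsheet, and it only handles one set of bounds at a time (it is not vectorized).

```r
SDNormalTruncated <- function(mu = 0, sigma = 1, a = -Inf, b = Inf) {
  za <- (a - mu) / sigma
  zb <- (b - mu) / sigma
  # z * dnorm(z) tends to 0 as z goes to +/-Inf, but Inf * 0 is NaN in R,
  # so infinite bounds are handled explicitly
  zphi <- function(z) if (is.finite(z)) z * dnorm(z) else 0
  Z     <- pnorm(zb) - pnorm(za)            # proportion of the distribution retained
  ratio <- (dnorm(za) - dnorm(zb)) / Z
  sigma * sqrt(1 + (zphi(za) - zphi(zb)) / Z - ratio^2)
}

#Example: SD of IQ among students with IQ of 130 or higher (about 5.07)
SDNormalTruncated(mu = 100, sigma = 15, a = 130)
```

With no truncation at all (the default bounds), the function returns sigma unchanged, which is a quick way to check it.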

The cumulative distribution function of the truncated normal distribution

Suppose that we wish to know the proportion of students in the same gifted education program who score 140 or more. The cumulative truncated normal distribution function tells us the proportion of the distribution that is less than a particular value.

cdf=\dfrac{\Phi(z_x)-\Phi(z_a)}{\Phi(z_b)-\Phi(z_a)}

Where z_x = \dfrac{X-\mu}{\sigma}

In the previously mentioned spreadsheet, the cumulative distribution function is the proportion of the shaded region that is less than the value you specify.

You can create your own cumulative distribution function for the truncated normal distribution in R like so:

cdfNormalTruncated<-function(x=0,mu=0,sigma=1,a=-Inf,b=Inf){
  (pnorm((x-mu)/sigma)-pnorm((a-mu)/sigma))/(pnorm((b-mu)/sigma)-pnorm((a-mu)/sigma))
}
#Example: Find the proportion of the distribution less than 140
cdfNormalTruncated(x=140,mu=100,sigma=15,a=130)

In this case, the cumulative distribution function returns approximately 0.8316. Subtracting this value from 1 gives the proportion of scores 140 and higher: 0.1684. This means that about 17% of students in the gifted program can be expected to have IQ scores of 140 or more. [1]

The truncated normal distribution in R

A fuller range of functions related to the truncated normal distribution can be found in the truncnorm package in R, including the expected value (mean), variance, pdf, cdf, quantile, and random number generation functions.
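As a brief illustration, the gifted education example could be reworked with truncnorm roughly as follows. This sketch assumes the package is installed; the result names are mine.

```r
library(truncnorm)  # install.packages("truncnorm") if it is not installed

# Mean and standard deviation of IQ in the gifted program
tn_mean <- etruncnorm(a = 130, b = Inf, mean = 100, sd = 15)
tn_sd   <- sqrt(vtruncnorm(a = 130, b = Inf, mean = 100, sd = 15))

# Proportion of the gifted program scoring 140 or more
tn_p140 <- 1 - ptruncnorm(140, a = 130, b = Inf, mean = 100, sd = 15)

# Median IQ in the gifted program
tn_median <- qtruncnorm(0.5, a = 130, b = Inf, mean = 100, sd = 15)

# Five randomly generated IQ scores from the gifted program
tn_sample <- rtruncnorm(5, a = 130, b = Inf, mean = 100, sd = 15)

c(mean = tn_mean, sd = tn_sd, p140 = tn_p140, median = tn_median)
```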

[1] In the interest of precision, I need to say that because IQ scores are rounded to the nearest integer, a slight adjustment needs to be made. The true lower bound of the truncated distribution is not 130 but 129.5. Furthermore, we want the proportion of scores 139.5 and higher, not 140 and higher. This means that the expected proportion of students with IQ scores of “140” and higher in the gifted program is about 0.1718 instead of 0.1684. Of course, there is little difference between these estimates and such precision is not usually needed for “back-of-the-envelope” estimates such as this one.
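The adjustment described in this note can be reproduced with the cumulative distribution function defined earlier, repeated here so that the snippet runs on its own. The result names are mine.

```r
cdfNormalTruncated <- function(x = 0, mu = 0, sigma = 1, a = -Inf, b = Inf) {
  (pnorm((x - mu) / sigma) - pnorm((a - mu) / sigma)) /
    (pnorm((b - mu) / sigma) - pnorm((a - mu) / sigma))
}

# Treating IQ as continuous: bound of 130, scores of 140 and higher
p_unrounded <- 1 - cdfNormalTruncated(x = 140, mu = 100, sigma = 15, a = 130)

# Allowing for rounding: bound of 129.5, scores of 139.5 and higher
p_rounded <- 1 - cdfNormalTruncated(x = 139.5, mu = 100, sigma = 15, a = 129.5)

round(c(unrounded = p_unrounded, rounded = p_rounded), 4)
```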