# Using the multivariate truncated normal distribution

In a previous post, I imagined that there was a gifted education program that had a strictly enforced selection procedure: everyone with an IQ of 130 or higher is admitted. With the (univariate) truncated normal distribution, we were able to calculate the mean of the selected group (mean IQ = 135.6).

# Multivariate Truncated Normal Distributions

Reading comprehension has a strong relationship with IQ $(\rho\approx 0.70)$. What is the average reading comprehension score among students in the gifted education program? If we can assume that reading comprehension is normally distributed $(\mu=100, \sigma=15)$ and the relationship between IQ and reading comprehension is linear $(\rho=0.70)$, then we can answer this question using the multivariate truncated normal distribution. Portions of the multivariate normal distribution have been truncated (sliced off). In this case, the blue portion of the bivariate normal distribution of IQ and reading comprehension has been sliced off. The portion remaining (in red), is the distribution we are interested in. Here it is in 3D:

Bivariate normal distribution truncated at IQ = 130

Here is the same distribution with simulated data points in 2D:

Expected values of IQ and reading comprehension when IQ ≥ 130

# Expected Values

In the picture above, the expected value (i.e., mean) for the IQ of the students in the gifted education program is 135.6. In the last post, I showed how to calculate this value.

The expected value (i.e., mean) for the reading comprehension score is 124.9. How is this calculated? The general method is fairly complicated and requires specialized software such as the R package tmvtnorm. However in the bivariate case with a single truncation, we can simply calculate the predicted reading comprehension score when IQ is 135.6:

$\dfrac{\hat{Y}-\mu_Y}{\sigma_Y}=\rho_{XY}\dfrac{X-\mu_X}{\sigma_X}$

$\dfrac{\hat{Y}-100}{15}=0.7\dfrac{135.6-100}{15}$

$\hat{Y}=124.9$

In R, the same answer is obtained via the tmvtnorm package:

library(tmvtnorm)
#Variable names

#Vector of Means
mu<-c(100,100)
names(mu)<-vNames;mu

#Vector of Standard deviations
sigma<-c(15,15)
names(sigma)<-vNames;sigma

#Correlation between IQ and Reading Comprehension
rho<-0.7

#Correlation matrix
R<-matrix(c(1,rho,rho,1),ncol=2)
rownames(R)<-colnames(R)<-vNames;R

#Covariance matrix
C<-diag(sigma)%*%R%*%diag(sigma)
rownames(C)<-colnames(C)<-vNames;C

#Vector of lower bounds (-Inf means negative infinity)
a<-c(130,-Inf)

#Vector of upper bounds (Inf means positive infinity)
b<-c(Inf,Inf)

#Means and covariance matrix of the truncated distribution
m<-mtmvnorm(mean=mu,sigma=C,lower=a,upper=b)
rownames(m$tvar)<-colnames(m$tvar)<-vNames;m

#Means of the truncated distribution
tmu<-m$tmean;tmu #Standard deviations of the truncated distribution tsigma<-sqrt(diag(m$tvar));tsigma

#Correlation matrix of the truncated distribution
tR<-cov2cor(m\$tvar);tR

In running the code above, we learn that the standard deviation of reading comprehension has shrunk from 15 in the general population to 11.28 in the truncated population. In addition, the correlation between IQ and reading comprehension has shrunk from 0.70 in the general population to 0.31 in the truncated population.

# Marginal cumulative distributions

Among the students in the gifted education program, what proportion have reading comprehension scores of 100 or less? The question can be answered with the marginal cumulative distribution function. That is, what proportion of the red truncated region is less than 100 in reading comprehension? Assuming that the code in the previous section has been run already, this code will yield the answer of about 1.3%:

#Proportion of students in the gifted program with reading comprehension of 100 or less
ptmvnorm(lowerx=c(-Inf,-Inf),upperx=c(Inf,100),mean=mu,sigma=C,lower=a,upper=b)

The mean, sigma, lower, and upper parameters define the truncated normal distribution. The lowerx and the upperx parameters define the lower and upper bounds of the subregion in question. In this case, there are no restrictions except an upper limit of 100 on the second axis (the Y-axis).

If we plot the cumulative distribution of reading comprehension scores in the gifted population, it is close to (but not the same as) that of the conditional distribution of reading comprehension at IQ = 135.6.

Marginal cumulative distribution function of the truncated bivariate normal distribution

# What proportion does the truncated distribution occupy in the untruncated distribution?

Imagine that in order to qualify for services for intellectual disability, a person must score 70 or below on an IQ test. Every three years, the person must undergo a re-evaluation. Suppose that the correlation between the original test and the re-evaluation test is $\rho=0.90$. If the entire population were given both tests, what proportion would score 70 or lower on both tests? What proportion would score below 70 on the first test but not on the second test? Such questions can be answered with the pmvnorm function from the mvtnorm package (which is a prerequiste of the tmvtnorm package and this thus already loaded if you ran the previous code blocks).

library(mvtnorm)
#Means
IQmu<-c(100,100)

#Standard deviations
IQsigma<-c(15,15)

#Correlation
IQrho<-0.9

#Correlation matrix
IQcor<-matrix(c(1,IQrho,IQrho,1),ncol=2)

#Covariance matrix
IQcov<-diag(IQsigma)%*%IQcor%*%diag(IQsigma)

#Proportion of the general population scoring 70 or less on both tests
pmvnorm(lower=c(-Inf,-Inf),upper=c(70,70),mean=IQmu,sigma=IQcov)

#Proportion of the general population scoring 70 or less on the first test but not on the second test
pmvnorm(lower=c(-Inf,70),upper=c(70,Inf),mean=IQmu,sigma=IQcov)

What are the means of these truncated distributions?

#Mean scores among people scoring 70 or less on both tests
mtmvnorm(mean=IQmu,sigma=IQcov,lower=c(-Inf,-Inf),upper=c(70,70))

#Mean scores among people scoring 70 or less on the first test but not on the second test
mtmvnorm(mean=IQmu,sigma=IQcov,lower=c(-Inf,70),upper=c(70,Inf))

Combining this information in a plot:

Thus, we can see that the multivariate truncated normal distribution can be used to answer a wide variety of questions. With a little creativity, we can apply it to many more kinds of questions.