In a previous post, I imagined that there was a gifted education program that had a strictly enforced selection procedure: everyone with an IQ of 130 or higher is admitted. With the (univariate) truncated normal distribution, we were able to calculate the mean of the selected group (mean IQ = 135.6).
Multivariate Truncated Normal Distributions
Reading comprehension has a strong relationship with IQ. What is the average reading comprehension score among students in the gifted education program? If we can assume that reading comprehension is normally distributed and that the relationship between IQ and reading comprehension is linear, then we can answer this question using the multivariate truncated normal distribution, in which portions of the multivariate normal distribution have been truncated (sliced off). In this case, the blue portion of the bivariate normal distribution of IQ and reading comprehension has been sliced off. The remaining portion (in red) is the distribution we are interested in. Here it is in 3D:
Here is the same distribution with simulated data points in 2D:
In the picture above, the expected value (i.e., mean) for the IQ of the students in the gifted education program is 135.6. In the last post, I showed how to calculate this value.
The expected value (i.e., mean) for the reading comprehension score is 124.9. How is this calculated? The general method is fairly complicated and requires specialized software such as the R package tmvtnorm. However, in the bivariate case with a single truncation, we can simply calculate the predicted reading comprehension score when IQ is 135.6. Both variables have a mean of 100 and a standard deviation of 15, and their correlation is 0.70, so the prediction is 100 + 0.70 × (135.6 − 100) ≈ 124.9.
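Why does the regression shortcut work? A brief sketch, assuming the linear relationship and normality stated earlier: selection acts only on IQ, so the mean reading score in the selected group is the regression prediction averaged over the selected IQs. The symbols below (X for IQ, Y for reading comprehension) are my own notation, not taken from the figures.

```latex
\begin{aligned}
E[Y \mid X \ge 130]
  &= E\bigl[\,E[Y \mid X] \;\big|\; X \ge 130\,\bigr] \\
  &= \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\,\bigl(E[X \mid X \ge 130] - \mu_X\bigr) \\
  &= 100 + 0.70 \cdot \frac{15}{15}\,(135.6 - 100) \approx 124.9
\end{aligned}
```

Because the conditional mean of Y is linear in X, averaging it over the truncated IQ distribution only requires the truncated mean of IQ.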
In R, the same answer is obtained via the mtmvnorm function from the tmvtnorm package:
library(tmvtnorm)
#Variable names
vNames<-c("IQ","Reading Comprehension")
#Vector of Means
mu<-c(100,100)
names(mu)<-vNames;mu
#Vector of Standard deviations
sigma<-c(15,15)
names(sigma)<-vNames;sigma
#Correlation between IQ and Reading Comprehension
rho<-0.7
#Correlation matrix
R<-matrix(c(1,rho,rho,1),ncol=2)
rownames(R)<-colnames(R)<-vNames;R
#Covariance matrix
C<-diag(sigma)%*%R%*%diag(sigma)
rownames(C)<-colnames(C)<-vNames;C
#Vector of lower bounds (-Inf means negative infinity)
a<-c(130,-Inf)
#Vector of upper bounds (Inf means positive infinity)
b<-c(Inf,Inf)
#Means and covariance matrix of the truncated distribution
m<-mtmvnorm(mean=mu,sigma=C,lower=a,upper=b)
rownames(m$tvar)<-colnames(m$tvar)<-vNames;m
#Means of the truncated distribution
tmu<-m$tmean;tmu
#Standard deviations of the truncated distribution
tsigma<-sqrt(diag(m$tvar));tsigma
#Correlation matrix of the truncated distribution
tR<-cov2cor(m$tvar);tR
Running the code above, we learn that the standard deviation of reading comprehension has shrunk from 15 in the general population to 11.28 in the truncated population. In addition, the correlation between IQ and reading comprehension has shrunk from 0.70 in the general population to 0.31 in the truncated population.
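For this particular case (a single lower cutoff on one variable), the shrinkage can also be checked by hand with the standard closed-form moments of a one-sided truncated normal. The following is a minimal sketch in base R, not the method tmvtnorm uses internally; all variable names are my own.

```r
# Closed-form check for one-sided truncation on IQ (base R only)
mu_x <- 100; mu_y <- 100    # population means
sd_x <- 15;  sd_y <- 15     # population standard deviations
rho  <- 0.7                 # population correlation
cut  <- 130                 # lower truncation point on IQ

alpha  <- (cut - mu_x) / sd_x                              # cutoff in z-score units
lambda <- dnorm(alpha) / pnorm(alpha, lower.tail = FALSE)  # inverse Mills ratio
delta  <- lambda * (lambda - alpha)                        # variance reduction factor

# Moments of IQ after truncation
mean_x_trunc <- mu_x + sd_x * lambda
var_x_trunc  <- sd_x^2 * (1 - delta)

# Moments of reading comprehension among the selected students
mean_y_trunc <- mu_y + rho * sd_y * lambda
var_y_trunc  <- sd_y^2 * (1 - rho^2 * delta)

# Covariance and correlation in the truncated population
cov_xy_trunc <- rho * sd_x * sd_y * (1 - delta)
cor_xy_trunc <- cov_xy_trunc / sqrt(var_x_trunc * var_y_trunc)

round(c(mean_IQ = mean_x_trunc, mean_RC = mean_y_trunc,
        sd_RC = sqrt(var_y_trunc), cor = cor_xy_trunc), 2)
```

The rounded values should agree with the tmvtnorm results quoted above. With truncation on both variables, or in more than two dimensions, this shortcut no longer applies and the package is the practical route.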
Marginal cumulative distributions
Among the students in the gifted education program, what proportion have reading comprehension scores of 100 or less? The question can be answered with the marginal cumulative distribution function. That is, what proportion of the red truncated region is at or below 100 in reading comprehension? Assuming that the code in the previous section has already been run, the following code yields an answer of about 1.3%:
#Proportion of students in the gifted program with reading comprehension of 100 or less
ptmvnorm(lowerx=c(-Inf,-Inf),upperx=c(Inf,100),mean=mu,sigma=C,lower=a,upper=b)
The lower and upper parameters define the truncated normal distribution. The lowerx and upperx parameters define the lower and upper bounds of the subregion in question. In this case, there are no restrictions except an upper limit of 100 on the second axis (the Y-axis).
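As a cross-check that does not depend on tmvtnorm, the same proportion can be approximated by one-dimensional numerical integration: for each IQ at or above the cutoff, take the conditional probability of a reading score of 100 or less, weight it by the IQ density, and divide by the size of the gifted group. This is a sketch in base R with names of my own choosing (joint_slice, prop_low_reading), not how ptmvnorm computes its result.

```r
# Cross-check of the marginal CDF value with base R only
mu <- 100; sigma <- 15; rho <- 0.7   # same parameters as in the post
cut   <- 130                         # admission cutoff on IQ
y_max <- 100                         # reading comprehension threshold

# P(reading <= y_max | IQ = x) times the density of IQ at x
joint_slice <- function(x) {
  cond_mean <- mu + rho * (x - mu)   # slope equals rho because both SDs are 15
  cond_sd   <- sigma * sqrt(1 - rho^2)
  pnorm(y_max, cond_mean, cond_sd) * dnorm(x, mu, sigma)
}

# P(IQ >= cut and reading <= y_max), then divide by P(IQ >= cut)
p_joint  <- integrate(joint_slice, lower = cut, upper = Inf)$value
p_gifted <- pnorm(cut, mu, sigma, lower.tail = FALSE)
prop_low_reading <- p_joint / p_gifted
prop_low_reading
```

The result should land near the 1.3% reported above, up to numerical integration error.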
If we plot the cumulative distribution of reading comprehension scores in the gifted population, it is close to (but not the same as) that of the conditional distribution of reading comprehension at IQ = 135.6.
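To see how the two distributions differ at one point, here is a small sketch of the conditional distribution of reading comprehension at exactly IQ = 135.6, assuming the same means, standard deviations, and correlation as before. Variable names are illustrative.

```r
# Conditional distribution of reading comprehension at IQ = 135.6 (base R only)
mu <- 100; sigma <- 15; rho <- 0.7
iq <- 135.6

cond_mean <- mu + rho * (iq - mu)       # predicted reading score at this IQ
cond_sd   <- sigma * sqrt(1 - rho^2)    # residual SD around the regression line

# Proportion scoring 100 or less in the conditional distribution
cond_prop_100 <- pnorm(100, cond_mean, cond_sd)
c(mean = cond_mean, sd = cond_sd, prop_100_or_less = cond_prop_100)
```

The conditional distribution has the same mean as the gifted group but a smaller standard deviation (about 10.7 versus 11.28), so it puts about 1.0% at or below 100 rather than about 1.3%. The gifted group mixes students across all IQs above 130, which adds spread.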
What proportion does the truncated distribution occupy in the untruncated distribution?
Imagine that in order to qualify for services for intellectual disability, a person must score 70 or below on an IQ test. Every three years, the person must undergo a re-evaluation. Suppose that the correlation between the original test and the re-evaluation test is 0.9. If the entire population were given both tests, what proportion would score 70 or lower on both tests? What proportion would score 70 or lower on the first test but not on the second test? Such questions can be answered with the pmvnorm function from the mvtnorm package (which is a prerequisite of the tmvtnorm package and is thus already loaded if you ran the previous code blocks).
library(mvtnorm)
#Means
IQmu<-c(100,100)
#Standard deviations
IQsigma<-c(15,15)
#Correlation
IQrho<-0.9
#Correlation matrix
IQcor<-matrix(c(1,IQrho,IQrho,1),ncol=2)
#Covariance matrix
IQcov<-diag(IQsigma)%*%IQcor%*%diag(IQsigma)
#Proportion of the general population scoring 70 or less on both tests
pmvnorm(lower=c(-Inf,-Inf),upper=c(70,70),mean=IQmu,sigma=IQcov)
#Proportion of the general population scoring 70 or less on the first test but not on the second test
pmvnorm(lower=c(-Inf,70),upper=c(70,Inf),mean=IQmu,sigma=IQcov)
What are the means of these truncated distributions?
#Mean scores among people scoring 70 or less on both tests
mtmvnorm(mean=IQmu,sigma=IQcov,lower=c(-Inf,-Inf),upper=c(70,70))
#Mean scores among people scoring 70 or less on the first test but not on the second test
mtmvnorm(mean=IQmu,sigma=IQcov,lower=c(-Inf,70),upper=c(70,Inf))
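A rough way to sanity-check both the proportions and the means is to simulate the two tests directly. The sketch below uses base R only and assumes the same means (100), standard deviations (15), and test-retest correlation (0.9); the seed, sample size, and variable names are arbitrary choices of mine. Being a simulation, it gives approximate values that vary slightly with the seed.

```r
# Monte Carlo sketch (base R only); variable names are illustrative
set.seed(1)                  # arbitrary seed, for reproducibility
n   <- 1e6                   # number of simulated people
rho <- 0.9                   # test-retest correlation

# Correlated standard normal scores, rescaled to the IQ metric
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
test1 <- 100 + 15 * z1
test2 <- 100 + 15 * z2

both       <- test1 <= 70 & test2 <= 70   # 70 or lower on both tests
first_only <- test1 <= 70 & test2 >  70   # 70 or lower on the first test only

prop_both       <- mean(both)
prop_first_only <- mean(first_only)
c(both = prop_both, first_only = prop_first_only)

# Mean scores within each group
means_both       <- c(test1 = mean(test1[both]), test2 = mean(test2[both]))
means_first_only <- c(test1 = mean(test1[first_only]), test2 = mean(test2[first_only]))
means_both
means_first_only
```

One built-in check: the two proportions should sum to roughly pnorm(70, 100, 15), the share of the population scoring 70 or lower on the first test (about 2.3%).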
Combining this information in a plot: