Statistics

# Two visualizations for explaining “variance explained”

In my introductory statistics class, I feel uneasy when I have explain what variance explained means. The term has two things I don’t like. First, I don’t like variance very much. I feel much more comfortable with standard deviations. I understand that at a deep level variance is a more fundamental concept than the standard deviation. However, variance is a poor descriptive statistic because there is no direct visual analog for variance in a probability distribution plot. In contrast, the standard deviation illustrates very clearly how much scores typically deviate from the mean. So, variance explained is hard to grasp in part because variance is hard to visualize.

The second thing I don’t like about variance explained is the whole “explained” business. As I mentioned in my last post, variance explained does not actually mean that we have explained anything, at least in a causal sense. That is, it does not imply that we know what is going on. It simply means that we can use one or more variables to predict things more accurately than before.

In many models, if X is correlated with Y, X can be said to “explain” variance in Y even though X does not really cause Y. However, in some situations the term variance explained is accurate in every sense:

X causes Y

In the model above, the arrow means that X really is a partial cause of Y. Why does Y vary? Because of variability in X, at least in part. In this example, 80% of Y’s variance is due to X, with the remaining variance due to something else (somewhat misleadingly termed error). It is not an “error” in that something is wrong or that someone is making a mistake. It is merely that which causes our predictions of Y to be off. Prediction error is probably not a single variable. It it likely to be the sum total of many influences.

Because X and error are uncorrelated z-scores in this example, the path coefficients are equal to the correlations with Y. Squaring the correlation coefficients yields the variance explained. The coefficients for X and error are actually the square roots of .8 and .2, respectively. Squaring the coefficients tells us that X explains 80% of the variance in Y and error explains the rest.

# Visualizing Variance Explained

Okay, if X predicts Y, then the variance explained is equal to the correlation coefficient squared. Unfortunately, this is merely a formula. It does not help us understand what it means. Perhaps this visualization will help:

Variance Explained

If you need to guess every value of Y but you know nothing about Y except that it has a mean of zero, then you should guess zero every time. You’ll be wrong most of the time, but pursuing other strategies will result in even larger errors. The variance of your prediction errors will be equal to the variance of Y. In the picture above, this corresponds to a regression line that passes through the mean of Y and has a slope of zero. No matter what X is, you guess that Y is zero. The squared vertical distance from Y to the line is represented by the translucent squares. The average area of the squares is the variance of Y.

If you happen to know the value of X each time you need to guess what Y will be, then you can use a regression equation to make a better guess. Your prediction of Y is called Y-hat (Ŷ):

$\hat{Y}=b_0+b_1X=0+\sqrt{0.80}X\approx 0.89X$

When X and Y have the same variance, the slope of the regression line is equal to the correlation coefficient, 0.89. The distance from Ŷ (the predicted value of Y) to the actual value of Y is the prediction error. In the picture above, the variance of the prediction errors (0.2) is the average size of the squares when the slope is equal to the correlation coefficient.

Thus, when X is not used to predict Y, our prediction errors have a variance of 1. When we do use X to predict Y, the average size of the prediction errors shrinks from 1 to 0.2, an 80% reduction. This is what is meant when we say that “X explains 80% of the variance in Y.” It is the proportion by which the variance of the prediction errors shrinks.

# An alternate visualization

Suppose that we flip 50 coins and record how many heads there are. We do this over and over. The values we record constitute the variable Y. The number of heads we get each time we flip a coin happens to have a binomial distribution. The mean of a binomial distribution is determined by the probability p of an event occurring on a single trial (i.e., getting a head on a single toss) and the number of events k (i.e., the number of coins thrown). As k increases, the binomial distribution begins to resemble the normal distribution. The probability p of getting a head on any one coin toss is 0.5 and the number of coins k is 50. The mean number of heads over the long run is:

$\mu = pk=0.5*50=25$

The variance of the binomial distribution:

$\sigma^2 = p(1-p)k=0.5*(1-0.5)*50=12.5$

Before we toss the coins, we should guess that we will toss an average number of heads, 25. We will be wrong much of the time but our prediction errors will be as small as they can be, over the long run. The variance of our prediction errors is equal to the variance of Y, 12.5.

Now suppose that after tossing 80% of our coins (i.e., 40 coins), we count the number of heads. This value is recorded as variable X. The remaining 20% of the coins (10 coins) are then tossed and the total number of heads is counted from all 50 coins. We can use a regression equation to predict Y from X. The intercept will be the mean number of heads from the remaining 10 coins:

$\hat{Y} = b_0+b_1X=5+X$

In the diagram below, each peg represents a coin toss. If the outcome is heads, the dot moves right. If the outcome is tails, the dot moves left. The purple line represents the probability distribution of Y before any coin has been tossed.

X explains 80% of the variance in Y.

When the dot gets to the red line (after 40 tosses or 80% of the total), we can make a new guess as to what Y is going to be. This conditional distribution is represented by a blue line. The variance of the conditional distribution has a mean equal to Ŷ, with a variance of 2.5 (the variance of the 10 remaining coins).

The variability in Y is caused by the outcomes of 50 coin tosses. If 80% of those coins are the variable X, then X explains 80% of the variance in Y. The remaining 10 coins represent the variability of Y that is not determined by X (i.e., the error term). They determine 20% of the variance in Y.

If X represented only the first 20 of 50 coins, then X would explain 40% of the variance in Y.

X explains 40% of the variance in Y.

Standard
Statistics

# Unfortunate statistical terms

I like most of the technical terms we use in statistics. However, there are a few of them that I wish were easier to teach and remember. Many others have opined on such matters. This is my list of complaints:

• Statistical significance: This term is so universally hated I am surprised that we haven’t held a convention and banned its use. How many journalists have been mislead by researchers’ technical use of significance? I wish we said something like “not merely random” or “probably not zero.”
• Type I/Type II error: It is hard to remember which is which because the terms don’t convey any clues as to what they mean. I wish more informative metaphors were used such as false hit and false miss.
• Power: Statistical power refers to the probability that the null hypothesis will be rejected, provided that the null hypothesis is false. The term is not self-explanatory and requires memorization! I wish we used a better term such as true hit rate or false null rejection rate. While we’re at it, α and β are not much better. False hit rate (or true null rejection rate) and false miss rate (or false null retention rate) would be easier to remember.
• Prediction error: The word error in English typically refers to an action that results in harm that could have been avoided if better choices had been made. In the context of statistical models, prediction errors are what you get wrong even though you have done everything right! I wish there were a word that referred to actions that were done in good faith yet resulted in unforeseeable harm. In this case, we already have a perfectly good substitute term that is widely used: disturbance. I suppose that the connotations of disturbance could generate different misunderstandings but in my estimation they are not as bad as those generated by error. I wish that we could just use the term residuals but that refers to something slightly different: the estimate of an error (residual:error::statistic:parameter). We can only know the errors if we know the true model parameters.
• Variance explained: This term works if the predictor is a cause of the criterion variable. However, when it is simply a correlate, it misleadingly suggests that we now understand what is going on. I wish the term were something more neutral such as variance predicted.
• Moderator/Mediator: At least in English, these terms sound so much alike that they are easily confused. I think that we should dump moderator along with related terms interaction effect, simple main effect, and simple slope. I think that the term conditional effects is more descriptive and straightforward.
• Biased: This word is hard to use in its technical sense when talking to non-statisticians. It sounds like we are talking about bigoted statistics! Unfortunately I can’t think of good alternative to it (though I can think of some awkward ones like stable inaccuracy).
• Degrees of freedom: For me, this concept is extremely difficult to explain properly in an introductory course. Students are confused about what degrees have to do with it (or for that matter, freedom). I don’t know if I have a good replacement term (independent dimensions? non-redundancy index? matrix rank?).
• True score: This term sounds like it refers to the Aristotelian truth when in fact it is merely the long-term average score if there were no carryover effects of repeated measurement. Thus, a person’s true score on one IQ test might be quite different from the same person’s true score on another IQ test. Neither true score refers to the person’s “true cognitive ability.” To avoid this confusion, I would prefer something like the individual expected value, or IEV for short.
• Reliability: In typical usage, reliability refers to morally desirable traits such as trustworthiness and truthfulness. When statisticians refer to the reliability of scores or experimental results, to the untrained ear it probably sounds like we are talking about validity. I would prefer to talk about stability, consistency, or precision instead.

I am sure that there are many more!

Standard

# How common is it to have no academic weaknesses?

I’m afraid that the question posed by the title does not have a single answer. It depends on how we define and measure academic performance.

Let’s sidestep some difficult questions about what exactly an “academic deficit” is and for the sake of convenience pretend that it is a score at least 1 standard deviation below the mean on a well normed test administered by a competent psychologist with good clinical skills.

Suppose that we start with the 9 core WJ III achievement tests (the answers will not be all that different with the new WJ IV):

What is the percentage of the population that does not have any score below 85? If we can assume that the scores are multivariate normal, the answer can be found using data simulation or via the cumulative density function of the multivariate normal distribution. I gave examples of both methods in the previous post. If we use the correlation matrix for the 6 to 9 age group of the WJ III NU, about 47% of the population has no academic scores below 85.

Using the same methods we can estimate what percent of the population has no academic scores below various thresholds. Subtracting these numbers from 100%, we can see that fairly large proportions have at least one low score.

# What proportion of people with average cognitive scores have no academic weaknesses?

The numbers in the table above include people with very low cognitive ability. It would be more informative if we could control for a person’s measured cognitive abilities.

Suppose that an individual has index scores of exactly 100 for all 14 subtests that are used to calculate the WJ III GIA Extended. We can calculate the means and the covariance matrix of the achievement tests for all people with this particular cognitive profile. We will make use of the conditional multivariate normal distribution. As explained here (or here), we partition the academic tests $(\mathbf{X}_1)$ and the cognitive predictor tests $(\mathbf{X}_2)$ like so:

$\begin{pmatrix}\mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix}\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{pmatrix},\begin{pmatrix}\mathbf{\Sigma}_{11} & \mathbf{\Sigma}_{12} \\ \mathbf{\Sigma}_{21} & \mathbf{\Sigma}_{22}\end{pmatrix}\right)$

• $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are the mean vectors for the academic and cognitive variables, respectively.
• $\mathbf{\Sigma}_{11}$ and $\mathbf{\Sigma}_{22}$ are the covariances matrices of academic and cognitive variables, respectively.
• $\mathbf{\Sigma}_{12}$ is the matrix of covariances between the academic and cognitive variables.

If the cognitive variables have the vector of particular values $\mathbf{x}_2$, then the conditional mean vector of the academic variables $(\boldsymbol{\mu}_{1|2})$ is:

$\boldsymbol{\mu}_{1|2}=\boldsymbol{\mu}_1+\mathbf{\Sigma}_{12}\mathbf{\Sigma}^{-1}_{22}(\mathbf{x}_2-\boldsymbol{\mu}_2)$

The conditional covariance matrix:
$\mathbf{\Sigma}_{1|2}=\mathbf{\Sigma}_{11}-\mathbf{\Sigma}_{12}\mathbf{\Sigma}^{-1}_{22}\mathbf{\Sigma}_{21}$

If we can assume multivariate normality, we can use these equations, to estimate the proportion of people with no scores below any threshold on any set of scores conditioned on any set of predictor scores. In this example, about 51% of people with scores of exactly 100 on all 14 cognitive predictors have no scores below 85 on the 9 academic tests. About 96% of people with this cognitive profile have no scores below 70.

Because there is an extremely large number of possible cognitive profiles, I cannot show what would happen with all of them. Instead, I will show what happens with all of the perfectly flat profiles from all 14 cognitive scores equal to 70 to all 14 cognitive scores equal to 130.

Here is what happens with the same procedure when the threshold is 70 for the academic scores:

Here is the R code I used to perform the calculations. You can adapt it to other situations fairly easily (different tests, thresholds, and profiles).

library(mvtnorm)
WJ <- matrix(c(
1,0.49,0.31,0.46,0.57,0.28,0.37,0.77,0.36,0.15,0.24,0.49,0.25,0.39,0.61,0.6,0.53,0.53,0.5,0.41,0.43,0.57,0.28, #Verbal Comprehension
0.49,1,0.27,0.32,0.47,0.26,0.32,0.42,0.25,0.21,0.2,0.41,0.21,0.28,0.38,0.43,0.31,0.36,0.33,0.25,0.29,0.4,0.18, #Visual-Auditory Learning
0.31,0.27,1,0.25,0.33,0.18,0.21,0.28,0.13,0.16,0.1,0.33,0.13,0.17,0.25,0.22,0.18,0.21,0.19,0.13,0.25,0.31,0.11, #Spatial Relations
0.46,0.32,0.25,1,0.36,0.17,0.26,0.44,0.19,0.13,0.26,0.31,0.18,0.36,0.4,0.36,0.32,0.29,0.31,0.27,0.22,0.33,0.2, #Sound Blending
0.57,0.47,0.33,0.36,1,0.29,0.37,0.49,0.28,0.16,0.23,0.57,0.24,0.35,0.4,0.44,0.36,0.38,0.4,0.34,0.39,0.53,0.27, #Concept Formation
0.28,0.26,0.18,0.17,0.29,1,0.35,0.25,0.36,0.17,0.27,0.29,0.53,0.22,0.37,0.32,0.52,0.42,0.32,0.49,0.42,0.37,0.61, #Visual Matching
0.37,0.32,0.21,0.26,0.37,0.35,1,0.3,0.24,0.13,0.22,0.33,0.21,0.35,0.39,0.34,0.38,0.38,0.36,0.33,0.38,0.43,0.36, #Numbers Reversed
0.77,0.42,0.28,0.44,0.49,0.25,0.3,1,0.37,0.15,0.23,0.43,0.23,0.37,0.56,0.55,0.51,0.47,0.47,0.39,0.36,0.51,0.26, #General Information
0.36,0.25,0.13,0.19,0.28,0.36,0.24,0.37,1,0.1,0.22,0.21,0.38,0.26,0.26,0.33,0.4,0.28,0.27,0.39,0.21,0.25,0.32, #Retrieval Fluency
0.15,0.21,0.16,0.13,0.16,0.17,0.13,0.15,0.1,1,0.06,0.16,0.17,0.09,0.11,0.09,0.13,0.1,0.12,0.13,0.07,0.12,0.07, #Picture Recognition
0.24,0.2,0.1,0.26,0.23,0.27,0.22,0.23,0.22,0.06,1,0.22,0.35,0.2,0.16,0.22,0.25,0.21,0.19,0.26,0.17,0.19,0.21, #Auditory Attention
0.49,0.41,0.33,0.31,0.57,0.29,0.33,0.43,0.21,0.16,0.22,1,0.2,0.3,0.33,0.38,0.29,0.31,0.3,0.25,0.42,0.47,0.25, #Analysis-Synthesis
0.25,0.21,0.13,0.18,0.24,0.53,0.21,0.23,0.38,0.17,0.35,0.2,1,0.15,0.19,0.22,0.37,0.21,0.2,0.4,0.23,0.19,0.37, #Decision Speed
0.39,0.28,0.17,0.36,0.35,0.22,0.35,0.37,0.26,0.09,0.2,0.3,0.15,1,0.39,0.36,0.32,0.3,0.3,0.3,0.25,0.33,0.23, #Memory for Words
0.61,0.38,0.25,0.4,0.4,0.37,0.39,0.56,0.26,0.11,0.16,0.33,0.19,0.39,1,0.58,0.59,0.64,0.5,0.48,0.46,0.52,0.42, #Letter-Word Identification
0.6,0.43,0.22,0.36,0.44,0.32,0.34,0.55,0.33,0.09,0.22,0.38,0.22,0.36,0.58,1,0.52,0.52,0.47,0.42,0.43,0.49,0.36, #Passage Comprehension
0.53,0.36,0.21,0.29,0.38,0.42,0.38,0.47,0.28,0.1,0.21,0.31,0.21,0.3,0.64,0.52,0.58,1,0.5,0.49,0.46,0.47,0.49, #Spelling
0.5,0.33,0.19,0.31,0.4,0.32,0.36,0.47,0.27,0.12,0.19,0.3,0.2,0.3,0.5,0.47,0.48,0.5,1,0.44,0.41,0.46,0.36, #Writing Samples
0.41,0.25,0.13,0.27,0.34,0.49,0.33,0.39,0.39,0.13,0.26,0.25,0.4,0.3,0.48,0.42,0.65,0.49,0.44,1,0.38,0.37,0.55, #Writing Fluency
0.43,0.29,0.25,0.22,0.39,0.42,0.38,0.36,0.21,0.07,0.17,0.42,0.23,0.25,0.46,0.43,0.42,0.46,0.41,0.38,1,0.57,0.51, #Calculation
0.57,0.4,0.31,0.33,0.53,0.37,0.43,0.51,0.25,0.12,0.19,0.47,0.19,0.33,0.52,0.49,0.43,0.47,0.46,0.37,0.57,1,0.46, #Applied Problems
0.28,0.18,0.11,0.2,0.27,0.61,0.36,0.26,0.32,0.07,0.21,0.25,0.37,0.23,0.42,0.36,0.59,0.49,0.36,0.55,0.51,0.46,1), nrow= 23, byrow=TRUE) #Math Fluency
WJNames <- c("Verbal Comprehension", "Visual-Auditory Learning", "Spatial Relations", "Sound Blending", "Concept Formation", "Visual Matching", "Numbers Reversed", "General Information", "Retrieval Fluency", "Picture Recognition", "Auditory Attention", "Analysis-Synthesis", "Decision Speed", "Memory for Words", "Letter-Word Identification", "Passage Comprehension", "Reading Fluency", "Spelling", "Writing Samples", "Writing Fluency", "Calculation", "Applied Problems", "Math Fluency")
rownames(WJ) <- colnames(WJ) <- WJNames

#Number of tests
k<-length(WJNames)

#Means and standard deviations of tests
mu<-rep(100,k)
sd<-rep(15,k)

#Covariance matrix
sigma<-diag(sd)%*%WJ%*%diag(sd)
colnames(sigma)<-rownames(sigma)<-WJNames

#Vector identifying predictors (WJ Cog)
p<-seq(1,14)

#Threshold for low scores
Threshold<-85

#Proportion of population who have no scores below the threshold
pmvnorm(lower=rep(Threshold,length(WJNames[-p])),upper=rep(Inf,length(WJNames[-p])),sigma=sigma[-p,-p],mean=mu[-p])[1]

#Predictor test scores for an individual
x<-rep(100,length(p))
names(x)<-WJNames[p]

#Condition means and covariance matrix
condMu<-c(mu[-p] + sigma[-p,p] %*% solve(sigma[p,p]) %*% (x-mu[p]))
condSigma<-sigma[-p,-p] - sigma[-p,p] %*% solve(sigma[p,p]) %*% sigma[p,-p]

#Proportion of people with the same predictor scores as this individual who have no scores below the threshold
pmvnorm(lower=rep(Threshold,length(WJNames[-p])),upper=rep(Inf,length(WJNames[-p])),sigma=condSigma,mean=condMu)[1]


Standard

# How unusual is it to have multiple scores below a threshold?

In psychological assessment, it is common to specify a threshold at which a score is considered unusual (e.g., 2 standard deviations above or below the mean). If we can assume that the scores are roughly normal, it is easy to estimate the proportion of people with scores below the threshold we have set. If the threshold is 2 standard deviations below the mean, then the Excel function NORMSDIST will tell us the answer:

=NORMSDIST(-2)

=0.023

In R, the pnorm function gives the same answer:

pnorm(-2)

How unusual is it to have multiple scores below the threshold? The answer depends on how correlated the scores are. If we can assume that the scores are multivariate normal, Crawford and colleagues (2007) show us how to obtain reasonable estimates using simulated data. Here is a script in R that depends on the mvtnorm package. Suppose that the 10 subtests of the WAIS-IV have correlations as depicted below. Because the subtests have a mean of 10 and a standard deviation of 3, the scores are unusually low if 4 or lower.

#WAIS-IV subtest names
WAISSubtests <- c("BD", "SI", "DS", "MR", "VO", "AR", "SS", "VP", "IN", "CD")

# WAIS-IV correlations
WAISCor <- rbind(
c(1.00,0.49,0.45,0.54,0.45,0.50,0.41,0.64,0.44,0.40), #BD
c(0.49,1.00,0.48,0.51,0.74,0.54,0.35,0.44,0.64,0.41), #SI
c(0.45,0.48,1.00,0.47,0.50,0.60,0.40,0.40,0.43,0.45), #DS
c(0.54,0.51,0.47,1.00,0.51,0.52,0.39,0.53,0.49,0.45), #MR
c(0.45,0.74,0.50,0.51,1.00,0.57,0.34,0.42,0.73,0.41), #VO
c(0.50,0.54,0.60,0.52,0.57,1.00,0.37,0.48,0.57,0.43), #AR
c(0.41,0.35,0.40,0.39,0.34,0.37,1.00,0.38,0.34,0.65), #SS
c(0.64,0.44,0.40,0.53,0.42,0.48,0.38,1.00,0.43,0.37), #VP
c(0.44,0.64,0.43,0.49,0.73,0.57,0.34,0.43,1.00,0.34), #IN
c(0.40,0.41,0.45,0.45,0.41,0.43,0.65,0.37,0.34,1.00)) #CD
rownames(WAISCor) <- colnames(WAISCor) <- WAISSubtests

#Means
WAISMeans<-rep(10,length(WAISSubtests))

#Standard deviations
WAISSD<-rep(3,length(WAISSubtests))

#Covariance Matrix
WAISCov<-WAISCor*WAISSD%*%t(WAISSD)

#Sample size
SampleSize<-1000000

library(mvtnorm)

#Make simulated data
d<-rmvnorm(n=SampleSize,mean=WAISMeans,sigma=WAISCov)
#To make this more realistic, you can round all scores to the nearest integer (d<-round(d))

#Threshold for abnormality
Threshold<-4

#Which scores are less than or equal to threshold
Abnormal<- d<=Threshold

#Number of scores less than or equal to threshold
nAbnormal<-rowSums(Abnormal)

#Frequency distribution table
p<-c(table(nAbnormal)/SampleSize)

#Plot
barplot(p,axes=F,las=1,
xlim=c(0,length(p)*1.2),ylim=c(0,1),
bty="n",pch=16,col="royalblue2",
xlab="Number of WAIS-IV subtest scores less than or equal to 4",
ylab="Proportion")
axis(2,at=seq(0,1,0.1),las=1)
text(x=0.7+0:10*1.2,y=p,labels=formatC(p,digits=2),cex=0.7,pos=3,adj=0.5)

The code produces this graph:

# Using the multivariate normal distribution

The simulation method works very well, especially if the sample size is very large. An alternate method that gives more precise numbers is to estimate how much of the multivariate normal distribution is within certain bounds. That is, we find all of the regions of the multivariate normal distribution in which one and only one test is below a threshold and then add up all the probabilities. The process is repeated to find all regions in which two and only two tests are below a threshold. Repeat the process, with 3 tests, 4 tests, and so on. This is tedious to do by hand but only takes a few lines of code do automatically.

AbnormalPrevalance<-function(Cor,Mean=0,SD=1,Threshold){
require(mvtnorm)
k<-nrow(Cor)
p<-rep(0,k)
zThreshold<-(Threshold-Mean)/SD
for (n in 1:k){
combos<-combn(1:k,n)
ncombos<-ncol(combos)
for (i in 1:ncombos){
u<-rep(Inf,k)
u[combos[,i]]<-zThreshold
l<-rep(-Inf,k)
l[seq(1,k)[-combos[,i]]]<-zThreshold
p[n]<-p[n]+pmvnorm(lower=l,upper=u,mean=rep(0,k),sigma=Cor)[1]
}
}
p<-c(1-sum(p),p)
names(p)<-0:k

barplot(p,axes=F,las=1,xlim=c(0,length(p)*1.2),ylim=c(0,1),
bty="n",pch=16,col="royalblue2",
xlab=bquote("Number of scores less than or equal to " * .(Threshold)),
ylab="Proportion")
axis(2,at=seq(0,1,0.1),las=1)
return(p)
}
Proportions<-AbnormalPrevalance(Cor=WAISCor,Mean=10,SD=3,Threshold=4)

Using this method, the results are nearly the same but slightly more accurate. If the number of tests is large, the code can take a long time to run.

Standard
Cognitive Assessment

# MS Word Trick: Make your headings stay on the same page as the paragraph below

When I write psychological evaluation reports, I start with a template that has headings for the various sections. Until now, I always had to check the document before printing to make sure that no headings were alone on the last line of the page, with its accompanying paragraph on the next page. It did not take much time to fix the problems, but it was a pain to re-paginate the report if I made future edits. Its main cost was a bit of worry each time I finished a report.

All these years it never occurred to me to ask whether Microsoft engineers had anticipated this problem!

In Microsoft Word, there is an option to keep a paragraph on the same page as the next paragraph. I use Word 2010 for Windows so your experience might be slightly different. I select the heading, and click the format button in the Paragraph section.

Then click the Line and Page Breaks tab.

Then check the Keep with next box.

Now right-click Heading 1 on the Styles portion of the Home tab on the ribbon. Select Update Heading 1 to Match Selection.

Now everything you have marked as a Level 1 heading will stay with its accompanying paragraph. You can repeat the process for Level 2 and Level 3 headings, if needed.

I have now updated my template so that the headings behave properly.

A more thorough treatment of page breaks and other pagination tricks can be found here.

Standard

# Using the multivariate truncated normal distribution

In a previous post, I imagined that there was a gifted education program that had a strictly enforced selection procedure: everyone with an IQ of 130 or higher is admitted. With the (univariate) truncated normal distribution, we were able to calculate the mean of the selected group (mean IQ = 135.6).

# Multivariate Truncated Normal Distributions

Reading comprehension has a strong relationship with IQ $(\rho\approx 0.70)$. What is the average reading comprehension score among students in the gifted education program? If we can assume that reading comprehension is normally distributed $(\mu=100, \sigma=15)$ and the relationship between IQ and reading comprehension is linear $(\rho=0.70)$, then we can answer this question using the multivariate truncated normal distribution. Portions of the multivariate normal distribution have been truncated (sliced off). In this case, the blue portion of the bivariate normal distribution of IQ and reading comprehension has been sliced off. The portion remaining (in red), is the distribution we are interested in. Here it is in 3D:

Bivariate normal distribution truncated at IQ = 130

Here is the same distribution with simulated data points in 2D:

Expected values of IQ and reading comprehension when IQ ≥ 130

# Expected Values

In the picture above, the expected value (i.e., mean) for the IQ of the students in the gifted education program is 135.6. In the last post, I showed how to calculate this value.

The expected value (i.e., mean) for the reading comprehension score is 124.9. How is this calculated? The general method is fairly complicated and requires specialized software such as the R package tmvtnorm. However in the bivariate case with a single truncation, we can simply calculate the predicted reading comprehension score when IQ is 135.6:

$\dfrac{\hat{Y}-\mu_Y}{\sigma_Y}=\rho_{XY}\dfrac{X-\mu_X}{\sigma_X}$

$\dfrac{\hat{Y}-100}{15}=0.7\dfrac{135.6-100}{15}$

$\hat{Y}=124.9$

In R, the same answer is obtained via the tmvtnorm package:

library(tmvtnorm)
#Variable names

#Vector of Means
mu<-c(100,100)
names(mu)<-vNames;mu

#Vector of Standard deviations
sigma<-c(15,15)
names(sigma)<-vNames;sigma

#Correlation between IQ and Reading Comprehension
rho<-0.7

#Correlation matrix
R<-matrix(c(1,rho,rho,1),ncol=2)
rownames(R)<-colnames(R)<-vNames;R

#Covariance matrix
C<-diag(sigma)%*%R%*%diag(sigma)
rownames(C)<-colnames(C)<-vNames;C

#Vector of lower bounds (-Inf means negative infinity)
a<-c(130,-Inf)

#Vector of upper bounds (Inf means positive infinity)
b<-c(Inf,Inf)

#Means and covariance matrix of the truncated distribution
m<-mtmvnorm(mean=mu,sigma=C,lower=a,upper=b)
rownames(m$tvar)<-colnames(m$tvar)<-vNames;m

#Means of the truncated distribution
tmu<-m$tmean;tmu #Standard deviations of the truncated distribution tsigma<-sqrt(diag(m$tvar));tsigma

#Correlation matrix of the truncated distribution
tR<-cov2cor(m\$tvar);tR


In running the code above, we learn that the standard deviation of reading comprehension has shrunk from 15 in the general population to 11.28 in the truncated population. In addition, the correlation between IQ and reading comprehension has shrunk from 0.70 in the general population to 0.31 in the truncated population.

# Marginal cumulative distributions

Among the students in the gifted education program, what proportion have reading comprehension scores of 100 or less? The question can be answered with the marginal cumulative distribution function. That is, what proportion of the red truncated region is less than 100 in reading comprehension? Assuming that the code in the previous section has been run already, this code will yield the answer of about 1.3%:

#Proportion of students in the gifted program with reading comprehension of 100 or less
ptmvnorm(lowerx=c(-Inf,-Inf),upperx=c(Inf,100),mean=mu,sigma=C,lower=a,upper=b)

The mean, sigma, lower, and upper parameters define the truncated normal distribution. The lowerx and the upperx parameters define the lower and upper bounds of the subregion in question. In this case, there are no restrictions except an upper limit of 100 on the second axis (the Y-axis).

If we plot the cumulative distribution of reading comprehension scores in the gifted population, it is close to (but not the same as) that of the conditional distribution of reading comprehension at IQ = 135.6.

Marginal cumulative distribution function of the truncated bivariate normal distribution

# What proportion does the truncated distribution occupy in the untruncated distribution?

Imagine that in order to qualify for services for intellectual disability, a person must score 70 or below on an IQ test. Every three years, the person must undergo a re-evaluation. Suppose that the correlation between the original test and the re-evaluation test is $\rho=0.90$. If the entire population were given both tests, what proportion would score 70 or lower on both tests? What proportion would score below 70 on the first test but not on the second test? Such questions can be answered with the pmvnorm function from the mvtnorm package (which is a prerequiste of the tmvtnorm package and this thus already loaded if you ran the previous code blocks).

library(mvtnorm)
#Means
IQmu<-c(100,100)

#Standard deviations
IQsigma<-c(15,15)

#Correlation
IQrho<-0.9

#Correlation matrix
IQcor<-matrix(c(1,IQrho,IQrho,1),ncol=2)

#Covariance matrix
IQcov<-diag(IQsigma)%*%IQcor%*%diag(IQsigma)

#Proportion of the general population scoring 70 or less on both tests
pmvnorm(lower=c(-Inf,-Inf),upper=c(70,70),mean=IQmu,sigma=IQcov)

#Proportion of the general population scoring 70 or less on the first test but not on the second test
pmvnorm(lower=c(-Inf,70),upper=c(70,Inf),mean=IQmu,sigma=IQcov)

What are the means of these truncated distributions?

#Mean scores among people scoring 70 or less on both tests
mtmvnorm(mean=IQmu,sigma=IQcov,lower=c(-Inf,-Inf),upper=c(70,70))

#Mean scores among people scoring 70 or less on the first test but not on the second test
mtmvnorm(mean=IQmu,sigma=IQcov,lower=c(-Inf,70),upper=c(70,Inf))


Combining this information in a plot:

Thus, we can see that the multivariate truncated normal distribution can be used to answer a wide variety of questions. With a little creativity, we can apply it to many more kinds of questions.

Standard

# Using the truncated normal distribution

The term truncated normal distribution may sound highly technical but it is actually fairly simple and has many practical applications. If the math below is daunting, be assured that it is not necessary to understand the notation and the technical details. I have created a user-friendly spreadsheet that performs all the calculations automatically.

# The mean of a truncated normal distribution

Imagine that your school district has a gifted education program. All students in the program have an IQ of 130 or higher. What is the average IQ of this group? Assume that in your school district, IQ is normally distributed with a mean of 100 and a standard deviation of 15.

Questions like this one can be answered by calculating the mean of the truncated normal distribution. The truncated normal distribution is a normal distribution in which one or both ends have been sliced off (i.e., truncated). In this case, everything below 130 has been sliced off (and there is no upper bound).

Four parameters determine the properties of the truncated normal distribution:

μ = mean of the normal distribution (before truncation)
σ = standard deviation of the normal distribution (before truncation)
a = the lower bound of the distribution (can be as low as −∞)
b = the upper bound of the distribution (can be as high as +∞)

The formula for the mean of a truncated distribution is a bit of a mess but can be simplified by finding the z-scores associated with the lower and upper bounds of the distribution:

$z_a=\dfrac{a-\mu}{\sigma}$

$z_b=\dfrac{b-\mu}{\sigma}$

The expected value of the truncated distribution (i.e., the mean):
$E(X)=\mu+\sigma\dfrac{\phi(z_a)-\phi(z_b)}{\Phi(z_b)-\Phi(z_a)}$

Where $\phi$ is the probability density function of the standard normal distribution (NORMDIST(z,0,1,FALSE) in Excel, dnorm(z) in R) and $\Phi$ is the cumulative distribution function of the standard normal distribution (NORMSDIST(z) in Excel, pnorm(z) in R).

This spreadsheet calculates the mean (and standard deviation) of a truncated distribution. See the part below the plot that says “Truncated Normal Distribution.”

In R you could make a function to calculate the mean of a truncated distribution like so:

MeanNormalTruncated<-function(mu=0,sigma=1,a=-Inf,b=Inf){
mu+sigma*(dnorm((a-mu)/sigma)-dnorm((b-mu)/sigma))/(pnorm((b-mu)/sigma)-pnorm((a-mu)/sigma))
}

#Example: Find the mean of a truncated normal distribution with a mu = 100, sigma = 15, and lower bound = 130
MeanNormalTruncated(mu=100,sigma=15,a=130)

# The cumulative distribution function of the truncated normal distribution

Suppose that we wish to know the proportion of students in the same gifted education program who score 140 or more. The cumulative truncated normal distribution function tells us the proportion of the distribution that is less than a particular value.

$cdf=\dfrac{\Phi(z_x)-\Phi(z_a)}{\Phi(z_b)-\Phi(z_a)}$

Where $z_x = \dfrac{X-\mu}{\sigma}$

In the previously mentioned spreadsheet, the cumulative distribution function is the proportion of the shaded region that is less than the value you specify.

You can create your own cumulative distribution function for the truncated normal distribution in R like so:

cdfNormalTruncated<-function(x=0,mu=0,sigma=1,a=-Inf,b=Inf){
(pnorm((x-mu)/sigma)-pnorm((a-mu)/sigma))/(pnorm((b-mu)/sigma)-pnorm((a-mu)/sigma))
}
#Example: Find the proportion of the distribution less than 140
cdfNormalTruncated(x=140,mu=100,sigma=15,a=130)

In this case, the cumulative distribution function returns approximately 0.8316. Subtracting from 1, gives the proportion of scores 140 and higher: 0.1684. This means that about 17% of students in the gifted program can be expected to have IQ scores of 140 or more.1

# The truncated normal distribution in R

A fuller range of functions related to the truncated normal distribution can be found in the truncnorm package in R, including the expected value (mean), variance, pdf, cdf, quantile, and random number generation functions.

1 In the interest of precision, I need to say that because IQ scores are rounded to the nearest integer, a slight adjustment needs to be made. The true lower bound of the truncated distribution is not 130 but 129.5. Furthermore, we want the proportion of scores 139.5 and higher, not 140 and higher. This means that the expected proportion of students with IQ scores of “140” and higher in the gifted program is about 0.1718 instead of 0.1684. Of course, there is little difference between these estimates and such precision is not usually needed for “back-of-the-envelope” estimates such as this one.
Standard

# Beyond IQ, a new blog about psychological assessment

Beyond IQ is an excellent new blog about psychological assessment. It was created about a month ago by Smadar Sapir Yogev, an educational psychologist working in Jerusalem.

Smadar posts in both Hebrew and in English on a wide range of assessment-related topics. She has created an excellent nine-part series on CHC theory, of which six parts have been published. My favorite in the series so far is the presentation on long-term memory, which presents CHC theory memory abilities alongside other aspects of memory processes identified by cognitive psychologists.

I look forward to reading Smadar’s future posts on a diverse range of topics.

Standard

# IQ, the death penalty, and me

Today the Supreme Court of the United States ruled that, in death penalty cases, the state of Florida must take into account the inherent imprecision of IQ tests.

Why are IQ tests used in death penalty cases? It is unconstitutional to execute a person deemed to be intellectually disabled (Intellectual disability is the current term for what was previously known as mental retardation.). Diagnosing intellectual disabilities is a complex matter but the diagnosis hinges to a large degree on the person’s performance on a well-constructed IQ test. Although high-quality IQ tests are more reliable than most psychological measures, even the best IQ tests are imperfectly precise. There is a potentially large risk that a person with an observed score slightly above the threshold set by Florida law may have a “true score” that is below the threshold.

It was an unexpected honor to have my work cited in both the court’s decision (written by Justice Kennedy) and the dissenting opinion (written by Justice Alito). My contribution to the argument (relevant portion reproduced here) is a technical one and played an admittedly small role in the proceedings . My main point was that when multiple IQ tests have been administered to the same individual, we should not average the scores but make them into a composite score in the same way that we combine psychological scores in any other context. Doing so gives a more accurate estimate of the IQ and a smaller confidence interval around the score. I hope that the application of this procedure results in fewer incorrect decisions and a fairer administration of justice.

I am grateful to Cecil Reynolds for giving me the opportunity to write the paper and to Kevin McGrew for encouraging me to re-write and publish the argument on the web, demonstrating its application to death penalty cases. Although it was the published chapter that was cited and used by the defense, it was the free web version that initially caught the attention of the law firm representing the defendant.

Standard
Psychometrics

# Reliability is where the light is. Validity is where the keys are.

There is a tired old joke about the drunk who lost his keys on the dark side of the street but is looking for them under the lamppost because “That’s where the light is.”

Reliability is where the light is. Validity is where the keys are.

Reliability is relatively easy to estimate compared to validity. Researchers and test developers make a very big deal out of high reliability coefficients because “A test cannot be valid if it is not reliable.” However, the fact that a measure is highly reliable is irrelevant if it does not allow us to make accurate inferences about the thing we wish to measure. Furthermore, if a measure is shown to have validity, its reliability is already implied.

To switch metaphors, reliability is thin gruel if validity is on the table. I think that with good models such as those offered by Weiss and colleagues (2013), validity is at least on the menu, if not already laid out for the feast. Reliability is at best an appetizer. It is nice to have, but if the main course is ample, you can skip it without worries.

This post is an excerpt from:

Schneider, W. J. (2013). What if we took our models seriously? Estimating latent scores in individuals. Journal of Psychoeducational Assessment, 31, 186–201.

Standard