I have made an easy-to-use Excel spreadsheet that can simulate data according to a latent structure that you specify. You do not need to know anything about R but you’ll need to install it. RStudio is not necessary but it makes life easier. In this video tutorial, I explain how to use the spreadsheet.

This project is still “in beta” so there may still be errors in it. If you find any, let me know.

If you need something that has more features and is further along in its development cycle, consider simulating data with the R package simsem.

A new study by MacCann, Joseph, Newman, and Roberts (2013) about the place of emotional intelligence in CHC Theory is worth reading. I highlight some of its findings and discuss other matters in this video:

Conditional normal distributions are really useful in psychological assessment. We can use them to answer questions like:

How unusual is it for someone with a vocabulary score of 120 to have a score of 90 or lower on reading comprehension?

If that person also has a score of 80 on a test of working memory capacity, how much does the risk of scoring 90 or lower on reading comprehension increase?

Suppose that variable Y represents reading comprehension test scores. Here is a fancy way of saying Y is normally distributed with a mean of 100 and a standard deviation of 15:

Y ~ N(100, 15^{2})

In this notation, “~” means is distributed as, and N means normally distributed with a particular mean (μ) and variance (σ^{2}).

If we know literally nothing about a person from this population, our best guess is that the person’s reading comprehension score is at the population mean. One way to say this is that the person’s expected value on reading comprehension is the population mean:

E(Y) = μ_{Y} = 100

The 95% confidence interval around this guess is:

100 ± 1.96 × 15 ≈ 100 ± 29.4, or roughly 70.6 to 129.4

Unconditional Normal Distribution with 95% CI
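
The interval can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are mine, and the mean of 100 and standard deviation of 15 are the values assumed in the text.

```python
from statistics import NormalDist

# Population distribution of reading comprehension scores (values from the text).
reading = NormalDist(mu=100, sigma=15)

# z-value that cuts off the central 95% of a normal distribution (about 1.96).
z_95 = NormalDist().inv_cdf(0.975)

lower = reading.mean - z_95 * reading.stdev
upper = reading.mean + z_95 * reading.stdev

print(f"Best guess: {reading.mean:.0f}")
print(f"95% interval: {lower:.1f} to {upper:.1f}")
```

With no other information about the person, the interval is wide: it covers almost 60 points.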

Conditional Normal Distributions

Simple Linear Regression

Now, suppose that we know one thing about the person: the person’s score on a vocabulary test. We can let X represent the vocabulary score, and its distribution is the same as that of Y:

X ~ N(100, 15^{2})

If we know that this person scored 120 on vocabulary (X), what is our best guess as to what the person scored on reading comprehension (Y)? This guess is a conditional expected value. It is “conditional” in the sense that the expected value of Y depends on what value X has. The pipe symbol “|” is used to note a condition like so:

E(Y | X = 120)

This means, “What is our best guess for Y if X is 120?”

What if we don’t want to be specific about the value of X but want to refer to any particular value of X? Oddly enough, it is traditional to use the lowercase x for that. So, X refers to the variable as a whole and x refers to any particular value of variable X. So if I know that variable X happens to be a particular value x, the expected value of Y is:

E(Y | X = x) = μ_{Y} + ρ_{XY} (σ_{Y} / σ_{X}) (x − μ_{X})

where ρ_{XY} is the correlation between X and Y.

You might recognize that this is a linear regression formula and that:

Ŷ = E(Y | X = x)

where “Y-hat” (Ŷ) is the predicted value of Y when X is known.

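Let’s assume that the relationship between X and Y is bivariate normal like in the image at the top of the post:

(X, Y) ~ N(μ, Σ), with μ = [μ_{X}, μ_{Y}] and Σ =
[ σ_{X}^{2}           ρ_{XY}σ_{X}σ_{Y} ]
[ ρ_{XY}σ_{X}σ_{Y}    σ_{Y}^{2}        ]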

The first term in the parentheses is the vector of means and the second term (the square matrix in the brackets) is the covariance matrix of X and Y. It is not necessary to understand the notation. The main point is that X and Y are both normal, they have a linear relationship, and the conditional variance of Y at any value of X is the same.
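
The claim that the spread of Y is the same at every value of X can be checked by simulation. The sketch below is my own illustration, not part of the spreadsheet mentioned elsewhere on this blog. It assumes a correlation of 0.5, a mean of 100, and a standard deviation of 15 for both variables, and the seed, sample size, and function names are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the run is repeatable

MU, SIGMA, RHO = 100.0, 15.0, 0.5  # assumed values for this illustration


def simulate_pair():
    """Draw one (X, Y) pair from a bivariate normal distribution."""
    zx = random.gauss(0, 1)
    # Build a z-score for Y that correlates RHO with zx.
    zy = RHO * zx + (1 - RHO ** 2) ** 0.5 * random.gauss(0, 1)
    return MU + SIGMA * zx, MU + SIGMA * zy


pairs = [simulate_pair() for _ in range(200_000)]


def conditional_summary(x_value, half_width=1.0):
    """Mean and SD of Y among simulated people whose X is near x_value."""
    ys = [y for x, y in pairs if abs(x - x_value) <= half_width]
    return statistics.mean(ys), statistics.stdev(ys)


for x_value in (80, 100, 120):
    mean_y, sd_y = conditional_summary(x_value)
    print(f"X near {x_value}: mean Y = {mean_y:.1f}, SD of Y = {sd_y:.1f}")
```

The conditional means should move with X while the conditional standard deviations stay roughly equal, which is what a bivariate normal relationship implies.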

The conditional standard deviation of Y at any particular value of X is:

σ_{Y|X=x} = σ_{Y} √(1 − ρ_{XY}^{2})

This is the standard deviation of the conditional normal distribution. In the language of regression, it is the standard error of the estimate (σ_{e}). It is the standard deviation of the residuals (errors). Residuals are simply the amount by which your guesses differ from the actual values.

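So, the conditional distribution of Y at X = x is:

Y | X = x ~ N(μ_{Y} + ρ_{XY} (σ_{Y} / σ_{X}) (x − μ_{X}), σ_{Y}^{2} (1 − ρ_{XY}^{2}))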

Putting all this together, we can answer our question:

How unusual is it for someone with a vocabulary score of 120 to have a score of 90 or lower on reading comprehension?

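The expected value of Y (Ŷ) is:

Ŷ = μ_{Y} + ρ_{XY} (σ_{Y} / σ_{X}) (120 − μ_{X}) = 100 + ρ_{XY} (15 / 15) (120 − 100) = 100 + 20ρ_{XY}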

Suppose that the correlation is 0.5. Therefore,

Ŷ = 100 + 20 × 0.5 = 110

This means that among all the people with a vocabulary score of 120, the average reading comprehension score is 110. Now, how far off from that is 90?

e = Y − Ŷ = 90 − 110 = −20

What is the standard error of the estimate?

σ_{e} = σ_{Y} √(1 − ρ_{XY}^{2}) = 15 × √(1 − 0.5^{2}) = 15 × √0.75 ≈ 12.99

Dividing the residual by the standard error of the estimate (the standard deviation of the conditional normal distribution) gives us a z-score. It represents how far from expectations this individual is in standard deviation units.

z = e / σ_{e} = −20 / 12.99 ≈ −1.54

Using the standard normal cumulative distribution function (Φ) gives us the proportion of people scoring 90 or less on reading comprehension (given a vocabulary score of 120).

Φ(−1.54) ≈ 0.06, or about 6% of people with a vocabulary score of 120

In Microsoft Excel, the standard normal cumulative distribution function is NORMSDIST. Thus, entering this into any cell will give the answer:

=NORMSDIST(-1.54)

Conditional normal distribution when Vocabulary = 120
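
For readers who would rather not use Excel, here is the same calculation sketched in Python with only the standard library. The function name conditional_normal and its default arguments are my own choices for illustration; the defaults match the means of 100 and standard deviations of 15 used above.

```python
from math import sqrt
from statistics import NormalDist


def conditional_normal(x, rho, mu_x=100.0, sd_x=15.0, mu_y=100.0, sd_y=15.0):
    """Return the conditional distribution of Y given X = x (bivariate normal)."""
    expected_y = mu_y + rho * (sd_y / sd_x) * (x - mu_x)
    see = sd_y * sqrt(1 - rho ** 2)  # standard error of the estimate
    return NormalDist(mu=expected_y, sigma=see)


# Vocabulary = 120, correlation = 0.5, as in the worked example.
y_given_x = conditional_normal(x=120, rho=0.5)

z = (90 - y_given_x.mean) / y_given_x.stdev  # how unusual is a score of 90?
p = y_given_x.cdf(90)                        # proportion scoring 90 or lower

print(f"Expected reading comprehension: {y_given_x.mean:.0f}")
print(f"Standard error of the estimate: {y_given_x.stdev:.2f}")
print(f"z = {z:.2f}, proportion at or below 90 = {p:.3f}")
```

Calling cdf on the conditional distribution gives the same answer as converting to a z-score first and applying Φ.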

Multiple Regression

What proportion of people score 90 or less on reading comprehension if their vocabulary is 120 but their working memory capacity is 80?

Let’s call vocabulary X_{1} and working memory capacity X_{2}. Let’s suppose they are correlated at 0.3. The correlation matrix among the predictors (R_{X}) is:

R_{X} =
[ 1.0  0.3 ]
[ 0.3  1.0 ]

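The validity coefficients are the correlations of Y with each predictor (R_{XY}). Vocabulary keeps its correlation of 0.5 with reading comprehension, and working memory capacity is taken to correlate 0.4 with it:

R_{XY} = [0.5, 0.4]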

The standardized regression coefficients (β) are:

β = R_{X}^{-1} R_{XY} ≈ [0.4176, 0.2747]

Unstandardized coefficients can be obtained by multiplying each standardized coefficient by the standard deviation of Y (σ_{Y}) and dividing by the standard deviation of the corresponding predictor (σ_{X_{i}}):

b_{i} = β_{i} σ_{Y} / σ_{X_{i}}

However, in this case all the variables have the same metric and thus the unstandardized and standardized coefficients are the same.

The vector of predictor means (μ_{X}) is used to calculate the intercept (b_{0}):

b_{0} = μ_{Y} − (b_{1} μ_{X_{1}} + b_{2} μ_{X_{2}}) = 100 − (0.4176 × 100 + 0.2747 × 100) ≈ 30.77

The predicted score when vocabulary is 120 and working memory capacity is 80 is:

Ŷ = b_{0} + b_{1} × 120 + b_{2} × 80 = 30.77 + 0.4176 × 120 + 0.2747 × 80 ≈ 102.9

The error in this case is 90 − 102.9 = −12.9.

The multiple R^{2} is calculated with the standardized regression coefficients and the validity coefficients.

R^{2} = β_{1} ρ_{X_{1}Y} + β_{2} ρ_{X_{2}Y} = 0.4176 × 0.5 + 0.2747 × 0.4 ≈ 0.319

The standard error of the estimate is thus:

σ_{e} = σ_{Y} √(1 − R^{2}) = 15 × √(1 − 0.319) ≈ 12.38

The proportion of people with vocabulary = 120 and working memory capacity = 80 who score 90 or less is:

Φ(−12.9 / 12.38) ≈ Φ(−1.04) ≈ 0.15
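
The whole multiple regression example can be sketched in plain Python. Please treat this as an illustration rather than a definitive implementation: the variable names are mine, and the correlation of 0.4 between working memory capacity and reading comprehension is an assumption, chosen because it reproduces the predicted score of 102.9 given above.

```python
from math import sqrt
from statistics import NormalDist

# Correlations. 0.3 and 0.5 come from the text; 0.4 is an ASSUMED value
# for the working memory validity coefficient.
r12 = 0.3  # vocabulary with working memory capacity
ry1 = 0.5  # reading comprehension with vocabulary
ry2 = 0.4  # reading comprehension with working memory capacity (assumption)

mu, sd = 100.0, 15.0   # every variable is on the same metric
x1, x2 = 120.0, 80.0   # observed vocabulary and working memory scores
cutoff = 90.0

# Standardized coefficients: solve R_X * beta = R_XY for the 2 x 2 case.
det = 1 - r12 ** 2
beta1 = (ry1 - r12 * ry2) / det
beta2 = (ry2 - r12 * ry1) / det

# Unstandardized coefficients; the SD ratio is 1 here, so b equals beta.
b1 = beta1 * sd / sd
b2 = beta2 * sd / sd
b0 = mu - (b1 * mu + b2 * mu)  # intercept from the predictor means

predicted = b0 + b1 * x1 + b2 * x2
r_squared = beta1 * ry1 + beta2 * ry2
see = sd * sqrt(1 - r_squared)  # standard error of the estimate
p = NormalDist(mu=predicted, sigma=see).cdf(cutoff)

print(f"beta = ({beta1:.4f}, {beta2:.4f}), intercept = {b0:.2f}")
print(f"Predicted score = {predicted:.1f}, R squared = {r_squared:.3f}")
print(f"Standard error of the estimate = {see:.2f}")
print(f"Proportion scoring {cutoff:.0f} or less = {p:.3f}")
```

With more than two predictors, the hand-written 2 x 2 solution would need to be replaced by a general matrix solver.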

One of the most widely misunderstood statistical concepts is regression to the mean. In this video tutorial, I address common false beliefs about regression to the mean and answer the following questions:

What is regression to the mean?

Do variables become less variable each time they are measured?

Does regression to the mean happen all the time or just in certain situations?

Does repeated testing cause people to come closer and closer to the mean?

How is regression to the mean relevant in death penalty cases?

How to estimate latent scores in individuals when there is a known structural model:

I wrote a commentary in a special issue of the Journal of Psychoeducational Assessment. My article proposes a new way to interpret cognitive profiles. The basic idea is to use the best available latent variable model of the tests and then estimate an individual’s latent scores (with confidence intervals around those estimates). I have made two spreadsheets available, one for the WISC-IV and one for the WAIS-IV.

I decided not to provide a spreadsheet for the five-factor model of the WAIS-IV because Gf and g were so highly correlated in that model that it would be nearly impossible to distinguish between Gf and g in individuals. You can think of Gf and g as nearly synonymous (at the latent level).

In this tutorial I discuss and present visual representations of covariance. Although covariance is not directly informative, it is a fundamental ingredient in almost all of the most frequently used statistical procedures.

In this video tutorial, I explain why we have standard scores, why there are so many different kinds of standard scores, and how to convert between any two types of standard scores.
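
The conversion itself is a two-step calculation: turn the score into a z-score, then rescale it to the new metric. Here is a minimal sketch in Python. The function name convert_score is my own, and the metrics in the example (index scores with mean 100 and SD 15, T-scores with mean 50 and SD 10, scaled scores with mean 10 and SD 3) are the usual conventions rather than anything specific to the video.

```python
def convert_score(score, from_mean, from_sd, to_mean, to_sd):
    """Convert a standard score from one metric to another via the z-score."""
    z = (score - from_mean) / from_sd  # distance from the mean in SD units
    return to_mean + to_sd * z         # same distance, expressed in the new metric


# An index score of 120 (mean 100, SD 15) expressed on other metrics.
index_score = 120
t_score = convert_score(index_score, 100, 15, 50, 10)
scaled_score = convert_score(index_score, 100, 15, 10, 3)

print(f"Index score {index_score} = T-score {t_score:.1f} = scaled score {scaled_score:.1f}")
```

Because every standard score is a linear transformation of z, any two metrics can be linked this way.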