# What if we took our models seriously? Slides from my NASP 2014 talk

WISC-IV Five-Factor Model

I was part of a symposium last week at NASP on the factor structure of the WISC-IV and WAIS-IV. It was organized and moderated by Renée Tobin, who edited the special issue of the Journal of Psychoeducational Assessment on the same topic. The other presenters were Larry Weiss, Tim Keith, Gary Canivez, Joe Kush, Dawn Flanagan, and Vinny Alfonso.

The slides for my talk are here. The text for the talk is written in the “Notes” for each slide.

I have spoken at length on this topic in my article on the special issue and in a companion video.

The spreadsheets that accompany the article are here:


# Communicate with percentile ranks…but think and reason with standard scores

Every once in a while, I run into a college student who calculates very basic math facts (e.g., 8 + 5 = 13) by counting on his or her fingers. This method, of course, works perfectly well. Unfortunately, the student who relies upon it is doomed never to master algebra. The act of counting uses up most of the storage space in working memory, often causing miscalculations and attentional lapses (e.g., forgetting where one is in the problem-solving process).

Something similar happens with some professionals who perform psychological assessments. Using percentile ranks is a perfectly reasonable way to communicate where someone’s score falls in a distribution. However, if you think and reason about test scores in terms of percentiles, you will never master the finer points of test interpretation.

The problem is this: No one understands our units of measurement! Therefore we must convert our scores to percentile ranks, which are much easier to understand. This is partly our own fault. Maybe the public would, in time, come to understand our numbers if we used just one kind of measurement unit instead of our usual awful mixture of z-scores, stanines, stens, scaled scores, T-scores, index scores, and normal curve equivalents. Elsewhere I have defended the practice of using different units of measurement. From the ivory tower, the Tower of Babel looks magnificent! From the ground, it looks like a big mess!

Unfortunately, standard scores are just as unfamiliar to new graduate students as they are to the public. A certain percentage of them will persist in using percentile ranks as they reason with test scores. This will work reasonably well in most cases but at the extremes (where precision matters most), it can cause interpretative errors.

Percentiles are not easily compared. When we look at scores that differ considerably within a profile, percentiles can make large differences look small and small differences look large. Consider how much space there is in the normal distribution between the 1st and 2nd percentile and how little there is between the 50th and 51st percentile. The meaning of percentiles is similarly inconsistent with other distributions (other than the uniform distribution).

Percentiles in a standard normal distribution

Comparing scores is just the beginning of the problem with percentile ranks. Almost every kind of calculation (e.g., to predict performance) must be done with standard scores, not with percentile ranks. Many calculations are done rapidly (and approximately) in our heads. Mastering the art of rapid and fluent test interpretation requires the ability to think in terms of standard scores. If you think about scores in terms of percentile ranks, you are counting on your fingers.
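To make the unevenness of percentiles concrete, here is a minimal Python sketch. It assumes index scores (mean 100, SD 15) that are exactly normal; the helper name `score_at_percentile` is just illustrative.

```python
from statistics import NormalDist

# Assumption: scores follow a normal distribution with mean 100 and SD 15.
iq = NormalDist(mu=100, sigma=15)

def score_at_percentile(p):
    """Index score at percentile p (0 < p < 100)."""
    return iq.inv_cdf(p / 100)

# Width of one percentile step in the tail versus in the middle.
low_gap = score_at_percentile(2) - score_at_percentile(1)
mid_gap = score_at_percentile(51) - score_at_percentile(50)

print(f"1st to 2nd percentile:   {low_gap:.2f} points")
print(f"50th to 51st percentile: {mid_gap:.2f} points")
```

Under these assumptions, one percentile step in the tail spans roughly 4 index-score points, whereas one step at the median spans well under 1 point.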


# Why averaging multiple IQ scores is incorrect in death penalty cases

As I have explained elsewhere on this blog, when a person has been given multiple IQ tests, it is common practice to take the mean IQ or median IQ to determine eligibility for the death penalty. As long as all the scores are valid estimates, combining multiple scores results in more accurate measurement.

Unfortunately, taking the mean or median IQ score is one of those solutions that is simple, neat, and wrong. Why? In the graph below, there are two IQ tests that correlate at 0.9. On each test, the population mean is μ = 100 and the standard deviation is σ = 15. On either test alone, about 2.3% of people score 70 or less, the typical threshold at which a person is ineligible for the death penalty.

What percent of people score 70 or less on the average of the 2 tests? About 2%. Why is it 2% instead of 2.3%? The smaller number occurs because the tests, though highly correlated, are not perfectly correlated. The average of the 2 tests has population mean of μ = 100 but its standard deviation is smaller than 15. In this case, the standard deviation is σ = 14.62. The fact that the standard deviation of the average of two scores is smaller results in fewer people below the threshold of 70 than is the case if just one test had been given.
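The numbers in this paragraph can be checked with a short Python sketch. It assumes both tests are normal with mean 100 and SD 15 and correlate at 0.9; the function name `sd_of_mean` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def sd_of_mean(sd, r, k=2):
    """SD of the average of k equally scaled scores that all intercorrelate at r."""
    # Sum of a k x k correlation matrix: k ones on the diagonal, r everywhere else.
    sum_of_correlations = k + k * (k - 1) * r
    return sd * sqrt(sum_of_correlations) / k

sd_avg = sd_of_mean(15, 0.9)                  # SD of the simple average
p_single = NormalDist(100, 15).cdf(70)        # proportion at or below 70 on one test
p_average = NormalDist(100, sd_avg).cdf(70)   # proportion at or below 70 on the average

print(f"SD of the average: {sd_avg:.2f}")
print(f"P(single test <= 70): {p_single:.3f}")
print(f"P(average <= 70):     {p_average:.3f}")
```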

There is an established procedure for rescaling a composite score so that it has the correct mean and standard deviation. It is the same procedure that was applied to the IQ subtest scores in the calculation of the full scale IQ. This same procedure should be applied when multiple IQ scores have been given.

Assuming that all the IQ scores have a mean of μ = 100 and a standard deviation of σ = 15, the composite IQ of k scores is:

$\text{Composite IQ}=\dfrac{\text{Sum of the IQ scores}-100k}{\sqrt{\text{Sum of the correlation matrix}}}+100$

In the graph above, the diagonal axis represents the composite IQ with the proper scaling so that the composite IQ has a mean of 100 and a standard deviation of 15 (instead of 14.62). As stated previously, if the 2 IQ tests were simply averaged, only about 2.0% score 70 or less. On a properly scaled IQ score, 2.0% corresponds to an IQ of 69.

Does 1 point matter? It does to the person who on average scored 71 on the 2 IQ tests. That person, with the score properly rescaled, would have a composite IQ of 70 and thus would be deemed ineligible for execution.
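The rescaling formula is easy to sketch in Python. This assumes every score is on the IQ metric (mean 100, SD 15) and that the correlation matrix among the tests is known; `composite_iq` is an illustrative name, not a function from any package.

```python
from math import sqrt

def composite_iq(scores, correlations, mean=100.0):
    """Rescaled composite of k IQ scores.

    scores:       list of IQ scores, all with the same mean and SD.
    correlations: full k x k correlation matrix (list of lists), 1s on the diagonal.
    """
    k = len(scores)
    sum_of_correlations = sum(sum(row) for row in correlations)
    return (sum(scores) - mean * k) / sqrt(sum_of_correlations) + mean

# Two tests correlating at 0.9, with a score of 71 on each.
r = [[1.0, 0.9],
     [0.9, 1.0]]
composite = composite_iq([71, 71], r)
print(f"Composite IQ: {composite:.2f} (rounds to {round(composite)})")
```

The simple average of the two scores is 71, but the rescaled composite is slightly more extreme because the average of two imperfectly correlated scores has a smaller standard deviation than either score alone.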

Your intuition might be telling you that something is fishy about all this. Does this mean that whenever someone scores 71 on an IQ test, just missing the threshold, that another test should be given, resulting in another score of 71 so that the composite score is 70? The answer is that your intuition (and mine) is often unreliable when it comes to probability. As I have explained in this video, most people who score 71 on one IQ test score higher than 71 on a second IQ test. As long as all the scores are properly rescaled, the composite IQ is more accurate and nothing fishy is happening.
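The claim about retesting can be illustrated with a rough Python sketch. It assumes the two tests are bivariate normal with a correlation of 0.9 (the value used in the graph above) and ignores practice effects; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def prob_higher_on_second_test(first_score, r, mean=100.0, sd=15.0):
    """P(second score > first score), given the first score, under bivariate normality."""
    expected = mean + r * (first_score - mean)   # regression toward the mean
    conditional_sd = sd * sqrt(1 - r ** 2)       # spread around that expectation
    return 1 - NormalDist(expected, conditional_sd).cdf(first_score)

p = prob_higher_on_second_test(71, 0.9)
print(f"P(second score > 71 | first score = 71) = {p:.2f}")
```

With these assumed values, about two thirds of people who score 71 on the first test would be expected to score higher than 71 on the second.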

This procedure should not be applied mechanically in all situations. The method assumes that each score is equally valid and thus has equal weight. There are reasons to prefer some IQ administrations over others (e.g., a full battery given by a licensed clinician is likely to be more accurate than an abbreviated IQ test given by a first-year graduate student). If there are reasons to dismiss a particular score (e.g., the evaluee intentionally tried to obtain a low score), it should not figure into the composite score. There are further complications not discussed here such as the fact that people tend to score higher when retested with the same test (or one that is very similar).


# Conditional normal distributions provide useful information in psychological assessment

Conditional Normal Distribution

Conditional normal distributions are really useful in psychological assessment. We can use them to answer questions like:

• How unusual is it for someone with a vocabulary score of 120 to have a score of 90 or lower on reading comprehension?
• If that person also has a score of 80 on a test of working memory capacity, how much does the risk of scoring 90 or lower on reading comprehension increase?

What follows might be mathematically daunting. Just let it wash over you if it becomes confusing. At the end, there is a video in which I will show how to use a spreadsheet that will do all the calculations.

# Unconditional Normal Distributions

Suppose that variable Y represents reading comprehension test scores. Here is a fancy way of saying Y is normally distributed with a mean of 100 and a standard deviation of 15:

$Y\sim N(100,15^2)$

In this notation, “~” means “is distributed as,” and N means normally distributed with a particular mean (μ) and variance (σ²).

If we know literally nothing about a person from this population, our best guess is that the person’s reading comprehension score is at the population mean. One way to say this is that the person’s expected value on reading comprehension is the population mean:

$E(Y)=\mu_Y = 100$

The 95% confidence interval around this guess is:

$95\%\, \text{CI} = \mu_Y \pm z_{95\%} \sigma_Y$

$95\%\, \text{CI} \approx 100 \pm 1.96*15 = 70.6 \text{ to } 129.4$

Unconditional Normal Distribution with 95% CI

# Conditional Normal Distributions

## Simple Linear Regression

Now, suppose that we know one thing about the person: the person’s score on a vocabulary test. We can let X represent the vocabulary score and its distribution is the same as that of Y:

$X\sim N(100,15^2)$

If we know that this person scored 120 on vocabulary (X), what is our best guess as to what the person scored on reading comprehension (Y)? This guess is a conditional expected value. It is “conditional” in the sense that the expected value of Y depends on what value X has. The pipe symbol “|” is used to note a condition like so:

$E(Y|X=120)$

This means, “What is our best guess for Y if X is 120?”

What if we don’t want to be specific about the value of X but want to refer to any particular value of X? Oddly enough, it is traditional to use the lowercase x for that. So, X refers to the variable as a whole and x refers to any particular value of variable X. So if I know that variable X happens to be a particular value x, the expected value of Y is:

$E(Y|X=x)=\sigma_Y \rho_{XY}\dfrac{x-\mu_X}{\sigma_X}+\mu_Y$

where $\rho_{XY}$ is the correlation between X and Y.

You might recognize that this is a linear regression formula and that:

$E(Y|X=x)=\hat{Y}$

where “Y-hat” (Ŷ) is the predicted value of Y when X is known.

Let’s assume that the relationship between X and Y is bivariate normal like in the image at the top of the post:

$\begin{bmatrix}X\\Y\end{bmatrix}\sim N\left(\begin{bmatrix}\mu_X\\ \mu_Y\end{bmatrix},\begin{bmatrix}\sigma_X^2&\rho_{XY}\sigma_X\sigma_Y\\ \rho_{XY}\sigma_X\sigma_Y&\sigma_Y^2\end{bmatrix}\right)$

The first term in the parentheses is the vector of means and the second term (the square matrix in the brackets) is the covariance matrix of X and Y. It is not necessary to understand the notation. The main point is that X and Y are both normal, they have a linear relationship, and the conditional variance of Y at any value of X is the same.

The conditional standard deviation of Y at any particular value of X is:

$\sigma_{Y|X=x}=\sigma_Y\sqrt{1-\rho_{XY}^2}$

This is the standard deviation of the conditional normal distribution. In the language of regression, it is the standard error of the estimate ($\sigma_e$). It is the standard deviation of the residuals (errors). Residuals are simply the amount by which your guesses differ from the actual values.

$e = y - E(Y|X=x)=y-\hat{Y}$

So,

$\sigma_{Y|X=x}=\sigma_e$

So, putting all this together, we can answer our question:

How unusual is it for someone with a vocabulary score of 120 to have a score of 90 or lower on reading comprehension?

The expected value of Y (Ŷ) is:

$E(Y|X=120)=15\rho_{XY}\dfrac{120-100}{15}+100$

Suppose that the correlation is 0.5. Therefore,

$E(Y|X=120)=15*0.5\dfrac{120-100}{15}+100=110$

This means that among all the people with a vocabulary score of 120, the average is 110 on reading comprehension. Now, how far off from that is 90?

$e= y - \hat{Y}=90-110=-20$

What is the standard error of the estimate?

$\sigma_{Y|X=x}=\sigma_e=\sigma_Y\sqrt{1-\rho_{XY}^2}$

$\sigma_{Y|X=x}=\sigma_e=15\sqrt{1-0.5^2}\approx12.99$

Dividing the residual by the standard error of the estimate (the standard deviation of the conditional normal distribution) gives us a z-score. It represents how far from expectations this individual is in standard deviation units.

$z=\dfrac{e}{\sigma_e} \approx\dfrac{-20}{12.99}\approx -1.54$

Using the standard normal cumulative distribution function (Φ) gives us the proportion of people scoring 90 or less on reading comprehension (given a vocabulary score of 120).

$\Phi(z)\approx\Phi(-1.54)\approx 0.06$

In Microsoft Excel, the standard normal cumulative distribution function is NORMSDIST. Thus, entering this into any cell will give the answer:

=NORMSDIST(-1.54)
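The same calculation can be sketched outside of a spreadsheet. This Python version assumes bivariate normality and uses the values from the example (both variables with mean 100 and SD 15, correlation 0.5); the function name `conditional_normal` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def conditional_normal(x, rho, mu_x=100.0, sd_x=15.0, mu_y=100.0, sd_y=15.0):
    """Distribution of Y given X = x, assuming X and Y are bivariate normal."""
    expected_y = sd_y * rho * (x - mu_x) / sd_x + mu_y   # E(Y | X = x)
    se_estimate = sd_y * sqrt(1 - rho ** 2)              # standard error of the estimate
    return NormalDist(expected_y, se_estimate)

dist = conditional_normal(x=120, rho=0.5)
p = dist.cdf(90)   # proportion scoring 90 or lower on reading comprehension

print(f"E(Y | X = 120) = {dist.mean:.1f}")
print(f"SD(Y | X = 120) = {dist.stdev:.2f}")
print(f"P(Y <= 90 | X = 120) = {p:.3f}")
```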

Conditional normal distribution when Vocabulary = 120

## Multiple Regression

What proportion of people score 90 or less on reading comprehension if their vocabulary is 120 but their working memory capacity is 80?

Let’s call vocabulary $X_1$ and working memory capacity $X_2$. Let’s suppose they correlate at 0.3. The correlation matrix among the predictors ($\mathbf{R_X}$) is:

$\mathbf{R_X}=\begin{bmatrix}1&\rho_{12}\\ \rho_{12}&1\end{bmatrix}=\begin{bmatrix}1&0.3\\ 0.3&1\end{bmatrix}$

The validity coefficients are the correlations of Y with both predictors ($\mathbf{R}_{XY}$):

$\mathbf{R}_{XY}=\begin{bmatrix}\rho_{Y1}\\ \rho_{Y2}\end{bmatrix}=\begin{bmatrix}0.5\\ 0.4\end{bmatrix}$

The standardized regression coefficients (β) are:

$\pmb{\mathbf{\beta}}=\mathbf{R_{X}}^{-1}\mathbf{R}_{XY}\approx\begin{bmatrix}0.418\\ 0.275\end{bmatrix}$

Unstandardized coefficients can be obtained by multiplying the standardized coefficients by the standard deviation of Y ($\sigma_Y$) and dividing by the standard deviations of the predictors ($\sigma_X$):

$\mathbf{b}=\sigma_Y\pmb{\mathbf{\beta}}/\pmb{\mathbf{\sigma}}_X$

However, in this case all the variables have the same metric and thus the unstandardized and standardized coefficients are the same.

The vector of predictor means ($\mu_X$) is used to calculate the intercept ($b_0$):

$b_0=\mu_Y-\mathbf{b}' \pmb{\mathbf{\mu}}_X$

$b_0\approx 100-\begin{bmatrix}0.418\\ 0.275\end{bmatrix}^{'} \begin{bmatrix}100\\ 100\end{bmatrix}\approx 30.769$

The predicted score when vocabulary is 120 and working memory capacity is 80 is:

$\hat{Y}=b_0 + b_1 X_1 + b_2 X_2$

$\hat{Y}\approx 30.769+0.418*120+0.275*80\approx 102.9$

The error in this case is 90 − 102.9 = −12.9.

The multiple R² is calculated with the standardized regression coefficients and the validity coefficients.

$R^2 = \pmb{\mathbf{\beta}}'\pmb{\mathbf{R}}_{XY}\approx\begin{bmatrix}0.418\\ 0.275\end{bmatrix}^{'} \begin{bmatrix}0.5\\ 0.4\end{bmatrix}\approx0.319$

The standard error of the estimate is thus:

$\sigma_e=\sigma_Y\sqrt{1-R^2}\approx 15\sqrt{1-0.319}\approx 12.38$

The proportion of people with vocabulary = 120 and working memory capacity = 80 who score 90 or less is:

$\Phi\left(\dfrac{e}{\sigma_e}\right)\approx\Phi\left(\dfrac{-12.9}{12.38}\right)\approx 0.15$
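For readers who prefer code to matrix notation, here is a minimal Python sketch of the two-predictor case. It assumes multivariate normality and that all three variables share the same metric (mean 100, SD 15), so the standardized and unstandardized coefficients coincide. The 2 × 2 matrix inverse is written out by hand; with more predictors a linear algebra library would be needed. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_predictor_conditional(x1, x2, r12, ry1, ry2, mean=100.0, sd=15.0):
    """Distribution of Y given two predictors, all on the same metric."""
    det = 1 - r12 ** 2                    # determinant of the predictor correlation matrix
    beta1 = (ry1 - r12 * ry2) / det       # standardized coefficients: inverse(R_X) times R_XY
    beta2 = (ry2 - r12 * ry1) / det
    r_squared = beta1 * ry1 + beta2 * ry2
    b0 = mean - (beta1 + beta2) * mean    # intercept (b = beta because the metrics match)
    y_hat = b0 + beta1 * x1 + beta2 * x2
    se_estimate = sd * sqrt(1 - r_squared)
    return NormalDist(y_hat, se_estimate), (beta1, beta2), r_squared

dist, betas, r_squared = two_predictor_conditional(
    x1=120, x2=80, r12=0.3, ry1=0.5, ry2=0.4
)
p = dist.cdf(90)

print(f"betas = {betas[0]:.3f}, {betas[1]:.3f}; R squared = {r_squared:.3f}")
print(f"predicted Y = {dist.mean:.1f}, standard error of estimate = {dist.stdev:.2f}")
print(f"P(Y <= 90) = {p:.2f}")
```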


# Guttman’s Radex Model of Intelligence

Louis Guttman formulated the first radex model of intelligence, which he revised often. In his last update, Guttman and Levy (1991) distinguished among different kinds of test content modalities (verbal, numerical, geometrical), different kinds of mental processes (rule induction, rule application, and rule practice, or learning), and different modes of expression in examinee performance (oral, manual manipulative, paper and pencil). Tests of any modality correlate highly if they involve rule induction, but tests that involve rule application or rule practice tend to have lower correlations across modalities.

I have adapted one of Guttman & Levy’s figures below. Technically it is a cylindrex (a series of stacked radexes):

It is possible to represent the radex model as a path diagram. It has the advantage of being more precise in some ways but it fails to show that not all facets behave in the same way.

Guttman and Levy used the WISC-R, but these findings were replicated and extended using the WISC-IV in a study by Cohen, Fiorello, and Farley (2007).

Many other kinds of radex models have been proposed but probably the most influential has been that of Marshalek, Snow, & Lohman (1983). The center of the radex is conceptualized as “cognitive complexity” and corresponds to the g factor.


# Reliability coefficients are for squares. Confidence interval widths tell it to you straight.

The more reliable a score is, the more certain we can be about what it means (provided its validity is close to its reliability). Certain rules-of-thumb about score reliability are sometimes proposed:

• Base high-stakes decisions only on scores with reliability coefficients of 0.98 or better.
• Base substantive interpretations on scores with reliability coefficients of 0.90 or better.
• Base decisions to give more tests or not on scores with reliability coefficients of 0.80 or more.

Such guidelines seem reasonable to me, but I do not find reliability coefficients to be intuitively easy to understand. How much uncertainty is associated with a reliability coefficient of 0.80? The value of the coefficient (0.80) is not directly informative about individual scores. Instead, it refers to the correlation the scores have with a (more often than not hypothetical) repeated measurement.

Another way to think about the reliability coefficient is that it is a ratio of true score variance to observed score variance. Variance is the average squared deviation from the mean. Squared quantities are not easy to think about for most of us. For this reason, I prefer to convert reliability coefficients into confidence interval widths. Confidence interval widths and reliability coefficients have a non-linear relationship:

$\text{CI Width}=2z_{\text{CI\%}}\sigma_{x}\sqrt{r_{xx}-r^2_{xx}}$

Where:

$z_{\text{CI\%}}$ is the z-score associated with the level of confidence you want (e.g., 1.96 for a 95% confidence interval)

$\sigma_{x}$ is the standard deviation of X

$r_{xx}$ is the classical test theory reliability coefficient for X

For index scores (μ = 100, σ = 15), a reliability coefficient of 0.80 is associated with a 95% confidence interval that is about 24 points wide. That to me is much more informative than knowing that 80% of the variance is reliable.

Calculating the lower and upper bounds of a confidence interval for a score looks complex with all the symbols and subscripts, but after doing it a few times, it is not so bad. Basically, you are converting your score to a z-score, multiplying it by the reliability coefficient, and then adding (or subtracting) the margin of error, then converting everything back to the original metric.

$\text{CI} = \sigma_x(r_{xx}\frac{X-\mu_x}{\sigma_x} \pm z_{\text{CI\%}}\sqrt{r_{xx}-r^2_{xx}}\,)+\mu_x$
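Both formulas translate directly into a few lines of Python. This is a minimal sketch, assuming the classical test theory formulas above and index scores by default; the function names `ci_width` and `true_score_ci` are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(reliability, sd=15.0, confidence=0.95):
    """Width of the confidence interval implied by a reliability coefficient."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return 2 * z * sd * sqrt(reliability - reliability ** 2)

def true_score_ci(score, reliability, mean=100.0, sd=15.0, confidence=0.95):
    """Lower and upper bounds of the confidence interval for an observed score."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    center = reliability * (score - mean) + mean     # score pulled toward the mean
    margin = z * sd * sqrt(reliability - reliability ** 2)
    return center - margin, center + margin

width = ci_width(0.80)
lower, upper = true_score_ci(70, 0.80)
print(f"95% CI width at r = .80: {width:.1f} points")
print(f"95% CI for an observed score of 70: {lower:.1f} to {upper:.1f}")
```

Note that the interval is centered on the estimated true score rather than on the observed score, which is why the interval for an observed 70 is not symmetric around 70.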

The animated graph below shows the non-linear relationship between reliability and 95% confidence interval widths for different observed index scores. The confidence interval width narrows slowly at first and then quickly as the reliability coefficient approaches 1.

Paying close attention to confidence intervals allows you to do away with rough rules-of-thumb about reliability and make more direct and accurate interpretations about individual scores.


# A Gentle, Non-Technical Introduction to Factor Analysis

When measuring characteristics of physical objects, there may be some disagreement about the best methods to use but there is little disagreement about which dimensions are being measured. We know that we are measuring length when we use a ruler and we know that we are measuring temperature when we use a thermometer. It is true that heating some materials makes them expand but we are virtually never confused about whether heat and length represent distinct dimensions that are independent of each other. That is, they are independent of each other in the sense that things can be cold and long, cold and short, hot and long, or hot and short.

Unfortunately, we are not nearly as clear about what we are measuring when we attempt to measure psychological dimensions such as personality traits, motivations, beliefs, attitudes, and cognitive abilities. Psychologists often disagree not only about what to name these dimensions but also about how many dimensions there are to measure. For example, you might think that there exists a personality trait called niceness. Another person might disagree with you, arguing that niceness is a vague term that lumps together 2 related but distinguishable traits called friendliness and kindness. Another person could claim that kindness is too general and that we must separate kindness with friends from kindness with strangers.

As you might imagine, these kinds of arguments can quickly lead to hypothesizing the existence of as many different traits as our imaginations can generate. The result would be a hopeless confusion among psychological researchers because they would have no way to agree on what to measure so that they can build upon one another’s findings. Fortunately, there are ways to put some limits on the number of psychological dimensions and come to some degree of consensus about what should be measured. One of the most commonly used of such methods is called factor analysis.

Although the mathematics of factor analysis is complicated, the logic behind it is not difficult to understand. The assumption behind factor analysis is that things that co-occur tend to have a common cause (but not always). For example, fevers, sore throats, stuffy noses, coughs, and sneezes tend to occur at roughly the same time in the same person. Often, they are caused by the same thing, namely, the virus that causes the common cold. Note that although the virus is one thing, its manifestations are quite diverse. In psychological assessment research, we measure a diverse set of abilities, behaviors and symptoms and attempt to deduce which underlying dimensions cause or account for the variations in behavior and symptoms we observe in large groups of people. We measure the relations between various behaviors, symptoms, and test scores with correlation coefficients and use factor analysis to discover patterns of correlation coefficients that suggest the existence of underlying psychological dimensions.

All else being equal, a simple theory is better than a complicated theory. Therefore, factor analysis helps us discover the smallest number of psychological dimensions (i.e., factors) that can account for the correlation patterns in the various behaviors, symptoms, and test scores we observe. For example, imagine that we create 4 different tests that would measure people’s knowledge of vocabulary, grammar, arithmetic, and geometry. If the correlations between all of these tests were 0 (i.e., high scorers on one test are no more likely to score high on the other tests than low scorers), then the factor analysis would suggest to us that we have measured 4 distinct abilities. In the picture below, the correlations between all the tests are displayed in the table. Below that, the theoretical model that would be implied is that there are 4 abilities (shown as ellipses) that influence performance on 4 tests (shown as rectangles). The numbers beside the arrows imply that the abilities and the tests have high but imperfect correlations of 0.9.

Of course, you probably recognize that it is very unlikely that the correlations between these tests would be 0. Therefore, imagine that the correlation between the vocabulary and grammar tests is quite high (i.e., high scorers on vocabulary are likely to also score high on grammar and low scorers on vocabulary are likely to score low on grammar). The correlation between arithmetic and geometry is high also. Furthermore, the correlations between the language tests and the mathematics tests are 0. Factor analysis would suggest that we have measured not 4 distinct abilities but rather 2 abilities. Researchers interpreting the results of the factor analysis would have to use their best judgment to decide what to call these 2 abilities. In this case, it would seem reasonable to call them language ability and mathematical ability. These 2 abilities (shown below as ellipses) influence performance on 4 tests (shown as rectangles).

Now imagine that the correlations between all 4 tests are equally high. That is, for example, vocabulary is just as strongly correlated with geometry as it is with grammar. In this case, factor analysis would suggest that the simplest explanation for this pattern of correlations is that there is just 1 factor that causes all of these tests to be equally correlated. We might call this factor general academic ability.

In reality, if you were to actually measure these 4 abilities, the results would not be so clear. It is likely that all of the correlations would be positive and substantially above 0. It is also likely that the language subtests would correlate more strongly with each other than with the mathematical subtests. In such a case, factor analysis would suggest that language and mathematical abilities are distinct but not entirely independent from each other. That is, language abilities and mathematics abilities are substantially correlated with each other, suggesting that a general academic (or intellectual) ability influences performance in all academic areas. In this model, abilities are arranged in hierarchies with general abilities influencing narrow abilities.

# Exploratory Factor Analysis

Factor analysis can help researchers decide how best to summarize large amounts of information about people using just a few scores. For example, when we ask parents to complete questionnaires about behavior problems their children might have, the questionnaires can have hundreds of items. It would take too long and would be too confusing to review every item. Factor analysis can simplify the information while minimizing the loss of detail. Here is an example of a short questionnaire that factor analysis can be used to summarize.

On a scale of 1 to 5, compared to other children his or her age, my child:

1. gets in fights frequently at school
2. is defiant to adults
3. is very impulsive
4. has stomachaches frequently
5. is anxious about many things
6. appears sad much of the time

If we give this questionnaire to a large, representative sample of parents, we can calculate the correlations between the items:

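| Item | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1. gets in fights frequently at school | | | | | | |
| 2. is defiant to adults | .81 | | | | | |
| 3. is very impulsive | .79 | .75 | | | | |
| 4. has stomachaches frequently | .42 | .38 | .36 | | | |
| 5. is anxious about many things | .39 | .34 | .34 | .77 | | |
| 6. appears sad much of the time | .37 | .34 | .32 | .77 | .74 | |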

Using this set of correlation coefficients, factor analysis suggests that there are 2 factors being measured by this behavior rating scale. The logic of factor analysis suggests that the reason items 1-3 have high correlations with each other is that each of them has a high correlation with the first factor. Similarly, items 4-6 have high correlations with each other because they have high correlations with the second factor. The correlations that the items have with the hypothesized factors are called factor loadings. The factor loadings can be seen in the chart below:

| Item | Factor 1 | Factor 2 |
|---|---|---|
| 1. gets in fights frequently at school | .91 | .03 |
| 2. is defiant to adults | .88 | -.01 |
| 3. is very impulsive | .86 | -.01 |
| 4. has stomachaches frequently | .02 | .89 |
| 5. is anxious about many things | .01 | .86 |
| 6. appears sad much of the time | -.02 | .87 |

Factor analysis tells us which items “load” on which factors but it cannot interpret the meaning of the factors. Usually researchers look at all of the items that load on a factor and use their intuition or knowledge of theory to identify what the items have in common. In this case, Factor 1 could receive any number of names such as Conduct Problems, Acting Out, or Externalizing Behaviors. Likewise, Factor 2 could be called Mood Problems, Negative Affectivity, or Internalizing Behaviors. Thus, the problems on this behavior rating scale can be summarized fairly efficiently with just 2 scores. In this example, a reduction of 6 scores to 2 scores may not seem terribly useful. In actual behavior rating scales, factor analysis can reduce the overwhelming complexity of hundreds of different behavior problems to a more manageable number of scores that help professionals more easily conceptualize individual cases.

It should be noted that factor analysis also calculates the correlation among factors. If a large number of factors are identified and there are substantial correlations (i.e., significantly larger than 0) among factors, this new correlation matrix can be factor analyzed also to obtain second-order factors. These factors, in turn, can be analyzed to obtain third-order factors. Theoretically, it is possible to have even higher order factors but most researchers rarely find it necessary to go beyond third-order factors. The g-factor from intelligence test data is an example of a third-order factor that emerges because all tests of cognitive abilities are positively correlated. In our example above, the 2 factors have a correlation of .46, suggesting that children who have externalizing problems are also at risk of having internalizing problems. It is therefore reasonable to calculate a second-order factor score that measures the overall level of behavior problems.
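One way to see how loadings and factor correlations work together is to rebuild the item correlations from them. The Python sketch below does this for the example questionnaire. It assumes the loadings above are pattern coefficients from a two-factor solution in which the factors correlate at .46; under that assumption, the model-implied correlation between two items is the sum of the products of their loadings, plus cross-factor products weighted by the factor correlation. The function name is illustrative.

```python
# Loadings (Factor 1, Factor 2) for items 1 to 6, copied from the example above.
loadings = [
    (0.91, 0.03),   # 1. gets in fights frequently at school
    (0.88, -0.01),  # 2. is defiant to adults
    (0.86, -0.01),  # 3. is very impulsive
    (0.02, 0.89),   # 4. has stomachaches frequently
    (0.01, 0.86),   # 5. is anxious about many things
    (-0.02, 0.87),  # 6. appears sad much of the time
]
factor_r = 0.46  # correlation between the 2 factors

def implied_correlation(a, b, phi):
    """Model-implied correlation between two items in a two-factor model.

    a, b: (Factor 1, Factor 2) loadings of each item; phi: factor correlation.
    """
    same_factor = a[0] * b[0] + a[1] * b[1]
    cross_factor = phi * (a[0] * b[1] + a[1] * b[0])
    return same_factor + cross_factor

r_12 = implied_correlation(loadings[0], loadings[1], factor_r)  # two externalizing items
r_14 = implied_correlation(loadings[0], loadings[3], factor_r)  # one item from each factor

print(f"Implied r between items 1 and 2: {r_12:.2f}")
print(f"Implied r between items 1 and 4: {r_14:.2f}")
```

The implied values land close to the observed correlations in the table (.81 and .42), which is what it means for 2 correlated factors to account for the pattern among the 6 items.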

This example illustrates the most commonly used type of factor analysis: exploratory factor analysis. Exploratory factor analysis is helpful when we wish to summarize data efficiently, we are not sure how many factors are present in our data, or we are not sure which items load on which factors.

# Confirmatory Factor Analysis

Confirmatory factor analysis is a method that researchers can use to test highly specific hypotheses. For example, a researcher might want to know if the 2 different types of items on the WISC-IV Digit Span subtest measure the same ability or 2 different abilities. On the Digits Forward type of item, the child must repeat a string of digits in the same order in which they were heard. On the Digits Backward type of item, the child must repeat the string of digits in reverse order. Some researchers believe that repeating numbers verbatim measures auditory short-term memory storage capacity and that repeating numbers in reverse order measures executive control, the ability to allocate attentional resources efficiently to solve multi-step problems. Typically, clinicians add the raw scores of both types of items to produce a single score. If the 2 item types measure different abilities, adding the raw scores together is like adding apples and orangutans. If, however, they measure the same ability, adding the scores together is valid and will produce a more reliable score than using separate scores.

To test this hypothesis, we can use confirmatory factor analysis to see if the 2 item types measure different abilities. We would need to identify or invent several tests that are likely to measure the 2 separate abilities that we believe are measured by the 2 types of Digit Span items. Usually, using 3 tests per factor is sufficient.

Next, we specify the hypotheses, or models, we wish to test:

1. All of the tests measure the same ability. A graphical representation of a hypothesis in confirmatory factor analysis is called a path diagram. Tests are drawn with rectangles and hypothetical factors are drawn with ovals. The correlations between tests and factors are drawn with arrows. The path diagram for this hypothesis would look like this:

2. Both Digits Forward and Digits Backward measure short-term memory storage capacity and are distinct from executive control. The path diagram would look like this (the curved arrow allows for the possibility that the 2 factors might be correlated):

3. Digits Forward and Digits Backward measure different abilities. The path diagram would look like this:

Confirmatory factor analysis produces a number of statistics, called fit statistics, that tell us which of the models or hypotheses we tested are most in agreement with the data. Studying the results, we can select the best model or perhaps generate a new model if none of them provides a good “fit” with the data. With structural equation modeling, a procedure that is very similar to confirmatory factor analysis, we can test extremely complex hypotheses about the structure of psychological variables.

This post is a revised version of a tutorial I originally prepared for Cohen & Swerdlik’s Psychological Testing and Assessment: An Introduction To Tests and Measurement.


# g Factor Removed from Correlation Matrices, Visualized

As a follow up to yesterday’s post, I extracted a g factor from the matrix of each battery and made these pictures of the residual matrices. I filtered out all the negative residuals to de-clutter the image.

I am not sure what can be learned from such pictures other than getting a sense of the magnitudes of the differences in strength of the different factors. You can see that Gc is generally much stronger than the other factors (except in the case of the SB5).


# Correlation Matrices from Five Cognitive Ability Tests, Visualized

Sometimes it is interesting to look at something familiar in a new way. Here are the correlations among the subtests of five major cognitive ability batteries (data comes from the standardization samples). Stronger correlations are thicker and darker. What do you see?