]]>

Schneider, W. J. (2016). *Why Are WJ IV Cluster Scores More Extreme Than the Average of Their Parts? A Gentle Explanation of the Composite Score Extremity Effect* (Woodcock-Johnson IV Assessment Service Bulletin No. 7). Itasca, IL: Houghton Mifflin Harcourt.

I thank Mark Ledbetter for the invitation to write the paper and support in the writing process, Erica LaForte for patiently editing a complex first draft down to a much more readable version, and Kevin McGrew for additional thoughtful comments and suggestions for improvement on the first draft.

The bulk of the paper is not mathematical. However, the first draft had a few bells and whistles like the animated graph below that shows how the composite score extremity effect is larger as the average correlation among the tests decreases and the number of tests in the composite increases.
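The pattern in that graph can be sketched numerically. This is a minimal illustration, not code from the bulletin: it assumes an equally weighted composite in which every test has the same z-score and every pair of tests has the same correlation, and the function name `composite_z` is hypothetical.

```r
# Hypothetical helper: z-score of an equally weighted composite of k tests,
# assuming every test is at the same z-score and every pair of tests
# correlates at rho (a simplifying assumption for illustration).
composite_z <- function(z, k, rho) {
  # The sum of k standardized scores has variance k + k * (k - 1) * rho
  k * z / sqrt(k + k * (k - 1) * rho)
}

# Every test is 2 SD below the mean; vary the number of tests and correlation
grid <- expand.grid(k = c(2, 4, 8), rho = c(0.3, 0.5, 0.7))
grid$composite <- round(composite_z(-2, grid$k, grid$rho), 2)
print(grid)
```

Under these assumptions, the composite moves further from the mean than −2 as `k` grows or as `rho` shrinks, which is the extremity effect described above.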

Another plot that was originally animated shows our best guess of a latent variable *X* when we have two indicators, *X*_{1} and *X*_{2}, that are both exactly 2 standard deviations below the mean. *X*_{1} and *X*_{2} correlate with each other at 0.64 and with *X* at 0.8. If we only know that *X*_{1} = −2, our best guess is that *X* is −1.60. If we know that both *X*_{1} and *X*_{2} are −2, our best guess is that *X* is −1.95. Thus, our estimate is lower with two scores (−1.95) than with one score (−1.60).
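These two estimates can be checked with a few lines of R. The sketch assumes that "best guess" means the linear regression estimate on standardized variables; the variable names are illustrative only.

```r
# Correlations among the two indicators (X1, X2), as stated in the text
R_xx <- matrix(c(1.00, 0.64,
                 0.64, 1.00), nrow = 2, byrow = TRUE)
# Correlation of each indicator with the latent variable X
r_xy <- c(0.8, 0.8)
# Observed z-scores on both indicators
z <- c(-2, -2)

# One indicator: the best linear guess is the correlation times the z-score
guess_one <- r_xy[1] * z[1]
# Two indicators: standardized regression weights solve R_xx %*% beta = r_xy
beta <- solve(R_xx, r_xy)
guess_two <- sum(beta * z)

round(c(one_indicator = guess_one, two_indicators = guess_two), 2)
```

Each indicator receives a weight of 0.8 / 1.64 ≈ 0.49, so two scores of −2 yield an estimate of about −1.95, compared with −1.60 from one score.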

]]>

No matter which tests I have given, I would like to be able to combine them into theoretically valid composite scores. For example, on the WISC-V, the Verbal Comprehension Index (VCI) consists of two subtest scores, Vocabulary and Similarities. However, the Information and Comprehension subtests measure verbal knowledge just as well as the other two tests. We should be able to combine them with the two VCI subtests to make a more valid estimate of verbal knowledge.

The good news is that the WISC-V now allows us to do just that. It has two expanded composite scores:

- Verbal Expanded Crystallized Index (VECI)
  - Similarities
  - Vocabulary
  - Information
  - Comprehension
- Expanded Fluid Index (EFI)
  - Matrix Reasoning
  - Picture Concepts
  - Figure Weights
  - Arithmetic

At the risk of sounding greedy, I would like to have an expanded working memory index (Digit Span, Picture Span, and Letter-Number Sequencing) and an expanded processing speed index (Coding, Symbol Search, and Cancellation). Even so, I am grateful for this improvement in the WISC-V.

]]>

The authors of this study ask whether children with learning disorders have the same structure of intelligence as children in the general population. This might seem like an important question, but it is not—if the difference in structure is embedded in the very definition of learning disorders.

Imagine that a highly respected medical journal published a study titled *Tall People Are Significantly Greater in Height than People in the General Population*. Puzzled and intrigued, you decide to investigate. You find that the authors solicited medical records from physicians who labelled their patients as tall. The primary finding is that such patients have, on average, greater height than people in the general population. The authors speculate that the instruments used to measure height may be less accurate for tall people and suggest alternative measures of height for them.

This imaginary study is clearly ridiculous. No researcher would publish such a “finding” because it is not a finding. People who are tall have greater height than average by definition. There is no reason to suppose that the instruments used were inaccurate.

It is not so easy to recognize that Giofrè and Cornoldi applied the same flawed logic to children with learning disorders and the structure of intelligence. Their primary finding is that in a sample of Italian children with clinical diagnoses of specific learning disorder, the four index scores of the WISC-IV have lower *g*-loadings than they do in the general population in Italy. The authors believe that this result implies that alternative measures of intelligence might be more appropriate than the WISC-IV for children with specific learning disorders.

What is the problem with this logic? The problem is that the WISC-IV was one of the tools used to diagnose the children in the first place. Having unusual patterns somewhere in one’s cognitive profile is part of the traditional definition of learning disorders. If the structure of intelligence were the same in this group, we would wonder if the children had been properly diagnosed. This is not a “finding” but an inevitable consequence of the traditional definition of learning disorders. Had the same study been conducted with any other cognitive ability battery, the same results would have been found.

A diagnosis of a learning disorder is often given when a child of broadly average intelligence has low academic achievement due to specific cognitive processing deficits. To have specific cognitive processing deficits, there must be one or more specific cognitive abilities that are low compared to the population and also to the child’s other abilities. For example, in the profile below, the WISC-IV Processing Speed Index of 68 is much lower than the other three WISC-IV index scores, which are broadly average. Furthermore, the low processing speed score is a possible explanation of the low Reading Fluency score.

The profile above is unusual. The Processing Speed (PS) score is unexpectedly low compared to the other three index scores. This is just one of many unusual score patterns that clinicians look for when they diagnose specific learning disorders. When we gather together all the unusual WISC-IV profiles in which at least one score is low but others are average or better, it comes as no surprise that the structure of the scores in the sample is unusual. Because the scores are unusually scattered, they are less correlated, which implies lower *g*-loadings.

Suppose that the WISC-IV index scores have the correlations below (taken from the U.S. standardization sample, age 14).

| | VC | PR | WM | PS |
|---|---|---|---|---|
| VC | 1.00 | 0.59 | 0.59 | 0.37 |
| PR | 0.59 | 1.00 | 0.48 | 0.45 |
| WM | 0.59 | 0.48 | 1.00 | 0.39 |
| PS | 0.37 | 0.45 | 0.39 | 1.00 |

Now suppose that we select from the general population an “LD” sample consisting of all profiles that meet three criteria:

- At least one score is less than 90.
- The average of the remaining three scores is greater than 90.
- The average of the three highest scores is at least 15 points higher than the lowest score.

Obviously, LD diagnosis is more complex than this. The point is that we are selecting from the general population a group of people with unusual profiles and observing that the correlation matrix is different in the selected group. Using the R code at the end of the post, we see that the correlation matrix is:

| | VC | PR | WM | PS |
|---|---|---|---|---|
| VC | 1.00 | 0.15 | 0.18 | −0.30 |
| PR | 0.15 | 1.00 | 0.10 | −0.07 |
| WM | 0.18 | 0.10 | 1.00 | −0.20 |
| PS | −0.30 | −0.07 | −0.20 | 1.00 |

A single-factor confirmatory factor analysis of the two correlation matrices reveals dramatically lower *g*-loadings in the “LD” sample.

| | Whole Sample | “LD” Sample |
|---|---|---|
| VC | 0.80 | 0.60 |
| PR | 0.73 | 0.16 |
| WM | 0.71 | 0.32 |
| PS | 0.53 | −0.51 |

Because the PS factor has the lowest *g*-loading in the whole sample, it is most frequently the score that is out of sync with the others and thus is negatively correlated with the other tests in the “LD” sample.

In the paper referenced above, the reduction in *g*-loadings was not nearly as severe as in this demonstration, most likely because clinicians frequently observe specific processing deficits in tests outside the WISC. Thus many people with learning disorders have perfectly normal-looking WISC profiles; their deficits lie elsewhere. A mixture of ordinary and unusual WISC profiles can easily produce the moderately lowered *g*-loadings observed in the paper.

In general, one cannot select a sample based on a particular measure and then report as an empirical finding that the sample differs from the population on that same measure. I understand that in this case it was not immediately obvious that the selection procedure would inevitably alter the correlations among the WISC-IV factors. It is clear that the authors of the paper submitted their research in good faith. However, I wish that the reviewers had noticed the problem and informed the authors that the paper was fundamentally flawed. Therefore, this study offers no valid evidence that casts doubt on the appropriateness of the WISC-IV for children with learning disorders. The same results would have occurred with any cognitive battery, including those recommended by the authors as alternatives to the WISC-IV.

```
# Correlation matrix from the U.S. standardization sample, age 14
WISC <- matrix(c(
  1.00, 0.59, 0.59, 0.37, # VC
  0.59, 1.00, 0.48, 0.45, # PR
  0.59, 0.48, 1.00, 0.39, # WM
  0.37, 0.45, 0.39, 1.00), # PS
  nrow = 4, byrow = TRUE)
colnames(WISC) <- rownames(WISC) <- c("VC", "PR", "WM", "PS")
# Set the random seed to obtain consistent results
set.seed(1)
# Generate 100,000 simulated profiles as index scores (mean 100, SD 15)
x <- as.data.frame(mvtnorm::rmvnorm(100000, sigma = WISC) * 15 + 100)
colnames(x) <- colnames(WISC)
# Lowest score in each profile
minSS <- apply(x, 1, min)
# Mean of the remaining three scores
meanSS <- (rowSums(x) - minSS) / 3
# "LD" sample: lowest score below 90, mean of the others above 90,
# and a gap of more than 15 points between the two
xLD <- x[(meanSS > 90) & (minSS < 90) & (meanSS - minSS > 15), ]
# Correlation matrix of the "LD" sample
rhoLD <- cor(xLD)
round(rhoLD, 2)
# Load package for CFA analyses
library(lavaan)
# Single-factor model for CFA
m <- "g =~ VC + PR + WM + PS"
# CFA for the whole sample
summary(sem(m, x), standardized = TRUE)
# CFA for the "LD" sample
summary(sem(m, xLD), standardized = TRUE)
```

]]>

This study gives the unwarranted impression that it is a disservice to children with autism to use the WISC-IV. Let me be clear—I want to be helpful to children with autism. I certainly do not wish to do anything that hurts anyone. A naive reading of this article leads us to believe that there is an easy way to avoid causing harm (i.e., use the Raven’s Progressive Matrices test instead of the WISC-IV). In my opinion, acting on this advice does no favors to children with autism and may even result in harm.

Based on the evidence presented in the study, the average score difference between children with and without autism is smaller on Raven’s Progressive Matrices (RPM) and larger on the WISC-IV. The rhetoric of the introduction leaves the reader with the impression that the RPM is a better test of intelligence than the WISC-IV. Once we accept this, it is easy to discount the results of the WISC-IV and focus primarily on the RPM.

There is a seductive undercurrent to the argument: If you advocate for children with autism, don’t you want to show that they are more intelligent rather than less intelligent? Yes, of course! Doesn’t it seem harmful to give a test that will show that children with autism are less intelligent? It certainly seems so!

Such rhetoric reveals a fundamental misunderstanding of what individual intelligence tests like the WISC-IV are designed to do. In the vast majority of settings, they are not for *certifying* how intelligent a person is (whatever that means!). Their primary purpose is to help psychologists understand what a person can and cannot do. They are designed to help explain what is easy and what is difficult for a person so that appropriate interventions can be selected.

The WISC-IV provides a Full Scale IQ, which gives an overall summary of cognitive functions. However, it also gives more detailed information about various aspects of ability. Here is a graph I constructed from Figure 1 in the paper. In my graph, I converted percentiles to index scores and rearranged the order of the scores to facilitate interpretation.

It is clear that the difference between the two groups of children is small for the RPM. It is also small for the WISC-IV Perceptual Reasoning Index (PRI). Why is this? The RPM and the PRI are both nonverbal measures of logical reasoning (AKA *fluid intelligence*). Both the WISC-IV and the RPM tell us that, on average, children with autism perform relatively well in this domain. The RPM is a great test, but it has no more to tell us. In contrast, the WISC-IV not only tells us what children with autism, on average, do relatively well, but also what they typically have difficulty with.

It is no surprise that the largest difference is in the Verbal Comprehension Index (VCI), a measure of verbal knowledge and language comprehension. Communication problems are a major component of the definition of autism. If children with autism had performed equally well on the VCI, we would wonder whether the VCI was really measuring what it was supposed to measure. Note that I am not saying that a low score on VCI is a requirement for the diagnosis of autism or that the VCI is the best measure of the kinds of language problems that are characteristic of autism. Rather, I am saying that children with autism, on average, have difficulties with language comprehension and that this difference is manifest to some degree in the WISC-IV scores.

The WISC-IV scores also suggest that, on average, children with autism not only have lower scores in verbal knowledge and comprehension but are also more likely to have other cognitive deficits, including in verbal working memory (as measured by the WMI) and information processing speed (as measured by the PSI).

Thus, as a clinical instrument, the WISC-IV performs its purpose reasonably well. Compared to the RPM, it gives a more complete picture of the kinds of cognitive strengths and weaknesses that are common in children with autism.

If the researchers wish to demonstrate that the WISC-IV truly underestimates the intelligence of children with autism, they would need to show that it underpredicts important life outcomes among this population. For example, suppose we compare children with and without autism who score similarly low on the WISC-IV. If the WISC-IV underestimated the intelligence of children with autism, they would be expected to do better in school than the low-scoring children without autism. Obviously, a sophisticated analysis of this matter would involve a more complex research design, but in principle this is the kind of result that would be needed to show that the WISC-IV is a poor measure of cognitive abilities for children with autism.

]]>

It was an honor to be invited to participate, and it was a pleasure to be paired to work with Daniel Newman of the University of Illinois at Urbana-Champaign. Together we wrote an I/O psychology-friendly introduction to current psychometric theories of cognitive abilities, emphasizing Kevin McGrew’s CHC theory. Before that could be done, we had to articulate compelling reasons I/O psychologists should care about assessing multiple cognitive abilities. This was a harder sell than I had anticipated.

Formal cognitive testing is not a part of most hiring decisions, though I imagine that employers typically have at least a vague sense of how bright job applicants are. When the hiring process does include formal cognitive testing, typically only general ability tests are used. Robust relationships between various aspects of job performance and general ability test scores have been established.

In comparison, the idea that multiple abilities should be measured and used in personnel selection decisions has not fared well in the marketplace of ideas. To explain this, there is no need to appeal to some conspiracy of test developers. I’m sure that they would love to develop and sell large, expensive, and complex test batteries to businesses. There is also no need to suppose that I/O psychology is peculiarly infected with a particularly virulent strain of *g* zealotry and that proponents of multiple ability theories have been unfairly excluded.

To the contrary, specific ability assessment has been given quite a bit of attention in the I/O psychology literature, mostly from researchers sympathetic to the idea of going beyond the assessment of general ability. Dozens (if not hundreds) of high-quality studies were conducted to test whether using specific ability measures added useful information beyond general ability measures. In general, specific ability measures provide only modest amounts of additional information beyond what can be had from general ability scores (Δ*R*^{2} ≈ 0.02–0.06). In most cases, this incremental validity was not large enough to justify the added time, effort, and expense needed to measure multiple specific abilities. Thus it makes sense that relatively short measures of general ability have been preferred to longer, more complex measures of multiple abilities.

However, there are several reasons that the omission of specific ability tests in hiring decisions should be reexamined:

- Since the time that those high quality studies were conducted, multidimensional theories of intelligence have advanced, and we have a better sense of which specific abilities might be important for specific tasks (e.g., working memory capacity for air traffic controllers). The tests measuring these specific abilities have also improved considerably.
- With computerized administration, scoring, and interpretation, the cost of assessment and interpretation of multiple abilities is potentially far lower than it was in the past. Organizations that make use of the admittedly modest incremental validity of specific ability assessments would likely have a small but substantial advantage over organizations that do not. Over the long run, small advantages often accumulate into large advantages.
- Measurement of specific abilities opens up degrees of freedom in balancing the need to maintain the predictive validity of cognitive ability assessments and the need to reduce the adverse impact on applicants from disadvantaged minority groups that can occur when using such assessments. Thus, organizations can benefit from using cognitive ability assessments in hiring decisions without sacrificing the benefits of diversity.

The publishers of Human Resource Management Review have made our paper available to download for free until January 25th, 2015.

]]>

Recent gems:

From #251

The first caveat of writing reports is that readers will strive mightily to attach significant meaning to anything we write in the report. The second caveat is that readers will focus particularly on statements and numbers that are unimportant, potentially misleading, or — whenever possible — both. This is the voice of bitter experience.

Also from #251

Planning is so important that people are beginning to indulge in “preplanning,” which I suppose is better than “postplanning” after the fact. One activity we often do not plan is evaluations.

From #207:

I still recall one principal telling the entire team that, if he could not trust the spelling in my report, he could not trust any of the information in it. This happened recently (about 1975), so it is fresh in my mind. Names of tests are important to spell correctly. Alan and Nadeen *Kaufman* spell their last name with a single *f* and only one *n*. David *Wechsler* spelled his name as shown, never as *Weschler*. The American version of the Binet-Simon scale was developed at *Stanford* University, not *Standford*. I have to keep looking it up, but it is *Differential Ability Scales* even though it is a scale for several abilities. Richard Woodcock may, for all I know, have attended the concert, but his name is not *Woodstock*.

]]>

The text version is here.

]]>

Dick Woodcock gave the opening remarks. I loved hearing about the twists and turns of his career and how he made the most of unplanned opportunities. It was rather remarkable how diverse his contributions are (including an electronic Braille typewriter). Then he stressed the importance of communicating test results in ways that non-specialists can understand. He speculated on what psychological testing will look like in the future, focusing on integrative software that will guide test selection and interpretation in more sophisticated ways than has hitherto been possible. Given that he has been creating the future of assessment for decades now, I am betting that he is likely to be right. Later he graciously answered my questions about the WJ77 and how he came up with what I consider to be among the most ingenious test paradigms we have.

After a short break, Kevin McGrew gave us a wild ride of a talk about advances in CHC Theory. Actually it was more like a romp than a ride. I tried to keep track of all the interesting ideas for future research he presented, but there were so many I quickly lost count. The visuals were stunning and his energy was infectious. He offered a quick overview of new research from diverse fields about the overlooked importance of auditory processing (beyond the usual focus on phonemic awareness). Later he talked about his evolving conceptualization of the memory factors in CHC theory and the role of complexity in psychometric tests. My favorite part of the talk was a masterful presentation of information processing theory, judiciously supplemented with very clever animations.

After lunch, Cathy Fiorello gave one of the most thoughtful presentations I have ever heard. Instead of contrasting nomothetic and idiographic approaches to psychological assessment, Cathy stressed their integration. Most of the time, nomothetic interpretations are good first approximations and often are sufficient. However, certain test behaviors and other indicators signal that a more nuanced interpretation of the underlying processes of performance is warranted. Cathy asserted (and I agree) that well-trained and highly experienced practitioners can get very good at spotting unusual patterns of test performance that completely alter our interpretations of test scores. She called on her fellow scholars to develop and refine methods of assessing these patterns so that practitioners do not require many years of experience to develop their expertise. She was not merely balanced in her remarks; lip service to a sort of bland pluralism is an easy and rather lazy trick to seem wise. Instead, she offered fresh insight and nuance in her balanced and integrative approach to cognitive and neuropsychological assessment. That is, she did the hard work of offering clear guidelines on how to integrate nomothetic and idiographic methods, all the while frankly acknowledging the limits of what can be known.

]]>

The images from the poster are from a single exploratory model based on a clinical sample of 865 college students. The model was so big and complex I had to split the path diagram into two images:

]]>

It is true, mathematically, that the expected profile IS flat. However, this does not mean that flat profiles are common. There is a very large set of possible profiles and only a tiny fraction are perfectly flat. Profiles that are *nearly* flat are not particularly common, either. Variability is the norm.

Sometimes it helps to get a sense of just how uneven cognitive profiles typically are. That is, it is good to fine-tune our intuitions about the typical profile with many exemplars. Otherwise it is easy to convince ourselves that the reason that we see so many interesting profiles is that we only assess people with interesting problems.

If we use the correlation matrix from the WAIS-IV to randomly simulate multivariate normal profiles, we can see that even in the general population, flat, “plain-vanilla” profiles are relatively rare. There are features that draw the eye in most profiles.

If cognitive abilities were uncorrelated, profiles would be much more uneven than they are. But even with moderately strong positive correlations, there is still room for quite a bit of within-person variability.
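A rough simulation gives a feel for how much scatter to expect. The WAIS-IV subtest correlation matrix is not reproduced here, so this sketch substitutes the four WISC-IV index correlations quoted elsewhere in these posts as a stand-in; the 5-point cutoff for a "nearly flat" profile is likewise an arbitrary choice made for illustration.

```r
# Stand-in correlation matrix: the four WISC-IV index correlations quoted
# elsewhere in these posts (NOT the WAIS-IV subtest matrix, which is not given)
rho <- matrix(c(
  1.00, 0.59, 0.59, 0.37,
  0.59, 1.00, 0.48, 0.45,
  0.59, 0.48, 1.00, 0.39,
  0.37, 0.45, 0.39, 1.00), nrow = 4, byrow = TRUE)
set.seed(1)
n <- 10000
# Correlated standard normal scores via the Cholesky factor of rho
z <- matrix(rnorm(n * 4), ncol = 4) %*% chol(rho)
scores <- z * 15 + 100
# Scatter of each profile: highest score minus lowest score
scatter <- apply(scores, 1, max) - apply(scores, 1, min)
# Typical scatter, and the share of nearly flat profiles (range under 5 points)
median(scatter)
mean(scatter < 5)
```

With only four moderately correlated scores, the typical simulated profile still spans well over a standard deviation from its lowest to its highest score, and nearly flat profiles are a small minority.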

Let’s see what happens when we look at profiles that have the exact same Full Scale IQ (80, in this case). The conditional distributions of the remaining scores are seen in the “violin” plots. There is still considerable diversity of profile shape even though the Full Scale IQ is held constant.

Note that the supplemental subtests have wider conditional distributions because they are not included in the Full Scale IQ, not necessarily because they are less *g*-loaded.

]]>

The second thing I don’t like about *variance explained* is the whole “explained” business. As I mentioned in my last post, *variance explained* does not actually mean that we have explained anything, at least in a causal sense. That is, it does not imply that we know what is going on. It simply means that we can use one or more variables to predict things more accurately than before.

In many models, if *X* is correlated with *Y*, *X* can be said to “explain” variance in *Y* even though *X* does not really cause *Y*. However, in some situations the term *variance explained* is accurate in every sense:

In the model above, the arrow means that *X* really is a partial cause of *Y*. Why does *Y* vary? Because of variability in *X*, at least in part. In this example, 80% of *Y*’s variance is due to *X*, with the remaining variance due to something else (somewhat misleadingly termed *error*). It is not an “error” in that something is wrong or that someone is making a mistake. It is merely that which causes our predictions of *Y* to be off. *Prediction error* is probably not a single variable. It is likely to be the sum total of many influences.

Because *X* and *error* are uncorrelated z-scores in this example, the path coefficients are equal to the correlations with *Y*. Squaring the correlation coefficients yields the *variance explained*. The coefficients for *X* and *error* are actually the square roots of .8 and .2, respectively. Squaring the coefficients tells us that *X* explains 80% of the variance in *Y* and *error* explains the rest.

Okay, if *X* predicts *Y*, then the *variance explained* is equal to the correlation coefficient squared. Unfortunately, this is merely a formula. It does not help us understand what it means. Perhaps this visualization will help:

If you need to guess every value of *Y* but you know nothing about *Y* except that it has a mean of zero, then you should guess zero every time. You’ll be wrong most of the time, but pursuing other strategies will result in even larger errors. The variance of your prediction errors will be equal to the variance of *Y*. In the picture above, this corresponds to a regression line that passes through the mean of *Y* and has a slope of zero. No matter what *X* is, you guess that *Y* is zero. The squared vertical distance from *Y* to the line is represented by the translucent squares. The average area of the squares is the variance of *Y*.

If you happen to know the value of *X* each time you need to guess what *Y* will be, then you can use a regression equation to make a better guess. Your prediction of *Y* is called *Y*-hat (*Ŷ*):

When *X* and *Y* have the same variance, the slope of the regression line is equal to the correlation coefficient, 0.89. The distance from *Ŷ* (the predicted value of *Y*) to the actual value of *Y* is the prediction error. In the picture above, the variance of the prediction errors (0.2) is the average size of the squares when the slope is equal to the correlation coefficient.

Thus, when *X* is not used to predict *Y*, our prediction errors have a variance of 1. When we *do* use *X* to predict *Y*, the average size of the prediction errors shrinks from 1 to 0.2, an 80% reduction. This is what is meant when we say that “*X* explains 80% of the variance in *Y*.” It is the proportion by which the variance of the prediction errors shrinks.
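The shrinkage in prediction-error variance can be reproduced with simulated data. This is a minimal sketch that assumes *X* and *error* are independent standard normal variables, as in the example; the variable names are illustrative.

```r
set.seed(1)
n <- 100000
# X and the error term are uncorrelated standard normal variables
X <- rnorm(n)
error <- rnorm(n)
# Path coefficients are the square roots of .8 and .2, as in the text
Y <- sqrt(0.8) * X + sqrt(0.2) * error

# Guess zero every time: prediction errors have the variance of Y (about 1)
var_no_x <- mean((Y - 0)^2)
# Use X: the regression slope is about sqrt(.8) = 0.89
fit <- lm(Y ~ X)
var_with_x <- mean(residuals(fit)^2)

# Proportional reduction in prediction-error variance (about 0.80)
1 - var_with_x / var_no_x
# Squared correlation (about 0.80)
cor(X, Y)^2
```

Both quantities land near 0.80, which is the sense in which *X* "explains" 80% of the variance in *Y*.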

Suppose that we flip 50 coins and record how many heads there are. We do this over and over. The values we record constitute the variable *Y*. The number of heads we get each time we flip the 50 coins happens to have a binomial distribution. The mean of a binomial distribution is determined by the probability *p* of an event occurring on a single trial (i.e., getting a head on a single toss) and the number of events *k* (i.e., the number of coins thrown). As *k* increases, the binomial distribution begins to resemble the normal distribution. The probability *p* of getting a head on any one coin toss is 0.5 and the number of coins *k* is 50. The mean number of heads over the long run is:

$$\mu_Y = kp = 50 \times 0.5 = 25$$

The variance of the binomial distribution is:

$$\sigma_Y^2 = kp(1 - p) = 50 \times 0.5 \times 0.5 = 12.5$$

Before we toss the coins, we should guess that we will toss an average number of heads, 25. We will be wrong much of the time but our prediction errors will be as small as they can be, over the long run. The variance of our prediction errors is equal to the variance of *Y*, 12.5.

Now suppose that after tossing 80% of our coins (i.e., 40 coins), we count the number of heads. This value is recorded as variable *X*. The remaining 20% of the coins (10 coins) are then tossed and the total number of heads is counted from all 50 coins. We can use a regression equation to predict *Y* from *X*. The intercept will be the mean number of heads from the remaining 10 coins:

$$\hat{Y} = 10p + X = 5 + X$$

In the diagram below, each peg represents a coin toss. If the outcome is heads, the dot moves right. If the outcome is tails, the dot moves left. The purple line represents the probability distribution of *Y* before any coin has been tossed.

When the dot gets to the red line (after 40 tosses or 80% of the total), we can make a new guess as to what *Y* is going to be. This conditional distribution is represented by a blue line. The conditional distribution has a mean equal to *Ŷ* and a variance of 2.5 (the variance of the 10 remaining coins).

The variability in *Y* is caused by the outcomes of 50 coin tosses. If 80% of those coins are the variable *X*, then *X* explains 80% of the variance in *Y*. The remaining 10 coins represent the variability of *Y* that is not determined by *X* (i.e., the error term). They determine 20% of the variance in *Y*.

If *X* represented only the first 20 of 50 coins, then *X* would explain 40% of the variance in *Y*.
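The coin example is easy to simulate. The sketch below assumes fair, independent coins and uses base R's `rbinom`; the variable names are illustrative.

```r
set.seed(1)
n_rep <- 100000
# X: heads among the first 40 coins; the last 10 coins are tossed afterwards
X <- rbinom(n_rep, size = 40, prob = 0.5)
rest <- rbinom(n_rep, size = 10, prob = 0.5)
# Y: total heads across all 50 coins
Y <- X + rest

var(Y)            # close to 50 * 0.5 * 0.5 = 12.5
var(Y - (X + 5))  # close to 2.5, the variance of the last 10 coins
cor(X, Y)^2       # close to 0.8

# If X is only the first 20 coins, it explains about 40% of the variance
X20 <- rbinom(n_rep, size = 20, prob = 0.5)
Y20 <- X20 + rbinom(n_rep, size = 30, prob = 0.5)
cor(X20, Y20)^2   # close to 0.4
```

Because each coin contributes the same variance (0.25), the proportion of variance explained is simply the proportion of coins that make up *X*.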

]]>

- *Statistical significance*: This term is so universally hated I am surprised that we haven’t held a convention and banned its use. How many journalists have been misled by researchers’ technical use of *significance*? I wish we said something like “not merely random” or “probably not zero.”
- *Type I/Type II error*: It is hard to remember which is which because the terms don’t convey any clues as to what they mean. I wish more informative metaphors were used such as *false hit* and *false miss*.
- *Power*: *Statistical power* refers to the probability that the null hypothesis will be rejected, provided that the null hypothesis is false. The term is not self-explanatory and requires memorization! I wish we used a better term such as *true hit rate* or *false null rejection rate*. While we’re at it, *α* and *β* are not much better. *False hit rate* (or *true null rejection rate*) and *false miss rate* (or *false null retention rate*) would be easier to remember.
- *Prediction error*: The word *error* in English typically refers to an action that results in harm that could have been avoided if better choices had been made. In the context of statistical models, prediction errors are what you get wrong even though you have done everything right! I wish there were a word that referred to actions that were done in good faith yet resulted in unforeseeable harm. In this case, we already have a perfectly good substitute term that is widely used: *disturbance*. I suppose that the connotations of *disturbance* could generate different misunderstandings but in my estimation they are not as bad as those generated by *error*. I wish that we could just use the term *residuals* but that refers to something slightly different: the estimate of an error (residual:error::statistic:parameter). We can only know the errors if we know the true model parameters.
- *Variance explained*: This term works if the predictor is a cause of the criterion variable. However, when it is simply a correlate, it misleadingly suggests that we now understand what is going on. I wish the term were something more neutral such as *variance predicted*.
- *Moderator/Mediator*: At least in English, these terms sound so much alike that they are easily confused. I think that we should dump *moderator* along with related terms *interaction effect*, *simple main effect*, and *simple slope*. I think that the term *conditional effects* is more descriptive and straightforward.
- *Biased*: This word is hard to use in its technical sense when talking to non-statisticians. It sounds like we are talking about bigoted statistics! Unfortunately I can’t think of a good alternative to it (though I can think of some awkward ones like *stable inaccuracy*).
- *Degrees of freedom*: For me, this concept is extremely difficult to explain properly in an introductory course. Students are confused about what *degrees* have to do with it (or for that matter, *freedom*). I don’t know if I have a good replacement term (*independent dimensions*? *non-redundancy index*? *matrix rank*?).
- *True score*: This term sounds like it refers to the Aristotelian truth when in fact it is merely the long-term average score if there were no carryover effects of repeated measurement. Thus, a person’s true score on one IQ test might be quite different from the same person’s true score on another IQ test. Neither true score refers to the person’s “true cognitive ability.” To avoid this confusion, I would prefer something like the *individual expected value*, or IEV for short.
- *Reliability*: In typical usage, *reliability* refers to morally desirable traits such as trustworthiness and truthfulness. When statisticians refer to the reliability of scores or experimental results, to the untrained ear it probably sounds like we are talking about validity. I would prefer to talk about *stability*, *consistency*, or *precision* instead.

I am sure that there are many more!

]]>

Let’s sidestep some difficult questions about what exactly an “academic deficit” is and for the sake of convenience pretend that it is a score at least 1 standard deviation below the mean on a well normed test administered by a competent psychologist with good clinical skills.

Suppose that we start with the 9 core WJ III achievement tests (the answers will not be all that different with the new WJ IV):

| | Reading | Writing | Mathematics |
|---|---|---|---|
| Skills | Letter-Word Identification | Spelling | Calculation |
| Applications | Passage Comprehension | Writing Samples | Applied Problems |
| Fluency | Reading Fluency | Writing Fluency | Math Fluency |

What is the percentage of the population that does not have any score below 85? If we can assume that the scores are multivariate normal, the answer can be found using data simulation or via the cumulative distribution function of the multivariate normal distribution. I gave examples of both methods in the previous post. If we use the correlation matrix for the 6 to 9 age group of the WJ III NU, about 47% of the population has no academic scores below 85.

Using the same methods we can estimate what percent of the population has no academic scores below various thresholds. Subtracting these numbers from 100%, we can see that fairly large proportions have at least one low score.

Threshold | % with no scores below the threshold | % with at least one score below the threshold |
---|---|---|
85 | 47% | 53% |
80 | 63% | 37% |
75 | 77% | 23% |
70 | 87% | 13% |

The numbers in the table above include people with very low cognitive ability. It would be more informative if we could control for a person’s measured cognitive abilities.

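Suppose that an individual has index scores of exactly 100 for all 14 subtests that are used to calculate the WJ III GIA Extended. We can calculate the means and the covariance matrix of the achievement tests for all people with this particular cognitive profile. We will make use of the conditional multivariate normal distribution. As explained here (or here), we partition the academic tests and the cognitive predictor tests like so:

$$\begin{pmatrix}\mathbf{A}\\ \mathbf{C}\end{pmatrix}\sim N\left(\begin{pmatrix}\boldsymbol{\mu}_{A}\\ \boldsymbol{\mu}_{C}\end{pmatrix},\begin{pmatrix}\boldsymbol{\Sigma}_{A} & \boldsymbol{\Sigma}_{AC}\\ \boldsymbol{\Sigma}_{AC}^{\prime} & \boldsymbol{\Sigma}_{C}\end{pmatrix}\right)$$

- $\boldsymbol{\mu}_{A}$ and $\boldsymbol{\mu}_{C}$ are the mean vectors for the academic and cognitive variables, respectively.
- $\boldsymbol{\Sigma}_{A}$ and $\boldsymbol{\Sigma}_{C}$ are the covariance matrices of the academic and cognitive variables, respectively.
- $\boldsymbol{\Sigma}_{AC}$ is the matrix of covariances between the academic and cognitive variables.

If the cognitive variables have the vector of particular values $\mathbf{c}$, then the conditional mean vector of the academic variables is:

$$\boldsymbol{\mu}_{A\mid\mathbf{c}}=\boldsymbol{\mu}_{A}+\boldsymbol{\Sigma}_{AC}\boldsymbol{\Sigma}_{C}^{-1}\left(\mathbf{c}-\boldsymbol{\mu}_{C}\right)$$

The conditional covariance matrix:

$$\boldsymbol{\Sigma}_{A\mid\mathbf{c}}=\boldsymbol{\Sigma}_{A}-\boldsymbol{\Sigma}_{AC}\boldsymbol{\Sigma}_{C}^{-1}\boldsymbol{\Sigma}_{AC}^{\prime}$$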

If we can assume multivariate normality, we can use these equations to estimate the proportion of people with no scores below any threshold on any set of scores, conditioned on any set of predictor scores. In this example, about 51% of people with scores of exactly 100 on all 14 cognitive predictors have no scores below 85 on the 9 academic tests. About 96% of people with this cognitive profile have no scores below 70.
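Before applying the conditional mean and covariance to 23 tests at once, it may help to check them on a case small enough to verify by hand. The sketch below uses one academic test and one cognitive test; the correlation of 0.6 and the cognitive score of 70 are made-up values chosen only for illustration.

```r
# Hypothetical two-test example: position 1 = academic test, position 2 = cognitive test
mu <- c(100, 100)
sds <- c(15, 15)
r <- 0.6   # assumed correlation, for illustration only
R <- matrix(c(1, r, r, 1), nrow = 2)
sigma <- diag(sds) %*% R %*% diag(sds)

p <- 2     # position of the cognitive predictor
x <- 70    # observed cognitive score

# Blocks of the partitioned covariance matrix
cov_A  <- sigma[-p, -p, drop = FALSE]  # academic with academic
cov_C  <- sigma[p, p, drop = FALSE]    # cognitive with cognitive
cov_AC <- sigma[-p, p, drop = FALSE]   # academic with cognitive

# Conditional mean and covariance of the academic test
cond_mu <- c(mu[-p] + cov_AC %*% solve(cov_C) %*% (x - mu[p]))
cond_sigma <- cov_A - cov_AC %*% solve(cov_C) %*% t(cov_AC)
cond_sd <- sqrt(diag(cond_sigma))

cond_mu   # 82
cond_sd   # 12

# Proportion of people with a cognitive score of 70 who score below 85
pnorm(85, mean = cond_mu, sd = cond_sd)
```

With a single predictor, the matrix formulas reduce to simple regression: the conditional mean is 100 + 0.6 × (70 − 100) = 82, and the conditional standard deviation is 15 × √(1 − 0.6²) = 12. The final line comes out to about 0.60.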

Because there is an extremely large number of possible cognitive profiles, I cannot show what would happen with all of them. Instead, I will show what happens with all of the perfectly flat profiles from all 14 cognitive scores equal to 70 to all 14 cognitive scores equal to 130.

Here is what happens with the same procedure when the threshold is 70 for the academic scores:

Here is the R code I used to perform the calculations. You can adapt it to other situations fairly easily (different tests, thresholds, and profiles).

```r
library(mvtnorm)

#WJ III correlation matrix (14 cognitive tests followed by 9 achievement tests)
WJ <- matrix(c(
  1,0.49,0.31,0.46,0.57,0.28,0.37,0.77,0.36,0.15,0.24,0.49,0.25,0.39,0.61,0.6,0.53,0.53,0.5,0.41,0.43,0.57,0.28, #Verbal Comprehension
  0.49,1,0.27,0.32,0.47,0.26,0.32,0.42,0.25,0.21,0.2,0.41,0.21,0.28,0.38,0.43,0.31,0.36,0.33,0.25,0.29,0.4,0.18, #Visual-Auditory Learning
  0.31,0.27,1,0.25,0.33,0.18,0.21,0.28,0.13,0.16,0.1,0.33,0.13,0.17,0.25,0.22,0.18,0.21,0.19,0.13,0.25,0.31,0.11, #Spatial Relations
  0.46,0.32,0.25,1,0.36,0.17,0.26,0.44,0.19,0.13,0.26,0.31,0.18,0.36,0.4,0.36,0.32,0.29,0.31,0.27,0.22,0.33,0.2, #Sound Blending
  0.57,0.47,0.33,0.36,1,0.29,0.37,0.49,0.28,0.16,0.23,0.57,0.24,0.35,0.4,0.44,0.36,0.38,0.4,0.34,0.39,0.53,0.27, #Concept Formation
  0.28,0.26,0.18,0.17,0.29,1,0.35,0.25,0.36,0.17,0.27,0.29,0.53,0.22,0.37,0.32,0.52,0.42,0.32,0.49,0.42,0.37,0.61, #Visual Matching
  0.37,0.32,0.21,0.26,0.37,0.35,1,0.3,0.24,0.13,0.22,0.33,0.21,0.35,0.39,0.34,0.38,0.38,0.36,0.33,0.38,0.43,0.36, #Numbers Reversed
  0.77,0.42,0.28,0.44,0.49,0.25,0.3,1,0.37,0.15,0.23,0.43,0.23,0.37,0.56,0.55,0.51,0.47,0.47,0.39,0.36,0.51,0.26, #General Information
  0.36,0.25,0.13,0.19,0.28,0.36,0.24,0.37,1,0.1,0.22,0.21,0.38,0.26,0.26,0.33,0.4,0.28,0.27,0.39,0.21,0.25,0.32, #Retrieval Fluency
  0.15,0.21,0.16,0.13,0.16,0.17,0.13,0.15,0.1,1,0.06,0.16,0.17,0.09,0.11,0.09,0.13,0.1,0.12,0.13,0.07,0.12,0.07, #Picture Recognition
  0.24,0.2,0.1,0.26,0.23,0.27,0.22,0.23,0.22,0.06,1,0.22,0.35,0.2,0.16,0.22,0.25,0.21,0.19,0.26,0.17,0.19,0.21, #Auditory Attention
  0.49,0.41,0.33,0.31,0.57,0.29,0.33,0.43,0.21,0.16,0.22,1,0.2,0.3,0.33,0.38,0.29,0.31,0.3,0.25,0.42,0.47,0.25, #Analysis-Synthesis
  0.25,0.21,0.13,0.18,0.24,0.53,0.21,0.23,0.38,0.17,0.35,0.2,1,0.15,0.19,0.22,0.37,0.21,0.2,0.4,0.23,0.19,0.37, #Decision Speed
  0.39,0.28,0.17,0.36,0.35,0.22,0.35,0.37,0.26,0.09,0.2,0.3,0.15,1,0.39,0.36,0.32,0.3,0.3,0.3,0.25,0.33,0.23, #Memory for Words
  0.61,0.38,0.25,0.4,0.4,0.37,0.39,0.56,0.26,0.11,0.16,0.33,0.19,0.39,1,0.58,0.59,0.64,0.5,0.48,0.46,0.52,0.42, #Letter-Word Identification
  0.6,0.43,0.22,0.36,0.44,0.32,0.34,0.55,0.33,0.09,0.22,0.38,0.22,0.36,0.58,1,0.52,0.52,0.47,0.42,0.43,0.49,0.36, #Passage Comprehension
  0.53,0.31,0.18,0.32,0.36,0.52,0.38,0.51,0.4,0.13,0.25,0.29,0.37,0.32,0.59,0.52,1,0.58,0.48,0.65,0.42,0.43,0.59, #Reading Fluency
  0.53,0.36,0.21,0.29,0.38,0.42,0.38,0.47,0.28,0.1,0.21,0.31,0.21,0.3,0.64,0.52,0.58,1,0.5,0.49,0.46,0.47,0.49, #Spelling
  0.5,0.33,0.19,0.31,0.4,0.32,0.36,0.47,0.27,0.12,0.19,0.3,0.2,0.3,0.5,0.47,0.48,0.5,1,0.44,0.41,0.46,0.36, #Writing Samples
  0.41,0.25,0.13,0.27,0.34,0.49,0.33,0.39,0.39,0.13,0.26,0.25,0.4,0.3,0.48,0.42,0.65,0.49,0.44,1,0.38,0.37,0.55, #Writing Fluency
  0.43,0.29,0.25,0.22,0.39,0.42,0.38,0.36,0.21,0.07,0.17,0.42,0.23,0.25,0.46,0.43,0.42,0.46,0.41,0.38,1,0.57,0.51, #Calculation
  0.57,0.4,0.31,0.33,0.53,0.37,0.43,0.51,0.25,0.12,0.19,0.47,0.19,0.33,0.52,0.49,0.43,0.47,0.46,0.37,0.57,1,0.46, #Applied Problems
  0.28,0.18,0.11,0.2,0.27,0.61,0.36,0.26,0.32,0.07,0.21,0.25,0.37,0.23,0.42,0.36,0.59,0.49,0.36,0.55,0.51,0.46,1), #Math Fluency
  nrow=23, byrow=TRUE)

WJNames <- c("Verbal Comprehension", "Visual-Auditory Learning", "Spatial Relations",
             "Sound Blending", "Concept Formation", "Visual Matching",
             "Numbers Reversed", "General Information", "Retrieval Fluency",
             "Picture Recognition", "Auditory Attention", "Analysis-Synthesis",
             "Decision Speed", "Memory for Words", "Letter-Word Identification",
             "Passage Comprehension", "Reading Fluency", "Spelling",
             "Writing Samples", "Writing Fluency", "Calculation",
             "Applied Problems", "Math Fluency")
rownames(WJ) <- colnames(WJ) <- WJNames

#Number of tests
k<-length(WJNames)

#Means and standard deviations of tests
mu<-rep(100,k)
sd<-rep(15,k)

#Covariance matrix
sigma<-diag(sd)%*%WJ%*%diag(sd)
colnames(sigma)<-rownames(sigma)<-WJNames

#Vector identifying predictors (WJ Cog)
p<-seq(1,14)

#Threshold for low scores
Threshold<-85

#Proportion of population who have no scores below the threshold
pmvnorm(lower=rep(Threshold,length(WJNames[-p])),
        upper=rep(Inf,length(WJNames[-p])),
        sigma=sigma[-p,-p],
        mean=mu[-p])[1]

#Predictor test scores for an individual
x<-rep(100,length(p))
names(x)<-WJNames[p]

#Conditional means and covariance matrix
condMu<-c(mu[-p] + sigma[-p,p] %*% solve(sigma[p,p]) %*% (x-mu[p]))
condSigma<-sigma[-p,-p] - sigma[-p,p] %*% solve(sigma[p,p]) %*% sigma[p,-p]

#Proportion of people with the same predictor scores as this individual
#who have no scores below the threshold
pmvnorm(lower=rep(Threshold,length(WJNames[-p])),
        upper=rep(Inf,length(WJNames[-p])),
        sigma=condSigma,
        mean=condMu)[1]
```
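To run the same calculation across a range of flat profiles, as in the sweep from 70 to 130 described earlier, the conditional step can be wrapped in a function and applied to each profile level. What follows is a sketch rather than the code behind the original plots: `PropNoLowScores` is a hypothetical helper, and the 5-test correlation matrix (2 predictors, 3 outcomes, all correlations 0.5) is made up so that the block runs on its own. Substitute the real `mu`, `sigma`, and `p` to work with actual tests.

```r
library(mvtnorm)

# Hypothetical helper: proportion of people with predictor scores x
# who have no outcome scores below Threshold
PropNoLowScores <- function(x, mu, sigma, p, Threshold) {
  condMu <- c(mu[-p] + sigma[-p, p] %*% solve(sigma[p, p]) %*% (x - mu[p]))
  condSigma <- sigma[-p, -p] -
    sigma[-p, p] %*% solve(sigma[p, p]) %*% sigma[p, -p]
  # Guard against tiny floating-point asymmetry
  condSigma <- (condSigma + t(condSigma)) / 2
  nOutcomes <- length(condMu)
  pmvnorm(lower = rep(Threshold, nOutcomes),
          upper = rep(Inf, nOutcomes),
          mean = condMu,
          sigma = condSigma)[1]
}

# Made-up example: 5 tests that all correlate 0.5 with one another
nTests <- 5
toyCor <- matrix(0.5, nrow = nTests, ncol = nTests)
diag(toyCor) <- 1
toyMu <- rep(100, nTests)
toySigma <- 15^2 * toyCor
toyP <- 1:2   # the first two tests are the predictors

# Flat profiles: all predictors equal to 70, 80, ..., 130
profiles <- seq(70, 130, by = 10)
props <- sapply(profiles, function(level) {
  PropNoLowScores(x = rep(level, length(toyP)), mu = toyMu,
                  sigma = toySigma, p = toyP, Threshold = 85)
})
names(props) <- profiles
round(props, 2)
```

Because every correlation in the toy matrix is positive, the proportion with no low scores rises as the flat profile rises. Changing `Threshold` to 70 gives the second sweep.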

]]>

`NORMSDIST` will tell us the answer:

```
=NORMSDIST(-2)
=0.023
```

In R, the `pnorm` function gives the same answer:

```r
pnorm(-2)
```
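The same function also works directly on a score metric if the mean and standard deviation are supplied. As a small sketch, the lines below assume two common metrics: scaled scores (mean 10, SD 3) and index scores (mean 100, SD 15). The variable names are illustrative.

```r
# Proportion of a normal distribution at or below 2 SDs under the mean
zBased <- pnorm(-2)

# Scaled-score metric: a score of 4 is 2 SDs below a mean of 10
scaledBased <- pnorm(4, mean = 10, sd = 3)

# Index-score metric: a score of 70 is 2 SDs below a mean of 100
indexBased <- pnorm(70, mean = 100, sd = 15)

round(c(z = zBased, scaled = scaledBased, index = indexBased), 3)
```

All three values are identical because each score sits exactly 2 standard deviations below its mean.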

How unusual is it to have multiple scores below the threshold? The answer depends on how correlated the scores are. If we can assume that the scores are multivariate normal, Crawford and colleagues (2007) show us how to obtain reasonable estimates using simulated data. Here is a script in R that depends on the `mvtnorm` package. Suppose that the 10 subtests of the WAIS-IV have correlations as depicted below. Because the subtests have a mean of 10 and a standard deviation of 3, the scores are unusually low if 4 or lower.

```r
#WAIS-IV subtest names
WAISSubtests <- c("BD", "SI", "DS", "MR", "VO", "AR", "SS", "VP", "IN", "CD")

# WAIS-IV correlations
WAISCor <- rbind(
  c(1.00,0.49,0.45,0.54,0.45,0.50,0.41,0.64,0.44,0.40), #BD
  c(0.49,1.00,0.48,0.51,0.74,0.54,0.35,0.44,0.64,0.41), #SI
  c(0.45,0.48,1.00,0.47,0.50,0.60,0.40,0.40,0.43,0.45), #DS
  c(0.54,0.51,0.47,1.00,0.51,0.52,0.39,0.53,0.49,0.45), #MR
  c(0.45,0.74,0.50,0.51,1.00,0.57,0.34,0.42,0.73,0.41), #VO
  c(0.50,0.54,0.60,0.52,0.57,1.00,0.37,0.48,0.57,0.43), #AR
  c(0.41,0.35,0.40,0.39,0.34,0.37,1.00,0.38,0.34,0.65), #SS
  c(0.64,0.44,0.40,0.53,0.42,0.48,0.38,1.00,0.43,0.37), #VP
  c(0.44,0.64,0.43,0.49,0.73,0.57,0.34,0.43,1.00,0.34), #IN
  c(0.40,0.41,0.45,0.45,0.41,0.43,0.65,0.37,0.34,1.00)) #CD
rownames(WAISCor) <- colnames(WAISCor) <- WAISSubtests

#Means
WAISMeans<-rep(10,length(WAISSubtests))

#Standard deviations
WAISSD<-rep(3,length(WAISSubtests))

#Covariance Matrix (correlations scaled by the outer product of the SDs)
WAISCov<-WAISCor*WAISSD%*%t(WAISSD)

#Sample size
SampleSize<-1000000

#Load mvtnorm package
library(mvtnorm)

#Make simulated data
d<-rmvnorm(n=SampleSize,mean=WAISMeans,sigma=WAISCov)
#To make this more realistic, you can round all scores to the nearest integer (d<-round(d))

#Threshold for abnormality
Threshold<-4

#Which scores are less than or equal to threshold
Abnormal<- d<=Threshold

#Number of scores less than or equal to threshold
nAbnormal<-rowSums(Abnormal)

#Frequency distribution table
p<-c(table(nAbnormal)/SampleSize)

#Plot (barplot returns the bar midpoints, which are used to place the labels)
bp<-barplot(p,axes=FALSE,las=1,
            xlim=c(0,length(p)*1.2),ylim=c(0,1),
            bty="n",pch=16,col="royalblue2",
            xlab="Number of WAIS-IV subtest scores less than or equal to 4",
            ylab="Proportion")
axis(2,at=seq(0,1,0.1),las=1)
text(x=bp,y=p,labels=formatC(p,digits=2),cex=0.7,pos=3,adj=0.5)
```
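For comparison, it can be useful to see what the distribution would look like if the 10 subtests were uncorrelated. In that case (an unrealistic assumption, used here only as a baseline) the number of low scores follows a binomial distribution. The variable names below are illustrative.

```r
# Baseline: number of low scores if the 10 subtests were independent
nSubtests <- 10

# Probability that any one subtest score is 4 or lower (mean 10, SD 3)
pLow <- pnorm(4, mean = 10, sd = 3)

# Binomial probabilities of 0, 1, ..., 10 low scores
pIndependent <- dbinom(0:nSubtests, size = nSubtests, prob = pLow)
names(pIndependent) <- 0:nSubtests
round(pIndependent, 3)

# Probability of at least one low score under independence
pAtLeastOne <- 1 - pIndependent[["0"]]
pAtLeastOne
```

Under independence, roughly 1 person in 5 would have at least one low score. Because the actual subtests are positively correlated, the proportion of simulated cases with no low scores is higher than this baseline suggests, and low scores tend to cluster in the same people.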

The simulation method works very well, especially if the sample size is very large. An alternate method that gives more precise numbers is to estimate how much of the multivariate normal distribution is within certain bounds. That is, we find all of the regions of the multivariate normal distribution in which one and only one test is below a threshold and then add up all the probabilities. The process is repeated to find all regions in which two and only two tests are below a threshold. Repeat the process with 3 tests, 4 tests, and so on. This is tedious to do by hand but only takes a few lines of code to do automatically.

```r
AbnormalPrevalance<-function(Cor,Mean=0,SD=1,Threshold){
  require(mvtnorm)
  k<-nrow(Cor)
  p<-rep(0,k)
  #Threshold expressed as a z-score
  zThreshold<-(Threshold-Mean)/SD
  #n is the number of tests below the threshold
  for (n in 1:k){
    #All sets of n tests that could be the low ones
    combos<-combn(1:k,n)
    ncombos<-ncol(combos)
    for (i in 1:ncombos){
      #Tests in the set are bounded above by the threshold
      u<-rep(Inf,k)
      u[combos[,i]]<-zThreshold
      #All other tests are bounded below by the threshold
      l<-rep(-Inf,k)
      l[seq(1,k)[-combos[,i]]]<-zThreshold
      p[n]<-p[n]+pmvnorm(lower=l,upper=u,mean=rep(0,k),sigma=Cor)[1]
    }
  }
  #Proportion with no low scores is whatever remains
  p<-c(1-sum(p),p)
  names(p)<-0:k
  #barplot returns the bar midpoints, so the labels line up for any k
  bp<-barplot(p,axes=FALSE,las=1,xlim=c(0,length(p)*1.2),ylim=c(0,1),
              bty="n",pch=16,col="royalblue2",
              xlab=bquote("Number of scores less than or equal to " * .(Threshold)),
              ylab="Proportion")
  axis(2,at=seq(0,1,0.1),las=1)
  text(x=bp,y=p,labels=formatC(p,digits=2),cex=0.7,pos=3,adj=0.5)
  return(p)
}

Proportions<-AbnormalPrevalance(Cor=WAISCor,Mean=10,SD=3,Threshold=4)
```

Using this method, the results are nearly the same but slightly more accurate. If the number of tests is large, the code can take a long time to run.
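The reason for the slowdown is that the function evaluates one region of the multivariate normal distribution for every non-empty subset of tests, so the number of `pmvnorm` calls roughly doubles with each added test. A small sketch of the count (`nRegions` is a hypothetical helper; the battery sizes are arbitrary examples):

```r
# Number of regions evaluated for k tests: one per non-empty subset of tests
nRegions <- function(k) sum(choose(k, 1:k))  # equals 2^k - 1

tests <- c(5, 10, 15, 20, 23)
regionCounts <- data.frame(tests = tests,
                           regions = sapply(tests, nRegions))
regionCounts
```

With 10 tests there are 1,023 regions, but a battery of 23 tests would need more than 8 million, so the simulation method is the more practical choice for large batteries.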

]]>