Statistics

Karl Pearson on why the idea of the correlation coefficient, not the formula, was the real breakthrough

The idea of correlation (i.e., mutual influence/intimate connection), indeed even the word correlation, existed for centuries before Francis Galton. Galton’s (1888) revolutionary idea was not that correlation exists but that it can be quantified. The correlation coefficient most often used today bears the name of statistician Karl Pearson, Galton’s friend and biographer. Though Pearson refined Galton’s formulas, providing them with a lasting and secure mathematical foundation, Pearson (1930) was quite clear that it was the idea of the correlation coefficient, not the formula, that was the real breakthrough:

Up to 1889 men of science had thought only in terms of causation, in future they were to admit another working category, that of correlation, and thus open to quantitative analysis wide fields of medical, psychological and sociological research. Turning to the writings of Turgot and Condorcet, who felt convinced that mathematics were applicable to social phenomena, we realize to-day how little progress in that direction was possible because they lacked the key—correlation—to the treasure chamber. Condorcet often and Laplace occasionally failed because this idea of correlation was not in their minds. Much of Quetelet’s work and that of the earlier (and many of the modern) anthropologists is sterile for like reasons.

Galton turning over two different problems in his mind reached the conception of correlation: A is not the sole cause of B, but it contributes to the production of B; there may be other, many or few, causes, some of which we do not know and may never know. Are we then to exclude from mathematical analysis all such cases of incomplete causation? Galton’s answer was: “No, we must endeavor to find a quantitative measure of this degree of partial causation.” This measure of partial causation was the germ of the broad category—that of correlation, which was to replace not only in the minds of many of us the old category of causation, but deeply to influence our outlook on the universe. (pp. 1–2)

References

Galton, F. (1888). Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London, 45, 135–145.

Pearson, K. (1930). The life, letters and labours of Francis Galton: Vol. III. Researches of middle life. Cambridge, England: Cambridge University Press.

Cognitive Assessment, Psychometrics, Statistics, Tutorial, Video

Conditional normal distributions provide useful information in psychological assessment

Conditional Normal Distribution

Conditional normal distributions are really useful in psychological assessment. We can use them to answer questions like:

  • How unusual is it for someone with a vocabulary score of 120 to have a score of 90 or lower on reading comprehension?
  • If that person also has a score of 80 on a test of working memory capacity, how much does the risk of scoring 90 or lower on reading comprehension increase?

What follows might be mathematically daunting. Just let it wash over you if it becomes confusing. At the end, there is a video in which I will show how to use a spreadsheet that will do all the calculations.

Unconditional Normal Distributions

Suppose that variable Y represents reading comprehension test scores. Here is a fancy way of saying Y is normally distributed with a mean of 100 and a standard deviation of 15:

Y\sim N(100,15^2)

In this notation, “~” means is distributed as, and N means normally distributed with a particular mean (μ) and variance (σ²).

If we know literally nothing about a person from this population, our best guess is that the person’s reading comprehension score is at the population mean. One way to say this is that the person’s expected value on reading comprehension is the population mean:

E(Y)=\mu_Y = 100

The 95% confidence interval around this guess is:

95\%\, \text{CI} = \mu_Y \pm z_{95\%} \sigma_Y

95\%\, \text{CI} \approx 100 \pm 1.96*15 = 70.6 \text{ to } 129.4
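
If you would like to verify this in R, here is a quick sketch (qnorm returns the z-score that cuts off a given proportion of the standard normal curve):

mu <- 100
sigma <- 15
mu + c(-1, 1) * qnorm(0.975) * sigma  # 70.6 to 129.4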

Unconditional Normal Distribution with 95% CI

Conditional Normal Distributions

Simple Linear Regression

Now, suppose that we know one thing about the person: the person’s score on a vocabulary test. We can let X represent the vocabulary score, and we will assume its distribution is the same as that of Y:

X\sim N(100,15^2)

If we know that this person scored 120 on vocabulary (X), what is our best guess as to what the person scored on reading comprehension (Y)? This guess is a conditional expected value. It is “conditional” in the sense that the expected value of Y depends on what value X has. The pipe symbol “|” is used to note a condition like so:

E(Y|X=120)

This means, “What is our best guess for Y if X is 120?”

What if we don’t want to be specific about the value of X but want to refer to any particular value of X? Oddly enough, it is traditional to use the lowercase x for that. So, X refers to the variable as a whole and x refers to any particular value of variable X. So if I know that variable X happens to be a particular value x, the expected value of Y is:

E(Y|X=x)=\sigma_Y \rho_{XY}\dfrac{x-\mu_X}{\sigma_X}+\mu_Y

where ρXY is the correlation between X and Y.

You might recognize that this is a linear regression formula and that:

E(Y|X=x)=\hat{Y}

where “Y-hat” (Ŷ) is the predicted value of Y when X is known.

Let’s assume that the relationship between X and Y is bivariate normal like in the image at the top of the post:

\begin{bmatrix}X\\Y\end{bmatrix}\sim N\left(\begin{bmatrix}\mu_X\\ \mu_Y\end{bmatrix},\begin{bmatrix}\sigma_X^2&\rho_{XY}\sigma_X\sigma_Y\\ \rho_{XY}\sigma_X\sigma_Y&\sigma_Y^2\end{bmatrix}\right)

The first term in the parentheses is the vector of means and the second term (the square matrix in the brackets) is the covariance matrix of X and Y. It is not necessary to understand the notation. The main point is that X and Y are both normal, they have a linear relationship, and the conditional variance of Y at any value of X is the same.

The conditional standard deviation of Y at any particular value of X is:

\sigma_{Y|X=x}=\sigma_Y\sqrt{1-\rho_{XY}^2}

This is the standard deviation of the conditional normal distribution. In the language of regression, it is the standard error of the estimate (σe). It is the standard deviation of the residuals (errors). Residuals are simply the amount by which your guesses differ from the actual values.

e = y - E(Y|X=x)=y-\hat{Y}

So,

\sigma_{Y|X=x}=\sigma_e

So, putting all this together, we can answer our question:

How unusual is it for someone with a vocabulary score of 120 to have a score of 90 or lower on reading comprehension?

The expected value of Y (Ŷ) is:

E(Y|X=120)=15\rho_{XY}\dfrac{120-100}{15}+100

Suppose that the correlation is 0.5. Therefore,

E(Y|X=120)=15*0.5\dfrac{120-100}{15}+100=110

This means that among all the people with a vocabulary score of 120, the average is 110 on reading comprehension. Now, how far off from that is 90?

e= y - \hat{Y}=90-110=-20

What is the standard error of the estimate?

\sigma_{Y|X=x}=\sigma_e=\sigma_Y\sqrt{1-\rho_{XY}^2}

\sigma_{Y|X=x}=\sigma_e=15\sqrt{1-0.5^2}\approx12.99

Dividing the residual by the standard error of the estimate (the standard deviation of the conditional normal distribution) gives us a z-score. It represents how far from expectations this individual is in standard deviation units.

z=\dfrac{e}{\sigma_e} \approx\dfrac{-20}{12.99}\approx -1.54

Using the standard normal cumulative distribution function (Φ) gives us the proportion of people scoring 90 or less on reading comprehension (given a vocabulary score of 120).

\Phi(z)\approx\Phi(-1.54)\approx 0.06

In Microsoft Excel, the standard normal cumulative distribution function is NORMSDIST. Thus, entering this into any cell will give the answer:

=NORMSDIST(-1.54)
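
In newer versions of Excel, the equivalent function is NORM.S.DIST(-1.54, TRUE). If you prefer R, here is a minimal sketch of the whole calculation (pnorm is the standard normal cumulative distribution function; the numbers come from the example above):

# Proportion scoring 90 or less on Y, given X = 120
mu <- 100; sigma <- 15; rho <- 0.5
y_hat <- sigma * rho * (120 - mu) / sigma + mu  # conditional mean: 110
sigma_e <- sigma * sqrt(1 - rho^2)              # standard error of the estimate: 12.99
pnorm((90 - y_hat) / sigma_e)                   # 0.06 (z = -1.54)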

Conditional normal distribution when Vocabulary = 120

Multiple Regression

What proportion of people score 90 or less on reading comprehension if their vocabulary is 120 but their working memory capacity is 80?

Let’s call vocabulary X1 and working memory capacity X2. Let’s suppose they correlate at 0.3. The correlation matrix among the predictors (RX) is:

\mathbf{R_X}=\begin{bmatrix}1&\rho_{12}\\ \rho_{12}&1\end{bmatrix}=\begin{bmatrix}1&0.3\\ 0.3&1\end{bmatrix}

The validity coefficients are the correlations of Y with both predictors (RXY):

\mathbf{R}_{XY}=\begin{bmatrix}\rho_{Y1}\\ \rho_{Y2}\end{bmatrix}=\begin{bmatrix}0.5\\ 0.4\end{bmatrix}

The standardized regression coefficients (β) are:

\pmb{\mathbf{\beta}}=\mathbf{R_{X}}^{-1}\mathbf{R}_{XY}\approx\begin{bmatrix}0.418\\ 0.275\end{bmatrix}

Unstandardized coefficients can be obtained by multiplying the standardized coefficients by the standard deviation of Y (σY) and dividing by the standard deviation of the predictors (σX):

\mathbf{b}=\sigma_Y\pmb{\mathbf{\beta}}/\pmb{\mathbf{\sigma}}_X

However, in this case all the variables have the same metric and thus the unstandardized and standardized coefficients are the same.

The vector of predictor means (μX) is used to calculate the intercept (b0):

b_0=\mu_Y-\mathbf{b}' \pmb{\mathbf{\mu}}_X

b_0\approx 100-\begin{bmatrix}0.418\\ 0.275\end{bmatrix}^{'} \begin{bmatrix}100\\ 100\end{bmatrix}\approx 30.769

The predicted score when vocabulary is 120 and working memory capacity is 80 is:

\hat{Y}=b_0 + b_1 X_1 + b_2 X_2

\hat{Y}\approx 30.769+0.418*120+0.275*80\approx 102.9

The error in this case is 90 - 102.9 = -12.9.

The multiple R² is calculated from the standardized regression coefficients and the validity coefficients:

R^2 = \pmb{\mathbf{\beta}}'\pmb{\mathbf{R}}_{XY}\approx\begin{bmatrix}0.418\\ 0.275\end{bmatrix}^{'} \begin{bmatrix}0.5\\ 0.4\end{bmatrix}\approx0.319

The standard error of the estimate is thus:

\sigma_e=\sigma_Y\sqrt{1-R^2}\approx 15\sqrt{1-0.319}\approx 12.38

The proportion of people with vocabulary = 120 and working memory capacity = 80 who score 90 or less is:

\Phi\left(\dfrac{e}{\sigma_e}\right)\approx\Phi\left(\dfrac{-12.9}{12.38}\right)\approx 0.15
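
If you would rather check these matrix calculations in R than in a spreadsheet, here is a minimal sketch (solve() applied to the predictor correlation matrix and the validity coefficients yields the standardized coefficients):

Rx <- matrix(c(1, 0.3,
               0.3, 1), nrow = 2)     # predictor intercorrelations
Rxy <- c(0.5, 0.4)                    # validity coefficients
beta <- solve(Rx, Rxy)                # standardized coefficients: 0.418, 0.275
b0 <- 100 - sum(beta * c(100, 100))   # intercept: 30.769 (all SDs equal, so b = beta)
y_hat <- b0 + sum(beta * c(120, 80))  # predicted score: 102.9
R2 <- sum(beta * Rxy)                 # multiple R-squared: 0.319
sigma_e <- 15 * sqrt(1 - R2)          # standard error of the estimate: 12.38
pnorm((90 - y_hat) / sigma_e)         # 0.15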

Here is a spreadsheet that automates these calculations.

Multiple Regression Spreadsheet

I explain how to use this spreadsheet in this YouTube video:

Psychometrics, Statistics, Tutorial

A Gentle, Non-Technical Introduction to Factor Analysis

When measuring characteristics of physical objects, there may be some disagreement about the best methods to use but there is little disagreement about which dimensions are being measured. We know that we are measuring length when we use a ruler and we know that we are measuring temperature when we use a thermometer. It is true that heating some materials makes them expand but we are virtually never confused about whether heat and length represent distinct dimensions that are independent of each other. That is, they are independent of each other in the sense that things can be cold and long, cold and short, hot and long, or hot and short.

Unfortunately, we are not nearly as clear about what we are measuring when we attempt to measure psychological dimensions such as personality traits, motivations, beliefs, attitudes, and cognitive abilities. Psychologists often disagree not only about what to name these dimensions but also about how many dimensions there are to measure. For example, you might think that there exists a personality trait called niceness. Another person might disagree with you, arguing that niceness is a vague term that lumps together 2 related but distinguishable traits called friendliness and kindness. Another person could claim that kindness is too general and that we must separate kindness with friends from kindness with strangers.

As you might imagine, these kinds of arguments can quickly lead to hypothesizing the existence of as many different traits as our imaginations can generate. The result would be a hopeless confusion among psychological researchers because they would have no way to agree on what to measure so that they can build upon one another’s findings. Fortunately, there are ways to put some limits on the number of psychological dimensions and come to some degree of consensus about what should be measured. One of the most commonly used of such methods is called factor analysis.

Although the mathematics of factor analysis is complicated, the logic behind it is not difficult to understand. The assumption behind factor analysis is that things that co-occur tend to have a common cause (but not always). For example, fevers, sore throats, stuffy noses, coughs, and sneezes tend to occur at roughly the same time in the same person. Often, they are caused by the same thing, namely, the virus that causes the common cold. Note that although the virus is one thing, its manifestations are quite diverse. In psychological assessment research, we measure a diverse set of abilities, behaviors and symptoms and attempt to deduce which underlying dimensions cause or account for the variations in behavior and symptoms we observe in large groups of people. We measure the relations between various behaviors, symptoms, and test scores with correlation coefficients and use factor analysis to discover patterns of correlation coefficients that suggest the existence of underlying psychological dimensions.

All else being equal, a simple theory is better than a complicated theory. Therefore, factor analysis helps us discover the smallest number of psychological dimensions (i.e., factors) that can account for the correlation patterns in the various behaviors, symptoms, and test scores we observe. For example, imagine that we create 4 different tests that would measure people’s knowledge of vocabulary, grammar, arithmetic, and geometry. If the correlations between all of these tests were 0 (i.e., high scorers on one test are no more likely to score high on the other tests than low scorers), then the factor analysis would suggest to us that we have measured 4 distinct abilities. In the picture below, the correlations between all the tests are displayed in the table. Below that, the theoretical model that would be implied is that there are 4 abilities (shown as ellipses) that influence performance on 4 tests (shown as rectangles). The numbers beside the arrows imply that the abilities and the tests have high but imperfect correlations of 0.9.

Independent Abilities

Of course, you probably recognize that it is very unlikely that the correlations between these tests would be 0. Therefore, imagine that the correlation between the vocabulary and grammar tests is quite high (i.e., high scorers on vocabulary are likely to also score high on grammar and low scorers on vocabulary are likely to score low on grammar). The correlation between arithmetic and geometry is also high. Furthermore, the correlations between the language tests and the mathematics tests are 0. Factor analysis would suggest that we have measured not 4 distinct abilities but rather 2 abilities. Researchers interpreting the results of the factor analysis would have to use their best judgment to decide what to call these 2 abilities. In this case, it would seem reasonable to call them language ability and mathematical ability. These 2 abilities (shown below as ellipses) influence performance on 4 tests (shown as rectangles).

Independent Factors

Now imagine that the correlations between all 4 tests are equally high. That is, for example, vocabulary is just as strongly correlated with geometry as it is with grammar. In this case, factor analysis would suggest that the simplest explanation for this pattern of correlations is that there is just 1 factor that causes all of these tests to be equally correlated. We might call this factor general academic ability.

General Factor

In reality, if you were to actually measure these 4 abilities, the results would not be so clear. It is likely that all of the correlations would be positive and substantially above 0. It is also likely that the language subtests would correlate more strongly with each other than with the mathematical subtests. In such a case, factor analysis would suggest that language and mathematical abilities are distinct but not entirely independent of each other. That is, language abilities and mathematics abilities are substantially correlated with each other, suggesting that a general academic (or intellectual) ability influences performance in all academic areas. In this model, abilities are arranged in hierarchies, with general abilities influencing narrow abilities.

Hierarchical Factors

Exploratory Factor Analysis

Factor analysis can help researchers decide how best to summarize large amounts of information about people using just a few scores. For example, when we ask parents to complete questionnaires about behavior problems their children might have, the questionnaires can have hundreds of items. It would take too long and would be too confusing to review every item. Factor analysis can simplify the information while minimizing the loss of detail. Here is an example of a short questionnaire that factor analysis can be used to summarize.

On a scale of 1 to 5, compared to other children his or her age, my child:

  1. gets in fights frequently at school
  2. is defiant to adults
  3. is very impulsive
  4. has stomachaches frequently
  5. is anxious about many things
  6. appears sad much of the time

If we give this questionnaire to a large, representative sample of parents, we can calculate the correlations between the items:

                                         1    2    3    4    5
1. gets in fights frequently at school
2. is defiant to adults                 .81
3. is very impulsive                    .79  .75
4. has stomachaches frequently          .42  .38  .36
5. is anxious about many things         .39  .34  .34  .77
6. appears sad much of the time         .37  .34  .32  .77  .74

Using this set of correlation coefficients, factor analysis suggests that there are 2 factors being measured by this behavior rating scale. The logic of factor analysis suggests that the reason items 1-3 have high correlations with each other is that each of them has a high correlation with the first factor. Similarly, items 4-6 have high correlations with each other because they have high correlations with the second factor. The correlations that the items have with the hypothesized factors are called factor loadings. The factor loadings can be seen in the chart below:

                                        Factor 1   Factor 2
1. gets in fights frequently at school     .91        .03
2. is defiant to adults                    .88       -.01
3. is very impulsive                       .86       -.01
4. has stomachaches frequently             .02        .89
5. is anxious about many things            .01        .86
6. appears sad much of the time           -.02        .87

Factor analysis tells us which items “load” on which factors but it cannot interpret the meaning of the factors. Usually researchers look at all of the items that load on a factor and use their intuition or knowledge of theory to identify what the items have in common. In this case, Factor 1 could receive any number of names such as Conduct Problems, Acting Out, or Externalizing Behaviors. Likewise, Factor 2 could be called Mood Problems, Negative Affectivity, or Internalizing Behaviors. Thus, the problems on this behavior rating scale can be summarized fairly efficiently with just 2 scores. In this example, a reduction of 6 scores to 2 scores may not seem terribly useful. In actual behavior rating scales, factor analysis can reduce the overwhelming complexity of hundreds of different behavior problems to a more manageable number of scores that help professionals more easily conceptualize individual cases.

It should be noted that factor analysis also calculates the correlation among factors. If a large number of factors are identified and there are substantial correlations (i.e., significantly larger than 0) among factors, this new correlation matrix can be factor analyzed also to obtain second-order factors. These factors, in turn, can be analyzed to obtain third-order factors. Theoretically, it is possible to have even higher order factors but most researchers rarely find it necessary to go beyond third-order factors. The g-factor from intelligence test data is an example of a third-order factor that emerges because all tests of cognitive abilities are positively correlated. In our example above, the 2 factors have a correlation of .46, suggesting that children who have externalizing problems are also at risk of having internalizing problems. It is therefore reasonable to calculate a second-order factor score that measures the overall level of behavior problems.

This example illustrates the most commonly used type of factor analysis: exploratory factor analysis. Exploratory factor analysis is helpful when we wish to summarize data efficiently, we are not sure how many factors are present in our data, or we are not sure which items load on which factors.
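
For readers who want to experiment, here is an illustrative sketch in R (not part of the original tutorial): it simulates responses with the two-factor structure described above and recovers the factors with the built-in factanal function. The loadings and the .46 factor correlation are taken from the example; everything else is made up for the demonstration.

# Simulate 6 items with the two-factor structure above, then run an EFA
set.seed(1)
n <- 10000
externalizing <- rnorm(n)
internalizing <- 0.46 * externalizing + sqrt(1 - 0.46^2) * rnorm(n)  # r = .46
loadings <- c(.91, .88, .86, .89, .86, .87)
items <- sapply(1:6, function(i) {
  f <- if (i <= 3) externalizing else internalizing
  loadings[i] * f + sqrt(1 - loadings[i]^2) * rnorm(n)  # item = loading * factor + error
})
# An oblique (promax) rotation allows the two factors to correlate
factanal(items, factors = 2, rotation = "promax")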

Confirmatory Factor Analysis

Confirmatory factor analysis is a method that researchers can use to test highly specific hypotheses. For example, a researcher might want to know if the 2 different types of items on the WISC-IV Digit Span subtest measure the same ability or 2 different abilities. On the Digits Forward type of item, the child must repeat a string of digits in the same order in which they were heard. On the Digits Backward type of item, the child must repeat the string of digits in reverse order. Some researchers believe that repeating numbers verbatim measures auditory short-term memory storage capacity and that repeating numbers in reverse order measures executive control, the ability to allocate attentional resources efficiently to solve multi-step problems. Typically, clinicians add the raw scores of both types of items to produce a single score. If the 2 item types measure different abilities, adding the raw scores together is like adding apples and orangutans. If, however, they measure the same ability, adding the scores together is valid and will produce a more reliable score than using separate scores.

To test this hypothesis, we can use confirmatory factor analysis to see if the 2 item types measure different abilities. We would need to identify or invent several tests that are likely to measure the 2 separate abilities that we believe are measured by the 2 types of Digit Span items. Usually, using 3 tests per factor is sufficient.

Next, we specify the hypotheses, or models, we wish to test:

  1. All of the tests measure the same ability. A graphical representation of a hypothesis in confirmatory factor analysis is called a path diagram. Tests are drawn with rectangles and hypothetical factors are drawn with ovals. The correlations between tests and factors are drawn with arrows. The path diagram for this hypothesis would look like this:

WM

  2. Both Digits Forward and Digits Backward measure short-term memory storage capacity and are distinct from executive control. The path diagram would look like this (the curved arrow allows for the possibility that the 2 factors might be correlated):

WM1

  3. Digits Forward and Digits Backward measure different abilities. The path diagram would look like this:

WM2

Confirmatory factor analysis produces a number of statistics, called fit statistics, that tell us which of the models or hypotheses we tested are most in agreement with the data. Studying the results, we can select the best model or perhaps generate a new model if none of them provides a good “fit” with the data. With structural equation modeling, a procedure that is very similar to confirmatory factor analysis, we can test extremely complex hypotheses about the structure of psychological variables.
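
As an illustration (again, not part of the original tutorial), here is how the first two competing models might be specified with the lavaan package in R. The test names (DF1-DF3 for the Digits Forward-like tests, DB1-DB3 for the Digits Backward-like tests) and the data frame dat are hypothetical placeholders:

library(lavaan)

# Model 1: one common factor (WM) behind all six tests
m1 <- 'WM =~ DF1 + DF2 + DF3 + DB1 + DB2 + DB3'

# Model 2: two correlated factors, short-term memory (STM) and executive control (EC)
m2 <- 'STM =~ DF1 + DF2 + DF3
       EC  =~ DB1 + DB2 + DB3'

fit1 <- cfa(m1, data = dat)
fit2 <- cfa(m2, data = dat)
fitMeasures(fit1, c("cfi", "rmsea"))  # fit statistics for a single model
anova(fit1, fit2)                     # compare the competing models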

This post is a revised version of a tutorial I originally prepared for Cohen & Swerdlik’s Psychological Testing and Assessment: An Introduction to Tests and Measurement.

Cognitive Assessment, Psychometrics, R, Statistics

Bifactor Model in 3D

I was playing around with a Bifactor Model and found no elegant way to do it in 2D. So here is my attempt to do it in 3D:

Bifactor

My code in R:

library(rgl)      # interactive 3D graphics
library(heplots)  # for arrow3d

# Helper functions: vector norm and unit vector pointing from a to b
vNorm <- function(x) sqrt(sum(x * x))
vUnit <- function(a, b) (b - a) / vNorm(b - a)

r3dDefaults$windowRect <- c(10, 40, 700, 700)

open3d()
nBarbs <- 20

# Positions of the three specific factors (s1-s3), the general factor (g),
# and the outcome variable (o)
s1 <- c(3, 0, 6)
s2 <- c(12, 0, 6)
s3 <- c(21, 0, 6)
g  <- c(12, 0, -6)
o  <- c(12, -12, 6)

# Offsets used to place each latent variable's three indicators
iDist  <- c(0, 0, -6)
iSpace <- c(3, 0, 0)

# Indicator positions, one column per indicator
i1 <- cbind(s1 - iSpace + iDist, s1 + iDist, s1 + iSpace + iDist)
i2 <- cbind(s2 - iSpace + iDist, s2 + iDist, s2 + iSpace + iDist)
i3 <- cbind(s3 - iSpace + iDist, s3 + iDist, s3 + iSpace + iDist)
io <- cbind(o - iSpace + iDist, o + iDist, o + iSpace + iDist)

# Draw the indicators as cubes
for (i in 1:3) {
  shade3d(translate3d(cube3d(col = "gray80"), i1[1, i], i1[2, i], i1[3, i]))
  shade3d(translate3d(cube3d(col = "gray60"), i2[1, i], i2[2, i], i2[3, i]))
  shade3d(translate3d(cube3d(col = "gray40"), i3[1, i], i3[2, i], i3[3, i]))
  shade3d(translate3d(cube3d(col = "gray80"), io[1, i], io[2, i], io[3, i]))
}

# Draw the latent variables as spheres
spheres3d(s1, col = "gray80", point_antialias = TRUE, smooth = TRUE)
spheres3d(s2, col = "gray60", point_antialias = TRUE, smooth = TRUE)
spheres3d(s3, col = "gray40", point_antialias = TRUE, smooth = TRUE)
spheres3d(g,  col = "black",  point_antialias = TRUE, smooth = TRUE)
spheres3d(o,  col = "gray90", point_antialias = TRUE, smooth = TRUE)

# Loadings: arrows from each latent variable to its indicators
for (i in 1:3) {
  arrow3d(s1, i1[, i] + c(0, 0, 1), color = "gray80", n = nBarbs, barblen = 0.2, lwd = 2)
  arrow3d(s2, i2[, i] + c(0, 0, 1), color = "gray60", n = nBarbs, barblen = 0.2, lwd = 2)
  arrow3d(s3, i3[, i] + c(0, 0, 1), color = "gray40", n = nBarbs, barblen = 0.2, lwd = 2)
  arrow3d(o,  io[, i] + c(0, 0, 1), color = "gray90", n = nBarbs, barblen = 0.2, lwd = 2)
}

# The general factor loads on all nine indicators
for (i in 0:8) {
  arrow3d(g, c(i * 3, 0, -1), color = "black", n = nBarbs, barblen = 0.15, lwd = 2)
  # text3d(x = i * 3, y = -1.3, 0, paste0("T", i))
}

# Paths from the specific factors and the general factor to the outcome
arrow3d(s1, o - vUnit(s1, o), color = "gray80", n = nBarbs, barblen = 0.1, lwd = 2)
arrow3d(s2, o - vUnit(s2, o), color = "gray60", n = nBarbs, barblen = 0.1, lwd = 2)
arrow3d(s3, o - vUnit(s3, o), color = "gray40", n = nBarbs, barblen = 0.1, lwd = 2)
arrow3d(g,  o - vUnit(g, o),  color = "black",  n = nBarbs, barblen = 0.1, lwd = 2)

# Labels
text3d(s1 + c(0, 0, 2), texts = "S1", font = 1, family = "serif")
text3d(s2 + c(0, 0, 2), texts = "S2", font = 1, family = "serif")
text3d(s3 + c(0, 0, 2), texts = "S3", font = 1, family = "serif")
text3d(o + c(0, 0, 2), texts = "Outcome", font = 1, family = "serif")
text3d(g + c(0, 0, -2), texts = "g", font = 3, family = "serif")

# Spin the model around the vertical axis
if (!rgl.useNULL())
  play3d(spin3d(axis = c(0, 0, 1), rpm = 10), duration = 6)
Cognitive Assessment, Death Penalty, Psychometrics, Statistics, Tutorial, Uncategorized, Video

Video Tutorial: Misunderstanding Regression to the Mean

One of the most widely misunderstood statistical concepts is regression to the mean. In this video tutorial, I address common false beliefs about regression to the mean and answer the following questions:

  1. What is regression to the mean?
  2. Do variables become less variable each time they are measured?
  3. Does regression to the mean happen all the time or just in certain situations?
  4. Does repeated testing cause people to come closer and closer to the mean?
  5. How is regression to the mean relevant in death penalty cases?

Standard
Statistics

Visualizing Covariance

Correlation? I get it. I have a gut-level sense of what it is. Covariance? Somehow it just eludes me. I mean, I know the formulas and I can give you a conceptual definition of it—but its meaning never really sank in.

One thing about covariance that always seemed counter-intuitive to me is that covariance between two variables of unequal variance can sometimes be larger than the variance of the variable with less variance. For example, if X has a variance of 9, Y has a variance of 64 and the correlation between X and Y is 0.5, the covariance between X and Y is 12. How can X have a larger covariance with Y than its own variance (i.e., its covariance with itself)? Never made sense to me.
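
For the record, the arithmetic is just the definitional formula, covariance = correlation × SD of X × SD of Y. A one-line check in R:

0.5 * sqrt(9) * sqrt(64)  # 0.5 * 3 * 8 = 12, larger than var(X) = 9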

I started using Mathematica recently and decided to make an interactive visualization tool that shows how large covariance between two variables is. Click here (You might be prompted to download the Wolfram CDF Player plugin for your browser. It may take a while to load.). Play with the sliders at the bottom.

The area of the blue square is equal to the variance of X. The area of the red square is equal to the variance of Y. The pink rectangle (which is partially occluded by the purple rectangle) shows how large the covariance could be if X and Y were perfectly correlated. The area of the purple rectangle is equal to the covariance between X and Y. The ratio of the area of the purple rectangle to the area of the pink rectangle is equal to the correlation between X and Y.

I’m not sure why, but this visualization has made me feel better about covariance. It’s like we’re friends now. 😉
