R

Sample Pearson Correlation

Correlation is a method for testing whether a statistically significant relationship exists between two continuous variables. It is helpful when you want to know whether a relationship exists and whether it is worth investigating further with other statistical tests. There are several tests for correlating continuous variables; this guide focuses on the sample Pearson correlation, which measures the strength and direction of a linear relationship and is the most commonly used correlation test.

Below is the formula for the sample Pearson Correlation test.

Formula

$$r_{xy}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

Here, $r_{xy}$ is the Pearson correlation coefficient of the two continuous variables $x$ and $y$. Beginning with the numerator, $\sum_{i=1}^{n}$ denotes a sum over the observations, starting at observation $i = 1$ and ending at $n$, the number of observations in the sample. What is summed is the product of the two variables' deviations from their means for each observation.

Let’s break that down a little more. $(x_i-\bar{x})$ is the value of a specific observation $x_i$ in our first variable minus $\bar{x}$, the mean of the observations in our first variable $x$. This deviation is multiplied by $(y_i-\bar{y})$, the corresponding deviation for the same observation in the second variable $y$, and the products are added up across all observations.

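The denominator scales the numerator by the spread of $x$ and $y$. For $x$, each observation's deviation from the sample mean, $(x_i-\bar{x})$, is squared, the squared deviations are summed over all $n$ observations, and the square root of that sum is taken. The same is done for the second variable $y$, and the two square roots are multiplied together. Because the $n-1$ factors cancel, this is equivalent to dividing the sample covariance of $x$ and $y$ by the product of their sample standard deviations.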
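To make the formula concrete, here is a minimal sketch that computes $r_{xy}$ step by step and compares it with R's built-in cor(). The vectors x and y are small made-up values chosen only for illustration; they are not taken from the guide's dataset.

```r
# Hypothetical toy data, invented purely for illustration
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 15)

# Deviations of each observation from its variable's mean
dx <- x - mean(x)
dy <- y - mean(y)

# Numerator: sum of the products of the paired deviations
numerator <- sum(dx * dy)

# Denominator: product of the square roots of the summed squared deviations
denominator <- sqrt(sum(dx^2)) * sqrt(sum(dy^2))

r_manual <- numerator / denominator

# Compare against R's built-in implementation
r_builtin <- cor(x, y, method = "pearson")
print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point rounding, which is a quick way to check that each term of the formula has been read correctly.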

Below is the code for calculating the sample Pearson coefficient.

The code has two lines. The first uses the cor() function to correlate the variables students and teachers from the California schools dataset. We also specify use = "complete.obs" (complete observations), so that only observations with values for both variables are used.

The second line uses the cor.test() function to run a correlation test on the same variables. We specify the method of the correlation test as "pearson" (the sample Pearson correlation test). Alternatively, the method could be "kendall" or "spearman".
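The two calls described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the guide's exact code: the data frame name schools and all of its values are hypothetical stand-ins for the California schools data, with one missing value added so that use = "complete.obs" has something to drop.

```r
# Hypothetical stand-in for the California schools data (invented values);
# one NA is included so that use = "complete.obs" has an effect
schools <- data.frame(
  students = c(195, 240, 1550, 243, 1335, 137, NA, 888),
  teachers = c(10.9, 11.1, 82.9, 14.0, 71.5, 6.4, 10.0, 42.5)
)

# First call: correlation coefficient only, using rows where both values are present
r <- cor(schools$students, schools$teachers, use = "complete.obs")

# Second call: Pearson correlation test on the same two variables
# (cor.test() drops incomplete pairs on its own)
result <- cor.test(schools$students, schools$teachers, method = "pearson")

print(r)
print(result)
```

Changing method to "kendall" or "spearman" in the second call would run the corresponding rank-based test instead.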

Output

The cor() function's output tells us that the correlation between students and teachers is positive and very high (0.9971134). However, cor() cannot tell us whether this relationship is statistically significant. For that, let’s focus on the cor.test() function's output. As in the cor() output, we can see Pearson's correlation coefficient at the bottom, which shows a strong positive relationship between the number of students and teachers. The coefficient ranges from -1 to 1. A value of 0 means there is no linear relationship, while 1 (or -1) is a perfect positive (or negative) relationship between the variables; this is rare and often a sign for concern.

The cor.test() output also contains additional information. The significance value (p-value) is a very small number, far below our 0.05 threshold. This indicates that there is a statistically significant relationship between the number of students and teachers in the dataset. We can also see the confidence interval, Student's t statistic, and degrees of freedom.
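The quantities printed by cor.test() can also be pulled out of the returned object individually, which is handy when they need to be reported or compared against a threshold in code. The sketch below uses hypothetical toy vectors (hours_absent and test_score are invented names and values, not part of the California schools data) and happens to produce a negative coefficient, since Pearson's r is negative when one variable falls as the other rises.

```r
# Hypothetical toy data, invented for illustration only
hours_absent <- c(0, 2, 3, 5, 8, 10, 12)
test_score   <- c(95, 90, 88, 80, 70, 66, 55)

result <- cor.test(hours_absent, test_score, method = "pearson")

result$estimate    # Pearson's r (negative here: scores fall as absences rise)
result$p.value     # significance value, compared against the 0.05 threshold
result$conf.int    # 95% confidence interval for r
result$statistic   # Student's t statistic
result$parameter   # degrees of freedom (n - 2)

# A simple decision rule at the 0.05 level
significant <- result$p.value < 0.05
print(significant)
```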