
R

Sample Pearson Correlation

Correlating variables is a method to test whether a statistically significant relationship exists between continuous variables. This is helpful if you want to know whether a relationship exists and whether it should be investigated further with other statistical tests. There are many tests for correlating continuous variables, and in this guide we will focus on the sample Pearson correlation, which is the most commonly used correlation test.

Below is the formula for the sample Pearson Correlation test.

Formula

$$r_{xy}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

Where $r_{xy}$ is the sample Pearson correlation coefficient of the two continuous variables $x$ and $y$. Beginning with the numerator, $\sum_{i=1}^{n}$ means that we sum over every observation $i$, from the first ($i=1$) to the last ($n$, the number of observations in the sample). What is summed is the product of each observation's deviations from the means of the two variables.

Let’s break that down a little more. $(x_i-\bar{x})$ is the value of a specific observation $x_i$ in our first variable minus $\bar{x}$, the mean of the observations in our first variable $x$. This value is multiplied by $(y_i-\bar{y})$, the same deviation calculated for the matching observation of the second variable $y$.

The denominator scales this sum by the spread of the variables $x$ and $y$. For $x$, each deviation $(x_i-\bar{x})$ from the sample mean is squared, the squares are summed over all $n$ observations, and the square root of that sum is taken. This is repeated for the second variable $y$, and the two square roots are multiplied together. Each square root is proportional to the sample standard deviation of its variable.
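To make the formula concrete, here is a minimal sketch that computes the coefficient step by step and compares it with R's built-in cor function. The vectors x and y are hypothetical toy data invented for illustration; they are not taken from the SLID dataset.

```r
# Hypothetical toy data (not from SLID), used only to illustrate the formula
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 15)

# Deviations of each observation from its variable's mean
dx <- x - mean(x)
dy <- y - mean(y)

# Numerator: sum of the products of the paired deviations
numerator <- sum(dx * dy)

# Denominator: product of the square roots of the summed squared deviations
denominator <- sqrt(sum(dx^2)) * sqrt(sum(dy^2))

r_manual <- numerator / denominator
r_manual

# The hand calculation should agree with the built-in function
all.equal(r_manual, cor(x, y))
```

The names dx, dy, numerator, denominator and r_manual simply mirror the pieces of the formula; in practice you would call cor directly.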

Below is the code for calculating the sample Pearson coefficient. 

Code

# Assumption: the SLID dataset is loaded from the carData package
library(carData)

cor(SLID[,c("wages", "age")], use="complete.obs")

cor.test(SLID\$wages, SLID\$age, method = "pearson")

 

The code above first loads the SLID dataset, assumed here to come from the carData package, and then runs two analyses. The first uses cor (correlate) to correlate the variables wages and age from the dataset SLID. We also specify use="complete.obs" (complete observations), so that rows with a missing value in either variable are left out of the calculation.

The second uses cor.test (correlation test) on the variables wages and age from the same SLID dataset. We then specify the method of the correlation test as pearson (sample Pearson correlation test).
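The sketch below illustrates what use="complete.obs" changes and how the individual results of cor.test can be pulled out by name. The data frame toy is hypothetical and only borrows the column names wages and age; it is not the SLID data.

```r
# Hypothetical toy data with missing wages (not from SLID)
toy <- data.frame(
  wages = c(10.5, 12.0, NA, 15.2, 18.9, 20.1, NA, 25.0),
  age   = c(22, 25, 31, 34, 40, 45, 50, 58)
)

# Without use = "complete.obs", any missing value makes the result NA
r_default <- cor(toy[["wages"]], toy[["age"]])
r_default

# With use = "complete.obs", rows containing an NA are dropped first
r_complete <- cor(toy[, c("wages", "age")], use = "complete.obs")
r_complete

# cor.test() drops incomplete pairs itself; its result can be read by name
res <- cor.test(toy[["wages"]], toy[["age"]], method = "pearson")
res[["estimate"]]   # sample correlation coefficient
res[["p.value"]]    # p-value of the test
res[["conf.int"]]   # 95 percent confidence interval
```

Storing the test in an object such as res (a name chosen here for illustration) is convenient when the coefficient or p-value is needed later in a script rather than only printed.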

Output

 

A

          wages       age
wages 1.0000000 0.3614635
age   0.3614635 1.0000000

 

 

B

 

Pearson's product-moment correlation
data:  SLID\$wages and SLID\$age
t = 24.959, df = 4145, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3347088 0.3876359
sample estimates:
      cor 
0.3614635 

 

A

Output A above is the correlation matrix produced by cor for the variables wages and age. Each row and column corresponds to one variable, and each cell holds the sample Pearson correlation coefficient for that pair, including each variable correlated with itself (always 1). The matrix does not report significance levels; those appear in output B from cor.test.

 

Let’s focus on the variable wages. The sample Pearson correlation coefficient of 0.36146 indicates a positive, moderate relationship with age. The coefficient ranges from -1 to 1. A value of 0 means no linear relationship exists, while 1 or -1 is a perfect positive or negative relationship (this is rare and often a sign for concern) between the variables. In output B, the p-value is < 2.2e-16, which is far below our 0.05 threshold. This indicates there is a statistically significant relationship between wages and age in the dataset.
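As a rough check on how the significance result relates to the coefficient, the sketch below recomputes the t statistic and p-value from the numbers printed in output B, using the standard relationship $t = r\sqrt{df}/\sqrt{1-r^2}$ for a Pearson correlation test. The variable names are illustrative, and the values are copied from the output rather than recalculated from SLID.

```r
# Values copied from output B of cor.test()
r  <- 0.3614635   # sample correlation coefficient
df <- 4145        # degrees of freedom (complete pairs minus 2)

# t statistic for testing whether the true correlation equals 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

t_stat              # should be close to the t = 24.959 in output B
p_value < 2.2e-16   # consistent with the p-value line in output B
```

With more than four thousand complete observations, even a moderate coefficient produces a very large t statistic, which is why the p-value is so small here.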