Correlating variables is a method for testing whether a statistically significant relationship exists between continuous variables. This is helpful if we want to know whether a relationship exists and whether we should investigate it further with other statistical tests. There are many tests for correlating continuous variables, and this guide focuses on the sample Pearson correlation, which measures the strength of the linear relationship between two variables. The sample Pearson correlation is the most commonly used correlation test and is the default in SAS.
Below is the formula for the sample Pearson Correlation test.
Formula
$$r_{xy}=\frac{\sum_{i=1}^{n}{(x_i\ -\bar{x})(y_i\ -\bar{y})\ }}{\sqrt{\sum_{i=1}^{n}{(x_i\ -\bar{x})}^2}\sqrt{\sum_{i=1}^{n}{(y_i\ -\bar{y})}^2}}$$
Here $r_{xy}$ is the sample Pearson correlation coefficient of the two continuous variables $x$ and $y$. Beginning with the numerator, $\sum_{i=1}^{n}$ denotes a sum over the observations $i = 1, \dots, n$, where $n$ is the number of observations in the sample. What is summed, for each observation, is the product of the two variables' deviations from their means.
Let’s break that down a little more. $(x_i - \bar{x})$ is the value of a specific observation $x_i$ of our first variable minus $\bar{x}$, the mean of the observations of that variable. This deviation is multiplied by $(y_i - \bar{y})$, the corresponding deviation of the same observation for the second variable $y$.
The denominator is the product of two square roots, one for each variable. For $x$, the sample mean $\bar{x}$ is subtracted from each observation $x_i$, the result is squared, the squares are summed over all $n$ observations, and the square root of that sum is taken. This is repeated for the second variable $y$. Each square root is proportional to that variable's sample standard deviation, so the denominator rescales the numerator to lie between $-1$ and $1$.
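To make the formula concrete, here is a small worked example. The five data points are made up for illustration and are not taken from the SLID data.
$$x = (1, 2, 3, 4, 5), \quad y = (2, 4, 5, 4, 5), \quad \bar{x} = 3, \quad \bar{y} = 4$$
$$\sum_{i=1}^{5}(x_i - \bar{x})(y_i - \bar{y}) = (-2)(-2) + (-1)(0) + (0)(1) + (1)(0) + (2)(1) = 6$$
$$\sum_{i=1}^{5}(x_i - \bar{x})^2 = 4 + 1 + 0 + 1 + 4 = 10, \quad \sum_{i=1}^{5}(y_i - \bar{y})^2 = 4 + 0 + 1 + 0 + 1 = 6$$
$$r_{xy} = \frac{6}{\sqrt{10}\sqrt{6}} = \frac{6}{\sqrt{60}} \approx 0.775$$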
Below is the code for calculating the sample Pearson coefficient.
Code
PROC CORR DATA=SLID;  /* Pearson correlation is the default */
  VAR wages age;      /* variables to correlate */
RUN;
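PROC (procedure) CORR (correlation) computes correlations for the dataset named in DATA=, here SLID. The VAR (variable) statement lists the variables to correlate, wages and age, and the RUN command executes the procedure. Because no correlation type is requested, SAS computes the Pearson coefficient by default.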
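If the SLID data is not at hand, the same procedure can be tried on a small dataset typed in directly. The sketch below is only an illustration: the dataset name toy, the variable names x and y, and the five data points are all made up. The PEARSON option requests the Pearson coefficient explicitly, which should give the same result as the default.

```sas
/* Hypothetical toy data: names and values are made up for illustration */
DATA toy;
  INPUT x y;
  DATALINES;
1 2
2 4
3 5
4 4
5 5
;
RUN;

/* Request the Pearson coefficient explicitly (it is also the default) */
PROC CORR DATA=toy PEARSON;
  VAR x y;
RUN;
```

For these five points the formula gives $r_{xy} = 6/\sqrt{60} \approx 0.775$, so the coefficient reported for x and y should be close to that value.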
Output
[Output table: PROC CORR results for wages and age]
The output above shows the results of the sample Pearson correlation test between the variables wages and age. The table has one row for each variable, wages and age, and each cell shows the correlation coefficient and the significance level for a pair of variables, including each variable correlated with itself.
Let’s focus on the variable wages. Its sample Pearson correlation coefficient with age is 0.36146, a positive relationship of moderate strength. The coefficient ranges from -1 to 1. A value of 0 means no linear relationship exists, 1 is a perfect positive relationship, and -1 is a perfect negative relationship (perfect relationships are rare and often a sign for concern). The significance value is <0.0001, which is far below our 0.05 threshold. This indicates there is a statistically significant relationship between wages and age in the dataset.
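As a sketch of where the significance value comes from: the standard test for a Pearson correlation, which as far as we are aware is the one PROC CORR reports, treats the statistic below as coming from a $t$ distribution with $n-2$ degrees of freedom under the null hypothesis of no correlation. The sample size $n$ is not stated in this guide, so no numeric value is worked out here.
$$t = r_{xy}\sqrt{\frac{n-2}{1-r_{xy}^{2}}}$$
For a fixed coefficient, a larger $n$ gives a larger $|t|$ and therefore a smaller significance value, which is why a moderate coefficient such as 0.36146 can still be highly significant in a large sample.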