
Stata

A brief introduction to Stata

Sample Pearson Correlation

Correlating variables is a method to test whether a statistically significant linear relationship exists between continuous variables. This is helpful if we want to know whether a relationship exists and whether we should investigate it further with other statistical tests. There are many tests for correlating continuous variables, and in this guide we will focus on the sample Pearson correlation. The sample Pearson correlation is the most commonly used correlation test and is the coefficient that Stata's pwcorr and correlate commands report.

Below is the formula for the sample Pearson Correlation test.

Formula

$$r_{xy}=\frac{\sum_{i=1}^{n}{(x_i\ -\bar{x})(y_i\ -\bar{y})\ }}{\sqrt{\sum_{i=1}^{n}{(x_i\ -\bar{x})}^2}\sqrt{\sum_{i=1}^{n}{(y_i\ -\bar{y})}^2}}$$

Here $r_{xy}$ is the sample Pearson correlation coefficient of the two continuous variables $x$ and $y$. Beginning with the numerator, $\sum_{i=1}^{n}$ means that we sum over every observation $i$, from the first ($i=1$) to the last ($i=n$), where $n$ is the number of observations in the sample. What is summed is the product of each observation's deviations from the means of the two variables.

Let’s break that down a little more. $(x_i - \bar{x})$ is the value of a specific observation $x_i$ in our first variable minus $\bar{x}$, the mean of the observations in our first variable $x$. This deviation is multiplied by $(y_i - \bar{y})$, the same deviation calculated for the second variable $y$, and the products are added up across all observations.

The denominator is the product of two square roots, one for each variable. For $x$, the sample mean ($\bar{x}$) is subtracted from each observation $x_i$, the result is squared, the squares are summed over all $n$ observations, and the square root of that sum is taken. This is repeated for the second variable $y$. Each square root equals the sample standard deviation of that variable multiplied by $\sqrt{n-1}$, so dividing by the denominator rescales the numerator to a value between -1 and 1.
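To make the formula concrete, the following is a minimal sketch that applies it step by step in Stata and compares the result with the built-in correlate command. The three-observation dataset and all variable and scalar names (x, y, dx, dy, sxy, and so on) are made up for illustration and are not part of this guide's data.

```stata
* Hypothetical toy data: three observations of x and y
clear
input x y
1 1
2 3
3 2
end

* Sample means of each variable
quietly summarize x
scalar xbar = r(mean)
quietly summarize y
scalar ybar = r(mean)

* Deviations from the mean, their product, and their squares
generate double dx   = x - xbar
generate double dy   = y - ybar
generate double dxdy = dx * dy
generate double dx2  = dx^2
generate double dy2  = dy^2

* Sum each column over all observations
quietly summarize dxdy
scalar sxy = r(sum)
quietly summarize dx2
scalar sxx = r(sum)
quietly summarize dy2
scalar syy = r(sum)

* Numerator divided by the product of the two square roots
scalar r_manual = sxy / (sqrt(sxx) * sqrt(syy))
display "manual r    = " r_manual

* Built-in result for comparison
quietly correlate x y
display "correlate r = " r(rho)
```

With these toy values the numerator is 1 and each sum of squares is 2, so both lines should display a coefficient of about 0.5.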

Below is the Stata command for calculating the sample Pearson coefficient.

Code

pwcorr wages age education, sig

 

The command pwcorr calculates the sample Pearson correlation coefficient for each pair of the variables wages, age, and education. The sig option adds the significance level (p-value) of each coefficient to the output.
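A few other pwcorr options are often useful alongside sig. The sketch below assumes a dataset containing wages, age, and education is already loaded, as in the command above; the 0.05 level passed to star() is only an illustrative choice.

```stata
* Also show the number of observations used for each pair
pwcorr wages age education, sig obs

* Put a star next to coefficients significant at the 0.05 level
pwcorr wages age education, sig star(0.05)

* correlate reports coefficients only and drops any observation
* that has a missing value in any of the listed variables
correlate wages age education
```

The difference matters when data are missing: pwcorr uses every observation available for each pair of variables, so the sample size can differ from pair to pair, while correlate uses one common sample for all pairs.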

Output

 

[Image: Stata output table of pwcorr results for wages, age, and education]

The output table above shows the results of the sample Pearson correlation between the variables wages, age, and education. The rows are broken into three sections (wages, age, and education). Each section shows the correlation coefficient, with its significance level beneath it, for each pair of variables. The table also includes each variable correlated with itself, which always gives a coefficient of 1.

Let’s focus on the variable wages. The sample Pearson correlation coefficient of 0.362 indicates a positive, moderate relationship between wages and age. The coefficient ranges from -1 to 1. A value of 0 means no linear relationship exists, 1 is a perfect positive relationship, and -1 is a perfect negative relationship (a perfect relationship between two different variables is rare and often a sign for concern). The significance value is reported as 0.000, meaning the p-value rounds to zero at the displayed precision and is far below our 0.05 threshold. This indicates there is a statistically significant relationship between wages and age in the dataset.
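To see where the significance value comes from, here is a rough sketch that rebuilds it from the coefficient and the sample size using a t statistic with n - 2 degrees of freedom. Because this guide's dataset is not available here, the sketch assumes Stata's bundled auto dataset and its variables price and mpg as stand-ins; the scalar names (rho_hat, n_obs, t_stat, p_val) are hypothetical.

```stata
* Assumption: the bundled auto dataset stands in for the guide's data
sysuse auto, clear

* Coefficient and sample size from the built-in command
quietly correlate price mpg
scalar rho_hat = r(rho)
scalar n_obs   = r(N)

* t statistic for testing that the correlation is zero
scalar t_stat = rho_hat * sqrt((n_obs - 2) / (1 - rho_hat^2))

* Two-sided p-value from the t distribution with n - 2 degrees of freedom
scalar p_val = 2 * ttail(n_obs - 2, abs(t_stat))

display "r = " %6.3f rho_hat "   t = " %6.3f t_stat "   p = " %6.4f p_val

* The significance level printed here should match p_val
pwcorr price mpg, sig
```

A larger sample or a coefficient further from 0 produces a larger t statistic and therefore a smaller p-value, which is why a moderate coefficient can still be highly significant in a large dataset.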