Stata

A brief introduction to Stata

Cross Tabulation

A crosstabulation, or contingency table, shows the relationship between two or more categorical variables by recording how many observations fall into each combination of categories. A crosstabulation needs no formula of its own: at its core, it is simply counts and percentages of observations.

A Chi-squared test often accompanies a crosstabulation to test whether a statistically significant relationship exists between the variables. The test indicates whether an association is present; it does not by itself measure how strong that association is.

As a general rule, the dependent variable in a crosstabulation and Chi-squared test is represented in the columns, while the independent variable is represented in the rows. In this example, our two variables are sex, the independent variable, and language, the dependent variable. To analyze other variables, replace sex and language in the command with the names of other variables in your dataset.

Formula

$$ \chi^2 = {\sum{{(O_i-E_i)^2}\over E_i}} $$

 

Above is the formula for the Chi-squared test. The statistic $\chi^2$ (the Greek letter chi, squared) is a sum ($\sum$) over $i$, where $i$ indexes the cells of the crosstabulation. $O_i$ is the observed count in cell $i$, that is, the count that actually appears in the dataset, and $E_i$ is the expected count for that cell if the two variables were unrelated. For each cell, $E_i$ is subtracted from $O_i$, the difference is squared, and the result is divided by $E_i$. Adding these values across all cells gives the final Chi-squared ($\chi^2$) value.
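
To see how the formula works, consider a small worked example with made-up counts (hypothetical numbers, not taken from the dataset used in this guide). Suppose a 2-by-2 table has observed counts of 10 and 20 in the first row and 20 and 10 in the second row, so every row total and column total is 30 and the overall total is 60. The expected count in each cell is the row total times the column total, divided by the overall total:

$$ E_i = {{30 \times 30} \over 60} = 15 $$

Every observed count differs from its expected count by 5, and there are four cells, so:

$$ \chi^2 = 4 \times {{5^2} \over 15} = {100 \over 15} \approx 6.67 $$

This table has $(2-1) \times (2-1) = 1$ degree of freedom.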

Below is the code for conducting a crosstabulation and calculating the Chi-squared test.

Code

tab sex language, row col chi

 

The command tab (short for tabulate) produces a contingency table of the variables sex and language. The options row and col add row and column percentages to each cell, and the option chi adds the Pearson Chi-squared test to the output.
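
If you would like to try the command without the dataset used in this guide, the sketch below runs the same kind of analysis on the auto dataset that ships with Stata. The choice of dataset and of the variables foreign and rep78 is an assumption made purely for illustration. The option names are written out in full here, and the second command adds two optional extras: expected counts and Cramér's V, a measure of the strength of association.

```stata
* Load an example dataset that ships with Stata
* (an assumption for illustration, not the guide's dataset)
sysuse auto, clear

* Crosstabulation with row and column percentages and the Chi-squared test,
* using the unabbreviated command and option names
tabulate foreign rep78, row column chi2

* Optional: show expected counts and Cramer's V alongside the test
tabulate foreign rep78, expected chi2 V
```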

 

Output

 

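[Image: Stata output for the command above, with the crosstabulation labeled A and the Chi-squared results labeled B]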


 

A

In the output table, Stata shows the crosstabulation of sex by language. Because sex is listed first in the command, it appears in the rows, while language, listed second, appears in the columns. Because we specified the options row and col, each cell also reports a row percentage (the cell count as a share of its row total) and a column percentage (the cell count as a share of its column total).

B

At the bottom of the crosstabulation, Stata gives the results of the Chi-squared test. Stata reports the Pearson Chi-squared test (Pearson chi2), with the degrees of freedom in parentheses, followed by the calculated Chi-squared value and the p-value (Pr), which is the significance level of the test.

The Chi-squared value is 0.244, with 2 degrees of freedom and a significance level of 0.885. Using 0.05 as our cutoff, 0.885 is well above 0.05, so the result of the Chi-squared test is not statistically significant. We therefore find no statistically significant relationship between sex and language in this dataset.
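
The reported significance level can also be checked by hand. The following is a minimal sketch that uses only the Chi-squared value and degrees of freedom reported above: chi2tail() is Stata's function for the upper-tail probability of the Chi-squared distribution, and the r() results are the values that tabulate stores after a two-way table with the Chi-squared option.

```stata
* Upper-tail probability of a Chi-squared distribution
* with 2 degrees of freedom, evaluated at 0.244
display chi2tail(2, 0.244)

* After running the crosstabulation with the chi option,
* the test results are also available as stored results
tab sex language, chi

* Chi-squared statistic
display r(chi2)

* Significance level (Pr)
display r(p)

* Degrees of freedom: (number of rows - 1) * (number of columns - 1)
display (r(r) - 1) * (r(c) - 1)
```

The first command should display a value of about 0.885, which matches the Pr value in the output.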