
R

Cross Tabulation

A crosstabulation, or contingency table, shows the relationship between two or more variables by recording the frequency of observations that share each combination of characteristics. Crosstabulation tables tell us a great deal about the relationship between the included variables. A crosstabulation needs no formula, since at its core it consists of counts and percentages of observations.

The chi-squared test often accompanies a crosstabulation to test whether a statistically significant relationship exists between the variables. The test indicates whether a relationship is present; it does not by itself measure how strong that relationship is.

As a general rule, the dependent variable in a crosstabulation and chi-squared test is represented in the columns, while the independent variable is represented in the rows. In this example, our two variables are county, the independent variable, and high_income, the dependent variable. To analyze other variables, simply replace county and high_income with other variables in the dataset.
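The row and column convention, along with the counts and percentages a crosstabulation reports, can be sketched as follows. This is a minimal illustration only: the data frame `df`, the county names, and the counts are hypothetical stand-ins and not the dataset used in this guide.

```r
# Hypothetical stand-in data (names and counts are made up for illustration).
df <- data.frame(
  county = rep(c("Adams", "Brown"), times = c(50, 50)),
  high_income = c(rep(c("no", "yes"), times = c(40, 10)),
                  rep(c("no", "yes"), times = c(25, 25)))
)

# The first argument to table() fills the rows (independent variable),
# the second fills the columns (dependent variable).
counts <- table(county = df$county, high_income = df$high_income)
counts

# Row percentages: within each county, the share in each income group.
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
row_pct
```

Using `margin = 1` makes each row sum to 100, which lets you compare the dependent variable across categories of the independent variable.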

Formula

$$ \chi^2 = {\sum{{(O_i-E_i)^2}\over E_i}} $$

 

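Above is the formula for the chi-squared test. Here $\chi^2$ (the Greek letter chi, squared) equals the sum ($\sum$) over $i$, where $i$ indexes the cells of the crosstabulation. $O_i$ is the observed count in cell $i$, that is, the count that actually exists in the dataset, and $E_i$ is the expected count, the count we would predict for that cell if the two variables were unrelated. For each cell, the expected count is subtracted from the observed count and the difference is squared. That squared difference is then divided by $E_i$, and the results are summed across all cells to give the final chi-squared ($\chi^2$) value. The statistic takes its name from the chi-squared distribution, which it approximately follows when the variables are unrelated.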
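To make the formula concrete, here is a small sketch that computes the statistic by hand in R. The 2×2 table of counts is hypothetical and chosen only so the arithmetic is easy to follow; the expected counts use the usual (row total × column total) / grand total rule.

```r
# Hypothetical 2x2 table of observed counts (made up for illustration).
observed <- matrix(c(20, 30,
                     30, 20),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("a", "b"),
                                   outcome = c("no", "yes")))

# Expected counts if the variables were unrelated:
# (row total * column total) / grand total for every cell.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic: sum over cells of (O - E)^2 / E.
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value from the chi-squared distribution.
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

chi_sq
dof
p_value
```

Every expected count here is 25, so each cell contributes 25/25 = 1 and the statistic is 4 with 1 degree of freedom. Calling `chisq.test(observed, correct = FALSE)` gives the same statistic; by default R applies a continuity correction to 2×2 tables, which is why `correct = FALSE` is needed for the match.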

Below is the code for conducting a crosstabulation and calculating the chi-squared test.
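A minimal sketch of what such code could look like is given here. The column names county and high_income come from this guide, but the data frame name `df` and all of the values in it are hypothetical stand-ins so that the example runs on its own. The two calls that matter are `table()` and `chisq.test()`.

```r
# Hypothetical stand-in data; replace df with your own data frame.
df <- data.frame(
  county = rep(c("A", "B", "C"), times = c(40, 40, 40)),
  high_income = c(rep(c("no", "yes"), times = c(30, 10)),
                  rep(c("no", "yes"), times = c(20, 20)),
                  rep(c("no", "yes"), times = c(10, 30)))
)

# Line 1: crosstabulation of county (rows) by high_income (columns).
crosstab <- table(df$county, df$high_income)
crosstab

# Line 2: Pearson's chi-squared test on the crosstabulation.
result <- chisq.test(crosstab)
result
```

Storing the table in an object such as `crosstab` is a design choice of this sketch; passing the `table()` call directly to `chisq.test()` works as well.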

 

The code consists of two lines. The first line uses the table() function to build the crosstabulation of the county and high_income variables in our dataset.

The second line conducts a chi-squared test (using the chisq.test() function) on that crosstabulation.

Output

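[Output screenshot: panel A shows the crosstabulation table; panel B shows the Pearson's chi-squared test result.]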

A

In the first output table, RStudio shows the crosstabulation of county by high_income.

B

In the second output table, Pearson's chi-squared test, we can see that the chi-squared value is 149.32, the degrees of freedom are 44, and the significance level (p-value) is 2.199e-13. This is scientific notation for $2.199 \times 10^{-13}$, that is, 0.0000000000002199, a value very close to zero. Since we use the standard 0.05 as our cutoff point for the significance level and this p-value is far below it, we conclude that the chi-squared test is statistically significant. This means that there is a statistically significant relationship between the variables county and high_income in this dataset.
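As a rough check on this reading, the p-value can be recomputed from the reported statistic and degrees of freedom. The sketch below takes 149.32 and 44 from the output described above; the variable names are illustrative, and the result should come out close to, though not necessarily digit-for-digit identical with, the reported 2.199e-13 because the statistic is rounded.

```r
# Values as reported in the chi-squared test output.
chi_sq <- 149.32
dof <- 44

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)
p_value

# Compare against the conventional 0.05 cutoff.
alpha <- 0.05
p_value < alpha
```

The comparison returns `TRUE`, which matches the conclusion that the relationship is statistically significant at the 0.05 level.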