R

Exploratory Data Analysis

Exploratory Data Analysis (EDA) involves investigating and visualizing data sets to uncover patterns, trends, and relationships. It often utilizes descriptive statistics, such as mean, median, and standard deviation, to summarize and interpret the main features of the data, guiding further analysis and model development.

One way to explore your data is to look at its descriptive statistics, which are quick to get in RStudio for one or multiple variables. Descriptive statistics are measures that describe the distribution of observations in a variable; they are useful for analysis, transforming variables, and reporting. Each descriptive statistic has its own formula, which this guide does not cover, but we will walk through how to interpret each one.

For instance, calling the summary() function on a single variable returns its summary statistics.
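A minimal sketch of that call, assuming a hypothetical data frame `districts` with a numeric column `students`. The guide's own data set is not reproduced here, so the values below are made up for illustration only.

```r
# Hypothetical data (not the guide's data set): enrollment for six
# districts, with one value missing
districts <- data.frame(
  students = c(81, 420, 953, 3100, 27176, NA)
)

# summary() on a numeric column returns the minimum, quartiles, median,
# mean, and maximum, plus a count of NA's when any values are missing
student_summary <- summary(districts$students)
print(student_summary)
```

The `$` operator selects one column from the data frame, so summary() sees only that variable.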

The output shows us descriptive statistics and the number of missing values. Moving from left to right, we can see the Min. (minimum), 1st Qu. (first quartile), Median, Mean, 3rd Qu. (third quartile), Max. (maximum), and NA's (number of missing values).

The minimum value in this dataset is 81, meaning the smallest district has only 81 students. The maximum value is 27,176, representing the district with the most students. The average number of students in California districts is 2,634.6. Because the mean is much greater than the median (953), the distribution of this variable is skewed to the right.
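The mean-versus-median comparison can also be checked directly. The sketch below uses made-up enrollment values (a hypothetical vector `students`, not the guide's data) and treats a mean above the median as a rough sign of right skew, not a formal test.

```r
# Hypothetical enrollment values (not the guide's data)
students <- c(81, 420, 953, 3100, 27176, NA)

# na.rm = TRUE drops missing values before each statistic is computed;
# without it, both results would be NA
avg <- mean(students, na.rm = TRUE)
mid <- median(students, na.rm = TRUE)

# A mean well above the median points to a long right tail
right_skewed <- avg > mid
print(c(mean = avg, median = mid))
print(right_skewed)
```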

We can also calculate the descriptive statistics for all the variables with a single command by passing the whole data frame to summary().
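As a sketch, assuming a hypothetical data frame `districts` with two numeric columns, `students` and `teachers` (illustrative names and values, not the guide's data):

```r
# Hypothetical data frame with two numeric variables
districts <- data.frame(
  students = c(81, 420, 953, 3100, 27176),
  teachers = c(5, 22, 48, 150, 1429)
)

# Passing the whole data frame summarizes every column at once;
# the result is a table with one column per variable
all_summary <- summary(districts)
print(all_summary)
```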

If there are any character variables (like school, country, and grades), R will not compute numeric statistics for them. It simply reports the total number of observations (Length) and tells you that this is a character variable (Class and Mode).
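To see this behavior in isolation, the sketch below summarizes a hypothetical character vector `school` (made-up names, not taken from the guide's data).

```r
# Hypothetical character variable
school <- c("Sunol Glen", "Manzanita", "Thermalito")

# For character data, summary() reports Length, Class, and Mode
# instead of quartiles and a mean
char_summary <- summary(school)
print(char_summary)
```

If counts per category are needed instead, converting the variable with factor() before calling summary() is one common option.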