Making charts and graphs in RStudio is easy! We will provide examples using both base R and ggplot2, a popular package from the tidyverse. Charts and graphs are effective ways to present data to an audience, and ggplot2 offers powerful and flexible tools for creating visually appealing plots.
Histograms are best suited to continuous-level variables because, as the name suggests, their values lie on a continuum. Histograms are very helpful for investigating the distribution of a continuous variable, which is important for determining whether the variable needs to be recoded.
Code
We can create histograms with either base R or the ggplot2 package.
With the hist() function, we plot the distribution of the expenditure variable. With the tidyverse, we use the ggplot() and geom_histogram() functions to create the same graph. The ggplot() function enables us to customize our plots. For instance, we can change the number of bins, add a theme (the theme_bw() function), and change the labels of the x-axis and y-axis using the labs() function.
Output from base R
Output from ggplot()
Output from ggplot(), improved version
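The three histogram calls described above might look roughly like the sketch below. The original data are not shown here, so the schools data frame, its simulated expenditure values, the bin count, and the axis labels are all assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in data: simulated, right-skewed expenditure values.
set.seed(1)
schools <- data.frame(
  expenditure = rlnorm(200, meanlog = log(5300), sdlog = 0.12)
)

# Base R: hist() draws the plot and returns the bin counts invisibly.
h <- hist(schools$expenditure)

# ggplot2: the same histogram with default settings.
p_basic <- ggplot(schools, aes(x = expenditure)) +
  geom_histogram()

# ggplot2, improved: custom bin count, black-and-white theme, axis labels.
p_improved <- ggplot(schools, aes(x = expenditure)) +
  geom_histogram(bins = 20) +                 # bin count chosen arbitrarily
  theme_bw() +
  labs(x = "Expenditure", y = "Number of observations")

print(p_basic)
print(p_improved)
```

Storing each plot in an object before printing makes it easy to compare the default and customized versions side by side.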
The histogram shows us the range of expenditure values among the observations and how frequently each range occurs. We can also see that the distribution of expenditure does not follow a normal curve (it is close to one, but not quite normal) and is skewed to the right. This may affect the results of our earlier statistical tests.
Boxplots, often called box-and-whisker plots, are used to represent the quartiles of continuous-level variables. Boxplots display the variation in the sample with boxes that represent the quartiles and 'whiskers' that cover observations outside the upper and lower quartiles. These plots can be drawn for a single variable or for multiple variables, as we will see below.
Code
We can create boxplots with either base R or the ggplot2 package.
With the boxplot() function, we plot the distribution of the expenditure variable. With the tidyverse, we use the ggplot() and geom_boxplot() functions to create the same graph. The ggplot() function enables us to customize our plots. For instance, we can add a theme (the theme_bw() function) and change the labels of the x-axis and y-axis using the labs() function.
The boxplot marks the median of expenditure with a horizontal line inside the gray box. The bottom and top edges of the gray box are the first (Q1) and third (Q3) quartiles of the distribution. The whiskers extend to the smallest and largest values of expenditure that are not flagged as outliers (by default, values within 1.5 times the interquartile range of the box). Dots are outliers.
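A minimal sketch of the single-variable boxplots described above follows. The schools data frame and its simulated expenditure values are hypothetical stand-ins, and the axis labels are illustrative choices.

```r
library(ggplot2)

# Hypothetical stand-in data: simulated expenditure values.
set.seed(1)
schools <- data.frame(
  expenditure = rlnorm(200, meanlog = log(5300), sdlog = 0.12)
)

# Base R: boxplot() draws the plot and returns the summary statistics.
b <- boxplot(schools$expenditure)

# ggplot2: a single boxplot of expenditure on the y-axis.
p_basic <- ggplot(schools, aes(y = expenditure)) +
  geom_boxplot()

# ggplot2, improved: black-and-white theme and clearer axis labels.
p_improved <- ggplot(schools, aes(y = expenditure)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = "", y = "Expenditure")

print(p_basic)
print(p_improved)
```

The object returned by boxplot() stores the five values used to draw the box and whiskers in b$stats, which can help when checking what the plot shows.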
Output from ggplot()
Output from ggplot(), improved version
We can also create a boxplot of the expenditure variable broken down by other variables. For instance, we can graph expenditure for two counties in the county variable.
This code might look intimidating at first. However, each step configures a specific aspect of the plot:
- The filter() function restricts the county variable to only two values: Sonoma and Merced.
- The geom_boxplot() function creates a boxplot of expenditure by county.
- The theme_bw() function applies a black-and-white theme to the plot.
- The labs() function changes the x-axis and y-axis names.
- The coord_flip() function flips the x and y coordinates.
- The scale_x_continuous() function changes how the x-axis scale looks:
  - the breaks argument, combined with the seq() function, alters the x-axis ticks;
  - the limits argument alters the lower and upper limits of the x-axis.
This boxplot is separated by the two counties (Merced and Sonoma), and expenditure is represented on the y-axis. This helps us to see the distribution of expenditure by county.
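One way the steps listed above might fit together is sketched below. The data are simulated, and the schools and two_counties names, the third county ("Kern"), the axis labels, and the break and limit values are assumptions. The sketch also assumes ggplot2 3.3.0 or later, where geom_boxplot() accepts the continuous variable on x and the grouping variable on y.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data: three counties with simulated expenditure.
set.seed(1)
schools <- data.frame(
  county = rep(c("Sonoma", "Merced", "Kern"), each = 30),
  expenditure = rlnorm(90, meanlog = log(5300), sdlog = 0.1)
)

# Keep only the two counties of interest.
two_counties <- schools %>%
  filter(county %in% c("Sonoma", "Merced"))

p_county <- ggplot(two_counties, aes(x = expenditure, y = county)) +
  geom_boxplot() +                              # boxplot of expenditure by county
  theme_bw() +                                  # black-and-white theme
  labs(x = "Expenditure", y = "County") +       # axis names
  coord_flip() +                                # expenditure ends up on the vertical axis
  scale_x_continuous(
    breaks = seq(3000, 8000, by = 1000),        # tick positions (illustrative)
    limits = c(3000, 8000)                      # lower and upper limits (illustrative)
  )

print(p_county)
```

Because coord_flip() swaps the axes only at drawing time, the scale and label calls still refer to expenditure as x even though it is displayed vertically.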
Bar plots are best used to represent ordinal-level variables and show how the observations are distributed across the categories. We can graph a bar plot of a single variable or of multiple variables for a direct comparison.
We can create bar plots with either base R or the ggplot2 package.
With the barplot() function, we plot the distribution of the grades variable. With the tidyverse, we use the ggplot() and geom_bar() functions to create the same graph. The ggplot() function enables us to customize our plots. For instance, we can add a theme (the theme_bw() function) and change the labels of the x-axis and y-axis using the labs() function.
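The bar plot calls described above could be sketched as follows. The schools data frame and its counts are made up for illustration, and the axis labels are assumptions. Note that barplot() expects counts rather than raw values, so the sketch tabulates grades with table() first.

```r
library(ggplot2)

# Hypothetical stand-in data: 10 KK-06 and 40 KK-08 observations.
schools <- data.frame(
  grades = rep(c("KK-06", "KK-08"), times = c(10, 40))
)

# Base R: tabulate the categories, then plot the counts.
grade_counts <- table(schools$grades)
barplot(grade_counts)

# ggplot2: geom_bar() counts the observations in each category itself.
p_basic <- ggplot(schools, aes(x = grades)) +
  geom_bar()

# ggplot2, improved: black-and-white theme and clearer axis labels.
p_improved <- ggplot(schools, aes(x = grades)) +
  geom_bar() +
  theme_bw() +
  labs(x = "Grades", y = "Number of observations")

print(p_basic)
print(p_improved)
```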
Output from ggplot()
Output from ggplot(), improved version
The bar plots above show the raw count of observations in each category of the grades variable. We can clearly see that there are more KK-08 observations than KK-06 observations in the dataset.
This code might look intimidating at first. However, each step configures a specific aspect of the plot:
- The filter() function restricts the county variable to only two values: Sonoma and Merced.
- The geom_bar() function creates a bar plot of grades by county:
  - the fill and color arguments fill and color the bars by the county variable.
- The theme_minimal() function applies a minimal theme to the plot.
- The labs() function changes the x-axis and y-axis names.
- The coord_flip() function flips the x and y coordinates.
- The scale_y_continuous() function changes how the y-axis scale looks:
  - the breaks argument, combined with the seq() function, alters the y-axis ticks;
  - the limits argument alters the lower and upper limits of the y-axis.
We have broken down the observations by grades (KK-06 and KK-08) and by county (Merced and Sonoma).
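The steps listed above might be combined as in the sketch below. The data, the schools and two_counties names, the third county ("Kern"), the labels, and the break and limit values are hypothetical. Mapping fill and color inside aes() and leaving the bars stacked are choices made for this sketch, not necessarily the original ones.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data: grades recorded for three counties.
schools <- data.frame(
  county = rep(c("Sonoma", "Merced", "Kern"), times = c(20, 10, 10)),
  grades = c(
    rep(c("KK-06", "KK-08"), times = c(5, 15)),  # Sonoma
    rep(c("KK-06", "KK-08"), times = c(2, 8)),   # Merced
    rep("KK-08", 10)                             # Kern
  )
)

# Keep only the two counties of interest.
two_counties <- schools %>%
  filter(county %in% c("Sonoma", "Merced"))

p_grades <- ggplot(two_counties,
                   aes(x = grades, fill = county, color = county)) +
  geom_bar() +                                   # stacked counts per grade category
  theme_minimal() +                              # minimal theme
  labs(x = "Grades", y = "Number of observations") +
  coord_flip() +                                 # counts end up on the horizontal axis
  scale_y_continuous(
    breaks = seq(0, 25, by = 5),                 # tick positions (illustrative)
    limits = c(0, 25)                            # must cover the tallest stacked bar
  )

print(p_grades)
```

If the limits are set lower than the height of a bar, ggplot2 drops that bar with a warning, so the upper limit should be checked against the largest count.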
Scatter plots are best used to graphically show if there is a relationship between two variables and what that relationship may look like.
We can create scatter plots with either base R or the ggplot2 package.
With the plot() function, we plot the relationship between the students and teachers variables. With the tidyverse, we use the ggplot() and geom_point() functions to create the same graph. The ggplot() function enables us to customize our plots. For instance, we can add a theme (the theme_bw() function), change the labels of the x-axis and y-axis using the labs() function, and even add a regression line using the geom_smooth() function.
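A rough sketch of the scatter plot calls described above is given below. The schools data frame is simulated, and placing teachers on the x-axis, the axis labels, and the use of method = "lm" for the regression line are assumptions made for this example.

```r
library(ggplot2)

# Hypothetical stand-in data: student counts that rise with teacher counts.
set.seed(1)
teachers <- round(runif(100, min = 10, max = 150))
schools <- data.frame(
  teachers = teachers,
  students = round(teachers * 20 + rnorm(100, sd = 30))
)

# Base R: first argument goes on the x-axis, second on the y-axis.
plot(schools$teachers, schools$students)

# ggplot2: the same scatter plot with default settings.
p_basic <- ggplot(schools, aes(x = teachers, y = students)) +
  geom_point()

# ggplot2, improved: theme, axis labels, and a fitted straight line.
p_improved <- ggplot(schools, aes(x = teachers, y = students)) +
  geom_point() +
  geom_smooth(method = "lm") +   # linear fit; the method is an assumption
  theme_bw() +
  labs(x = "Number of teachers", y = "Number of students")

print(p_basic)
print(p_improved)
```

Without a method argument, geom_smooth() picks a smoother automatically, which may draw a curve rather than a straight regression line.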
Output from ggplot()
Output from ggplot(), improved version
Above are scatter plots of the variables students by teachers. Scatter plots are very helpful for examining continuous-level variables and checking whether a graphical relationship exists. We can see in this scatter plot that there is a positive, linear relationship between the number of students and the number of teachers. After looking at this graph, we would next want to conduct statistical tests to see whether the relationship is statistically significant.