R

Linear Regression

A linear regression is a type of regression analysis used to examine the direct association between a continuous dependent variable and one or more independent variables, which can be measured at any level (nominal, ordinal, interval, or ratio). This method assesses how changes in the mean of the dependent variable are influenced by the predictors included in the model.

In the example below, our research question is:

What are the predictors of the average math score in the dataset?

We will include the variables teachers (number of teachers), income (district average income, in USD 1,000), english (percent of English learners), and computer (number of computers) in our model to test their direct association with math (average math score).

Below is a breakdown of the variables included in our model to help us keep track of the types of variables we are working with.

Dependent variable

Average math score of respondent (math). This is a continuous variable that ranges from 605.4 to 709.5. To investigate this variable further, use the code for descriptive statistics to examine its distribution, which is very important for a linear regression model.
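As a minimal sketch of that descriptive check, the code below summarizes a simulated math variable. The values are generated for illustration only (they are an assumption, not the guide's dataset); with real data, we would replace the simulated vector with the math column of our data frame.

```r
# Simulated stand-in for the math variable (hypothetical values, not the real data)
set.seed(42)
math <- rnorm(420, mean = 653, sd = 19)

# Minimum, quartiles, mean, and maximum
math_summary <- summary(math)
print(math_summary)

# Standard deviation shows how spread out the scores are
math_sd <- sd(math)
print(math_sd)

# In an interactive session, hist(math) draws a histogram of the distribution
```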

Independent variables

  1. Number of teachers (teachers). This is a continuous variable counting the number of teachers in each school.
  2. Average income (income). This is a continuous variable measuring the average income in each district (in USD 1,000).
  3. English (english). This is a continuous variable measuring the percent of English learners.
  4. Computers (computer). This is a continuous variable measuring the number of computers in each school.

Formula

There are two formulas below: a general linear regression formula and the specific formula for our example.

Formula 1 below is a general linear regression formula that does not specify our variables and is a good starting place for building a linear regression model.

$Y_i=\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+\ldots+\beta_kx_{ki}+\varepsilon_i$

Formula 2 is specific to our analysis, which includes our dependent variable math and our independent variables teachers, income, english, and computer.

${math}_i=\beta_0+\beta_1{teachers}_i+\beta_2{income}_i+\beta_3{english}_i+\beta_4{computer}_i+\varepsilon_i$

In this guide we will focus on Formula 2 to further break down our linear regression model. Here, ${math}_i$ is the dependent variable for a specific observation $i$, which we predict with four independent variables. It is equal to $\beta_0$, the intercept of the model, which is where the regression line crosses the y axis when all of the independent variables are zero. We can think of $\beta_0$ as the starting math value for the observations in the dataset. Next, $\beta_1{teachers}_i$ is the variable teachers multiplied by its estimated regression coefficient, and this product is added to $\beta_0$. The same goes for $\beta_2{income}_i$, $\beta_3{english}_i$, and $\beta_4{computer}_i$, the remaining independent variables (income, english, and computer), each multiplied by its estimated coefficient. Lastly, $\varepsilon_i$ is the error term of the regression formula, which is the vertical distance between the observed value for observation $i$ and the regression line. The model is fit by choosing the coefficients that minimize the sum of these squared distances, which gives the best fit to our observed points.
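To make the formula concrete, here is a small arithmetic sketch. The coefficients and the district values are hypothetical numbers chosen for illustration (they are not results from this guide's model); the point is only to show how the terms are multiplied and added.

```r
# Hypothetical coefficients (NOT estimates from the guide's model)
beta <- c(intercept = 650, teachers = 0.01, income = 1.5,
          english = -0.4, computer = 0.002)

# Hypothetical values of the independent variables for one observation i
district <- c(teachers = 100, income = 20, english = 10, computer = 300)

# Predicted math = intercept + sum of (coefficient * variable value)
predicted_math <- beta[["intercept"]] + sum(beta[names(district)] * district)
print(predicted_math)  # 650 + 1 + 30 - 4 + 0.6 = 677.6

# The error is the gap between the observed and the predicted value
observed_math <- 680   # hypothetical observed score
error_i <- observed_math - predicted_math
print(error_i)         # 2.4
```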

We can run a linear regression model using the lm() function. When specifying our model, we start by listing our dependent variable, followed by a tilde (~). We then list our independent variables, placing a plus sign $+$ between them. After listing our independent variables, we also specify the dataset we are working with.

After creating our model, we can use the summary() function to get the results from our regression.
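A minimal sketch of these two steps is shown below. Because the dataset itself is not included here, the code simulates a stand-in data frame; the name schools, the object name model, and all generated values are assumptions for illustration, so the printed numbers will not match the output discussed in this guide.

```r
# Simulated stand-in data (hypothetical; not the guide's dataset)
set.seed(42)
n <- 420
schools <- data.frame(teachers = round(runif(n, 5, 1500)),
                      income   = runif(n, 5, 55),
                      english  = runif(n, 0, 85),
                      computer = round(runif(n, 0, 3000)))
schools$math <- 640 + 1.5 * schools$income - 0.4 * schools$english +
  rnorm(n, sd = 11)

# Dependent variable, then ~, then independent variables joined by +,
# and finally the dataset
model <- lm(math ~ teachers + income + english + computer, data = schools)

# Print the regression results
summary(model)
```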

Output

A

The first section shows us descriptive statistics of the residuals of the model. Residuals are the differences between the observed values of the dependent variable and the values predicted by the model. R provides us with the Min (minimum), 1Q (first quartile), Median, 3Q (third quartile), and Max (maximum) values of the residuals. We can use these to gauge how well our independent variables are predicting the dependent variable: a median near zero and quartiles of similar size on either side suggest the residuals are roughly symmetric.
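The sketch below shows where these residual statistics come from. It uses simulated stand-in data (the data frame schools and all values are hypothetical), so the numbers differ from the guide's output.

```r
# Simulated stand-in data (hypothetical; not the guide's dataset)
set.seed(42)
n <- 420
schools <- data.frame(teachers = round(runif(n, 5, 1500)),
                      income   = runif(n, 5, 55),
                      english  = runif(n, 0, 85),
                      computer = round(runif(n, 0, 3000)))
schools$math <- 640 + 1.5 * schools$income - 0.4 * schools$english +
  rnorm(n, sd = 11)
model <- lm(math ~ teachers + income + english + computer, data = schools)

# A residual is the observed value minus the predicted (fitted) value
res <- residuals(model)
by_hand <- schools$math - unname(fitted(model))
print(all.equal(unname(res), by_hand))

# Min, 1Q, Median, 3Q, Max of the residuals
print(quantile(res))
```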

B

The second section, coefficients, shows us the results from our regression analysis for each independent variable included. There are five rows of results: (Intercept), teachers, income, english, and computer. The Intercept corresponds to our $\beta_0$ in the regression formula, which can be thought of as our ‘starting’ point on the graph.

For the columns, the Estimate column contains the unstandardized beta coefficient for each variable, which is often reported in studies and publications. In a multiple linear regression, we interpret each coefficient as the expected change in the dependent variable math for a one-unit increase in that independent variable, holding the other independent variables constant. Lastly, on the right end of the table, the column Pr(>|t|) gives the p-value for each independent variable, which indicates whether it is a statistically significant predictor of math. We can see that income and english are statistically significant (below the 0.05 threshold), while teachers (p-value = 0.169) and computer (p-value = 0.105) are not significant predictors of math.
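If we want to work with the coefficient table directly, a sketch like the following can help. The data are simulated (the data frame schools is hypothetical, and only income and english are given real effects in the simulation), so the estimates and p-values are not those reported above.

```r
# Simulated stand-in data (hypothetical; not the guide's dataset)
set.seed(42)
n <- 420
schools <- data.frame(teachers = round(runif(n, 5, 1500)),
                      income   = runif(n, 5, 55),
                      english  = runif(n, 0, 85),
                      computer = round(runif(n, 0, 3000)))
schools$math <- 640 + 1.5 * schools$income - 0.4 * schools$english +
  rnorm(n, sd = 11)
model <- lm(math ~ teachers + income + english + computer, data = schools)

# The coefficient table as a matrix: one row per term
coefs <- coef(summary(model))
print(coefs[, c("Estimate", "Pr(>|t|)")])

# Terms whose p-value falls below the 0.05 threshold
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
print(significant)
```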

In addition, the last line of the output reports the overall F-test (ANOVA) for the model. This test assesses the statistical significance of the regression model as a whole, telling us whether the independent variables together explain more of the variation in math than a model with no predictors. The F-statistic is often reported in publications along with the degrees of freedom (DF). The p-value of the test is <2.2e-16, far below our .05 threshold, so we conclude that our regression model is statistically significant: at least one of the independent variables is associated with math.
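The p-value of the F-test is printed by summary() but not stored directly, so one way to recover it is sketched below. The data are simulated stand-ins (the data frame schools is hypothetical), so the F-statistic will differ from the guide's output.

```r
# Simulated stand-in data (hypothetical; not the guide's dataset)
set.seed(42)
n <- 420
schools <- data.frame(teachers = round(runif(n, 5, 1500)),
                      income   = runif(n, 5, 55),
                      english  = runif(n, 0, 85),
                      computer = round(runif(n, 0, 3000)))
schools$math <- 640 + 1.5 * schools$income - 0.4 * schools$english +
  rnorm(n, sd = 11)
model <- lm(math ~ teachers + income + english + computer, data = schools)

# Named vector with the F value and its two degrees of freedom
fstat <- summary(model)$fstatistic
print(fstat)

# Upper-tail probability of the F distribution gives the model p-value
p_value <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
              lower.tail = FALSE)
print(p_value)
```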

C

The third section provides a wealth of information related to the regression model. We will break it down line by line.

First, there is the Residual Standard Error, which is 11.47 on 415 degrees of freedom. This value tells us the average distance that the observed values fall from the regression line, indicating the typical size of the residuals (errors).
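The residual standard error can be reproduced by hand as the square root of the sum of squared residuals divided by the residual degrees of freedom (observations minus estimated coefficients). The sketch below checks this on simulated stand-in data; the data frame schools is hypothetical, so the value will not be 11.47.

```r
# Simulated stand-in data (hypothetical; not the guide's dataset)
set.seed(42)
n <- 420
schools <- data.frame(teachers = round(runif(n, 5, 1500)),
                      income   = runif(n, 5, 55),
                      english  = runif(n, 0, 85),
                      computer = round(runif(n, 0, 3000)))
schools$math <- 640 + 1.5 * schools$income - 0.4 * schools$english +
  rnorm(n, sd = 11)
model <- lm(math ~ teachers + income + english + computer, data = schools)

# Residual degrees of freedom: 420 observations - 5 coefficients = 415
df_res <- df.residual(model)

# Residual standard error computed by hand
rse_by_hand <- sqrt(sum(residuals(model)^2) / df_res)

# Compare with the value stored in the summary object
print(c(by_hand = rse_by_hand, from_summary = summary(model)$sigma))
```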

Next, we look at our model fit statistics to judge how well our independent variables explain the variance of the dependent variable, which is the average math score. The Multiple R-squared value of 0.6298 means that the variables teachers, income, english, and computer explain 62.98 percent of the variance in the average math score in this dataset. However, we also need to consider the Adjusted R-squared value, which is important because it adjusts for the number of independent variables included in our model. The more independent variables we include, the higher our R-squared value can become, even if those variables don't actually improve the model. The Adjusted R-squared accounts for this potential inflation and provides a more accurate measure of model fit. In this case, the Adjusted R-squared value is slightly lower than the Multiple R-squared, at 0.6262, reflecting a more realistic estimate of how well our independent variables explain the variance in the average math score.
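Both fit statistics can be computed by hand, which may make the adjustment easier to see. As a check on the reported values, and assuming 420 observations (415 residual degrees of freedom plus 5 estimated coefficients), $1-(1-0.6298)\times\frac{419}{415}\approx 0.6262$. The sketch below applies the same formulas to simulated stand-in data (the data frame schools is hypothetical), so its numbers differ from the guide's output.

```r
# Simulated stand-in data (hypothetical; not the guide's dataset)
set.seed(42)
n <- 420
schools <- data.frame(teachers = round(runif(n, 5, 1500)),
                      income   = runif(n, 5, 55),
                      english  = runif(n, 0, 85),
                      computer = round(runif(n, 0, 3000)))
schools$math <- 640 + 1.5 * schools$income - 0.4 * schools$english +
  rnorm(n, sd = 11)
model <- lm(math ~ teachers + income + english + computer, data = schools)

# R-squared: share of the total variation in math explained by the model
rss <- sum(residuals(model)^2)                      # unexplained variation
tss <- sum((schools$math - mean(schools$math))^2)   # total variation
r2  <- 1 - rss / tss

# Adjusted R-squared penalizes the number of independent variables (k)
k <- 4
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(c(r_squared = r2, adjusted = adj_r2))
```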