R

Linear Regression

A linear regression is one type of regression test used to analyze the direct association between a continuous dependent variable and one or more independent variables, which can be at any level of measurement: nominal, ordinal, interval, or ratio. A linear regression tests how the mean of the dependent variable changes with the predictors (the independent variables) included in our model.

In the example below, our research question is:

What are the predictors of individuals' wages in the dataset?

We are going to include the variables age, sex, education, and language in our model to test their direct association with wages. Below is a breakdown of the variables included in our model to help us keep track of the types of variables we are working with.

Dependent variable

Wages of respondent (wages). This is a continuous variable that ranges from 2.30 to 49.92, which is a large range! If you would like to investigate this variable further, use the descriptive statistics code to better understand its distribution, which is very important for a linear regression model.

Independent variables

  1. Age of respondent in years (age). This is a continuous variable measuring the age of each respondent.
  2. Sex of respondent (sex). This is a nominal variable measuring the sex of each respondent and is coded as 1 = Female and 2 = Male.
  3. Education of respondent in years (education). This is a continuous variable measuring the number of years of education each respondent has.
  4. Language of respondent (language). This is a nominal variable measuring the language that each respondent speaks. Language is coded as 1 = English, 2 = French, and 3 = Other.

Formula

There are two formulas below: a general linear regression formula and the specific formula for our example.

 

Formula 1 below is a general linear regression formula that does not specify our variables and is a good starting place for building a linear regression model.

  1. $Y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \varepsilon$

 

Formula 2 is specific to our analysis that includes our dependent variable wages and our independent variables age, sex, education, and language.

  2. $wages_i = \beta_0 + \beta_1\,age(x_1) + \beta_2\,sex(x_2) + \beta_3\,education(x_3) + \beta_4\,language(x_4) + \varepsilon$

 

In this guide we will focus on formula 2 to further break down our linear regression test. Here, $wages_i$ is the dependent variable of the model: the wage of a specific observation $i$, which we are predicting with four independent variables. This is equal to $\beta_0$, the intercept of the model, where our regression line intersects the y axis when every $x$ is zero. We can think of $\beta_0$ as the starting wage value of the observations in the dataset. Next, $age(x_1)$ is the variable age, which is multiplied by its estimated regression coefficient and added to $\beta_0$. The same goes for $sex(x_2)$, $education(x_3)$, and $language(x_4)$, the remaining independent variables, which are each multiplied by their estimated coefficients in the model. Lastly, $\varepsilon$ is the error term of the regression formula, which is the vertical distance of each point ($i$) from the predicted regression line. We want to minimize this distance between our points and the regression line to get the best fit to our observed points.
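To make the formula more concrete, here is a small sketch that fits the same kind of model and then rebuilds one predicted wage by hand. Note that the data frame `sim` and every value in it are simulated for illustration only; they are not the SLID data, so the coefficients will not match the output discussed in this guide.

```r
# Simulated stand-in data; all values are invented, NOT the SLID dataset
set.seed(1)
n <- 200
sim <- data.frame(
  age       = sample(18:65, n, replace = TRUE),
  sex       = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  education = sample(8:20, n, replace = TRUE),
  language  = factor(sample(c("English", "French", "Other"), n, replace = TRUE))
)
sim$wages <- 2 + 0.2 * sim$age + 3 * (sim$sex == "Male") +
  0.9 * sim$education + rnorm(n, sd = 6)

fit <- lm(wages ~ age + sex + education + language, data = sim)

b <- coef(fit)                  # intercept and slope estimates
x <- model.matrix(fit)[1, ]     # observation 1: a 1 for the intercept, then its predictor values
by_hand <- sum(b * x)           # intercept + coefficient * value, summed over all terms

# Compare the hand calculation with the fitted value R stores for observation 1
all.equal(by_hand, unname(fitted(fit)[1]))
```

The difference between the observed wage of that observation and `by_hand` is its error term in the fitted model.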

Below is the RStudio code for our linear regression. 

Code

summary(lm(wages ~ age + sex + education + language, data = SLID))

 

We are creating a summary of the results from our regression. We specify the lm (linear model) function with wages, the dependent variable, on the left of the ~ and our independent variables age, sex, education, and language on the right. The observations used in this model come from the SLID dataset.

Output

A

The first section shows us descriptive statistics of the residuals of the model. Residuals are the differences between the observed values of the dependent variable and the values predicted by the model. RStudio provides us with the Min (minimum), 1Q (first quartile), Median, 3Q (third quartile), and Max (maximum) values of the residuals. We can use these to gauge how well our independent variables are predicting the dependent variable.
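If you would like to see where the residuals come from, the sketch below computes them by hand as observed minus predicted values. The data frame `sim` is simulated for illustration and is not the SLID data.

```r
# Simulated stand-in data; all values are invented, NOT the SLID dataset
set.seed(2)
sim <- data.frame(education = sample(8:20, 100, replace = TRUE))
sim$wages <- 3 + 0.9 * sim$education + rnorm(100, sd = 6)

fit <- lm(wages ~ education, data = sim)

# A residual is the observed value minus the value predicted by the model
res_by_hand <- sim$wages - fitted(fit)
all.equal(unname(res_by_hand), unname(residuals(fit)))

# Minimum, quartiles, and maximum of the residuals
quantile(residuals(fit))
```

A median close to zero and first and third quartiles of similar size suggest the residuals are roughly symmetric around the regression line.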

 

B

The second section, Coefficients:, shows us the results from our regression analysis for each independent variable included. There are six rows of results: (Intercept), age, sex, education, languageFrench, and languageOther. The (Intercept) corresponds to our $\beta_0$ in the regression formula, which can be thought of as our ‘starting’ point on the graph.
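Language takes up two rows because R, by default, turns a nominal (factor) variable into one indicator column per category, leaving out a reference category (here English). The three-row data frame `toy` below is a hypothetical example, not taken from SLID, that shows this coding.

```r
# Hypothetical three-row example, NOT taken from the SLID dataset
toy <- data.frame(language = factor(c("English", "French", "Other")))

# Show the indicator (dummy) columns R builds for a factor predictor
mm <- model.matrix(~ language, data = toy)
mm
colnames(mm)
```

Each language coefficient is then read as the difference in the dependent variable between that category and the reference category.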

Among the columns, Estimate gives the unstandardized beta coefficient for each variable, which is often reported in studies and publications. In a multiple linear regression, we interpret each coefficient as the expected change in the dependent variable, wages, for a one-unit increase in that independent variable while the other independent variables are held constant. Lastly, at the right end of the table, the column Pr(>|t|) gives the p-value for each independent variable, which indicates whether it is a statistically significant predictor of wages. We can see that all the independent variables except language (p = .6887) are significant predictors of wages.

Fourth, in the last line of the output, RStudio shows us the results from the ANOVA test. An ANOVA is used to test the statistical significance of the overall regression model, telling us whether the model as a whole is significant. The F-statistic of the ANOVA test is often reported in publications along with its DF, or degrees of freedom. The p-value is the statistical significance of the ANOVA test, which we can see is < 2.2e-16, far below our .05 threshold. We can interpret this as our regression model being statistically significant: taken together, our independent variables predict wages better than a model with no predictors.
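As a rough sketch of what this test does, the code below pulls the F-statistic out of a model summary, converts it to a p-value, and compares it with an explicit ANOVA against an intercept-only model. The data frame `sim` is simulated for illustration and is not the SLID data.

```r
# Simulated stand-in data; all values are invented, NOT the SLID dataset
set.seed(3)
n <- 150
sim <- data.frame(
  age       = sample(18:65, n, replace = TRUE),
  education = sample(8:20, n, replace = TRUE)
)
sim$wages <- 2 + 0.2 * sim$age + 0.9 * sim$education + rnorm(n, sd = 6)

fit      <- lm(wages ~ age + education, data = sim)
null_fit <- lm(wages ~ 1, data = sim)     # model with no predictors

f <- summary(fit)$fstatistic              # F value, numerator DF, denominator DF
p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
p_value

# The same F-statistic from comparing the two models directly
cmp <- anova(null_fit, fit)
cmp
```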

C

The third section shows us a host of information relating to the regression model. We will break it down line by line. 

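First, there is the Residual standard error, which is 6.6 on 3981 degrees of freedom. This tells us that the observed wages fall, on average, about 6.6 units away from the wages predicted by the model, in the same units as the wages variable. The smaller this value, the closer our points are to the regression line.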
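The degrees of freedom here are the number of observations used minus the number of estimated coefficients (in our output, 3987 - 6 = 3981). The sketch below computes a residual standard error by hand on simulated data; `sim` is invented for illustration and is not the SLID data.

```r
# Simulated stand-in data; all values are invented, NOT the SLID dataset
set.seed(4)
sim <- data.frame(education = sample(8:20, 120, replace = TRUE))
sim$wages <- 3 + 0.9 * sim$education + rnorm(120, sd = 6)

fit <- lm(wages ~ education, data = sim)

df_resid <- nrow(sim) - length(coef(fit))   # observations minus estimated coefficients
rss <- sum(residuals(fit)^2)                # residual sum of squares
rse <- sqrt(rss / df_resid)                 # residual standard error

# Compare with the value R reports in the model summary
all.equal(rse, summary(fit)$sigma)
```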

Second, RStudio tells us that 3438 observations were deleted due to missing values through listwise deletion. We know there are 7425 observations in the dataset, so after deletion 3987 observations are used in the regression analysis.
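Listwise deletion means that any row with a missing value on at least one variable in the model is dropped before fitting. The six-row data frame `toy` below is a hypothetical example, not the SLID data, that shows how to count the rows lost this way.

```r
# Hypothetical toy data, NOT the SLID dataset: two rows have a missing value
toy <- data.frame(
  wages     = c(10.5, NA, 14.2, 20.1, 17.3, 12.8),
  education = c(12, 14, NA, 16, 15, 11),
  age       = c(25, 40, 33, 51, 45, 29)
)

fit <- lm(wages ~ education + age, data = toy)

n_total   <- nrow(toy)                    # rows in the data
n_dropped <- sum(!complete.cases(toy))    # rows with at least one missing value
n_used    <- nobs(fit)                    # rows actually used by lm()

c(total = n_total, dropped = n_dropped, used = n_used)
```

Counting with `complete.cases()` on the whole data frame only matches what `lm()` drops when every column of the data frame is in the model, as is the case in this toy example.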

The third line gives our model fit statistics, which we use to judge how well our independent variables explain the variance of wages. A Multiple R-squared value of 0.2973 is interpreted as the variables age, sex, education, and language explaining 29.73% of the variance of individuals' wages in this dataset. This is a high value! However, we also need to look at the Adjusted R-squared, which accounts for the number of independent variables in our model. Adjusted R-squared is important because R-squared never decreases when we add independent variables to the model, even if they explain very little. The Adjusted R-squared corrects for this inflation from the number of variables included. We can see that the Adjusted R-squared value, 0.2964, is slightly lower than the Multiple R-squared.
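We can check the adjustment ourselves with the usual adjusted R-squared formula, using only the numbers reported in the output above. The object names below (`r2`, `n`, `k`) are ours, and because the reported R-squared is rounded, the result should only be expected to agree up to rounding.

```r
r2 <- 0.2973   # Multiple R-squared reported in the output
n  <- 3987     # observations used after listwise deletion
k  <- 5        # slope coefficients: age, sex, education, languageFrench, languageOther

# Adjusted R-squared: penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)   # rounds to 0.2964, the Adjusted R-squared reported above
```

Note that n - k - 1 is 3981, the same degrees of freedom reported with the residual standard error.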