Linear regression is one type of regression analysis, used to examine the direct association between a dependent variable, which must be continuous, and one or more independent variables, which can be of any level of measurement: nominal, ordinal, interval, or ratio. A linear regression tests how the mean of the dependent variable changes with the predictors included in our model, the independent variable(s).
In the example below, our research question is:
What are the predictors of individuals’ wages in the dataset?
We are going to include the variables “age”, “sex”, “education”, and “language” in our model to test their direct association with “wages”. Let’s break the variables down a bit more to better understand our linear regression model.
Below is a breakdown of the variables included in our model to help us keep track of the types of variables we are working with.
Dependent variable
wages. This is a continuous variable that ranges from 2.30 to 49.92, which is a large range! If you would like to investigate this variable further, use the SYNTAX for descriptive statistics to get the mean, median, mode, and standard deviation and better understand its distribution, which is very important for a linear regression model.
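For example, the following SYNTAX requests those descriptive statistics along with a histogram (a sketch, assuming the dependent variable is named “wages” in your dataset):
* Descriptive statistics and histogram for the dependent variable.
FREQUENCIES VARIABLES=wages
/STATISTICS=MEAN MEDIAN MODE STDDEV
/HISTOGRAM.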
Independent variables
age, sex, education, and language. As noted above, independent variables in a linear regression can be of any level of measurement (nominal, ordinal, interval, or ratio), so it is worth checking how each of these variables is coded in the dataset before interpreting its coefficient.
Formula
There are two formulas below: a general linear regression formula and the specific formula for our example.
Formula 1, below, is a general linear regression formula that does not specify the variables used and is a good starting place for building a linear regression model.
1. $y_i=\beta_0+\beta_1(x_1)+\beta_2(x_2)+\dots+\beta_k(x_k)+\varepsilon$
Formula 2 is specific to our analysis and includes our dependent variable “wages” and our independent variables “age”, “sex”, “education”, and “language”.
2. ${wages}_i=\beta_0+age(x_1)+sex(x_2)+education(x_3)+language(x_4)+\varepsilon$
In this guide we will focus on Formula 2 to further break down our linear regression test. Here, ${wages}_i$ is the dependent variable of the model, which we are predicting with four independent variables for a specific observation $i$. It is equal to $\beta_0$, the intercept of the model, where our regression line crosses the y-axis when all $x$ values are zero. We can think of $\beta_0$ as the starting wage value of the observations in the dataset. Next, $age(x_1)$ is the variable “age” multiplied by its calculated regression coefficient and added to $\beta_0$. The same goes for $sex(x_2)$, $education(x_3)$, and $language(x_4)$, the remaining independent variables “sex”, “education”, and “language”, each multiplied by its calculated coefficient in the model. Lastly, $\varepsilon$ is the error term of the regression formula: the distance of each point ($i$) from the predicted regression line. We want to minimize this distance between our points and the regression line to achieve the best fit to our observed points.
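To make the formula concrete, here is a worked example using purely hypothetical coefficient values (these are not the estimates from the output below). Suppose the intercept were 1.50 and the estimated coefficients were 0.25 for “age”, 3.00 for “sex”, 0.90 for “education”, and 0.10 for “language”. For an observation with age 30, sex coded 1, 12 years of education, and language coded 0, the predicted wage would be:
${\widehat{wages}}_i=1.50+0.25(30)+3.00(1)+0.90(12)+0.10(0)=22.80$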
Below is the SPSS SYNTAX for our linear regression.
SYNTAX
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF R ANOVA
/DEPENDENT wages
/METHOD=ENTER age sex education language.
We use the REGRESSION command to tell SPSS we are building a linear regression model. On the next line, the /MISSING subcommand indicates that we want LISTWISE deletion, which drops any case missing a value on any variable in the model, rather than PAIRWISE deletion, the less aggressive deletion method. The /STATISTICS subcommand requests the regression coefficients (COEFF), the R and R-squared measures (R), and an analysis of variance (ANOVA). We name our dependent variable on the /DEPENDENT subcommand and our independent variables on the /METHOD subcommand.
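If you would also like 95% confidence intervals around each coefficient, the /STATISTICS subcommand can be extended with CI(95). Below is a sketch of this variation, assuming the same variable names:
* Same model, adding 95% confidence intervals for the coefficients.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF CI(95) R ANOVA
/DEPENDENT wages
/METHOD=ENTER age sex education language.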
Output
A
In the first output table, “Variables Entered/Removed”, SPSS shows an overall summary of the regression model. The first column, “Model”, is the number of the model we have run with this set of variables; since this is our first and only model, it is labeled 1. The next column, “Variables Entered”, shows the independent variables we included in the regression model. “Variables Removed” lists the variables that have been removed from the regression model; since we are only running one model, no variables are excluded. Finally, “Method” indicates the way in which we included our independent variables in the regression model. We used the “Enter” method, the standard way of input, which enters all the variables at once.
B
In the second table below, “Model Summary”, is where we find the model fit statistics used to judge how well our independent variables explain the variance in wages. The “R Square” value (second column from the left) is 0.297, which is interpreted as: the variables “age”, “sex”, “education”, and “language” together explain 29.70% of the variance in individuals’ wages in this dataset. This is a high value! However, we also need to look at the “Adjusted R Square”, which accounts for the number of independent variables in our model. This matters because R Square rises as we add more independent variables, whether or not they are useful; the Adjusted R Square corrects for this inflation. We can see that in this case the Adjusted R Square value is the same as the R Square value, 0.297.
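For reference, the Adjusted R Square can be computed from the R Square, the sample size $n$, and the number of predictors $k$:
$R_{adj}^2=1-(1-R^2)\frac{n-1}{n-k-1}$
When $n$ is large relative to $k$, the adjustment is very small; the fact that both values round to 0.297 here suggests that is the case in this dataset.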
C
In the third table, “ANOVA”, SPSS shows us the results of the ANOVA test. An ANOVA is used to test the statistical significance of the overall regression model, telling us whether our model as a whole is significant or not. We will focus on the two rightmost columns in the table, “F” and “Sig.”. The “F” column contains the F statistic of the ANOVA results, which is often reported in publications. The “Sig.” column is the statistical significance of the ANOVA test, which we can see is .000 (that is, p < .001), far below our .05 threshold. We can interpret this to mean that our regression model is statistically significant and that what we are examining ‘matters’.
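For reference, the F statistic is the ratio of the variance explained by the model to the unexplained variance, built from the sums of squares in this table, with $k$ predictors and $n$ observations:
$F=\frac{SS_{regression}/k}{SS_{residual}/(n-k-1)}$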
D
The fourth and final table, “Coefficients”, shows the results of our regression analysis for each term included. There are five rows of results: “(Constant)”, “age”, “sex”, “education”, and “language”. “(Constant)” corresponds to $\beta_0$, the intercept in our regression formula, which can be thought of as our ‘starting’ point on the graph; the remaining rows are the independent variables we included in the model.
For the columns, “Unstandardized Coefficients B” contains the unstandardized beta coefficients, which are most often reported in studies and publications. In a linear regression, we interpret each coefficient as the change in the dependent variable “wages” associated with a one-unit increase in that independent variable, holding the other variables constant. Lastly, at the right end of the table, the “Sig.” column gives the significance of each independent variable, indicating whether it is a significant predictor of “wages”. We can see that all independent variables except “language” (p = .689) are significant predictors of “wages”.
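As an illustration with a hypothetical value (not the coefficient from this output): if B for “education” were 0.90, then each additional unit of education would be associated with a 0.90-unit increase in “wages”, holding the other variables constant:
$\Delta\widehat{wages}=B_{education}\times\Delta education=0.90\times 1=0.90$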