A linear regression is a type of regression analysis used to examine the direct association between a continuous dependent variable and one or more independent variables, which can be at any level of measurement: nominal, ordinal, interval, or ratio. A linear regression tests how the mean of the dependent variable changes with the predictors included in our model, the independent variable(s).
In the example below, our research question is:
What are the predictors of individuals' wages in the dataset?
We are going to include the variables age, sex, education, and language in our model to test their direct association with wages. Let’s break the variables down a bit more to better understand our linear regression model.
Below is a breakdown of the variables included in our model to help us keep track of the types of variables we are working with.
Dependent variable
Wages of respondent (wages). This is a continuous variable that ranges from 2.30 to 49.92, which is a large range! If you would like to investigate this variable further, use the code for the descriptive statistics to better understand its distribution, which is very important for a linear regression model.
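One way to look at the distribution of wages before fitting the model is sketched below. It assumes the dataset is available under the name SLID, as in the regression code used in this guide. The choice of PROC UNIVARIATE and of a histogram with a fitted normal curve is illustrative, not the only way to get descriptive statistics.

```sas
/* Descriptive statistics for the dependent variable (illustrative sketch) */
PROC UNIVARIATE DATA = SLID;
    VAR wages;                   /* mean, median, range, skewness, and more */
    HISTOGRAM wages / NORMAL;    /* histogram with a fitted normal curve */
RUN;
```

The histogram gives a quick visual check of how skewed wages is, which is useful to know before interpreting the regression results.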
Independent variables
Formula
There are two formulas below: a general linear regression formula and the specific formula for our example.
Formula 1 below is a general linear regression formula that does not specify our variables and is a good starting place for building a linear regression model.
Formula 2 is specific to our analysis that includes our dependent variable wages and our independent variables age, sex, education, and language.
2. $wages_i = \beta_0 + age(x_1) + sex(x_2) + education(x_3) + language(x_4) + \varepsilon$
In this guide we will focus on formula 2 to further break down our linear regression test. Here, $wages_i$ is the dependent variable of the model, which we are predicting for a specific observation $i$ using four independent variables. This is equal to $\beta_0$, the intercept of the model, where our regression line intersects the y axis when every $x$ is zero. We can think of $\beta_0$ as the starting wage value of the observations in the dataset. Next, $age(x_1)$ is the variable age multiplied by its calculated regression coefficient, which is added to $\beta_0$. The same goes for $sex(x_2)$, $education(x_3)$, and $language(x_4)$: the remaining independent variables, sex, education, and language, are each multiplied by their calculated coefficients in the model. Lastly, $\varepsilon$ is the error term of the regression formula, which is the distance of each point ($i$) from the predicted regression line. We want to minimize this distance between our points and the regression line to get the best fit to our observed points.
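To make the coefficients explicit, formula 2 can also be written as shown here. The symbols $\beta_1$ through $\beta_4$, the subscript $i$ on each variable, and the hat notation for predicted values are assumptions added for illustration; they do not appear in the formula above.

$$wages_i = \beta_0 + \beta_1\,age_i + \beta_2\,sex_i + \beta_3\,education_i + \beta_4\,language_i + \varepsilon_i$$

Under ordinary least squares, which is the estimation method PROC REG uses, "minimizing the distance" means choosing the coefficients that make the sum of squared differences between observed and predicted wages as small as possible:

$$\min_{\beta_0,\ldots,\beta_4}\ \sum_{i=1}^{n}\left(wages_i - \widehat{wages}_i\right)^2$$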
Below is the SAS code for our linear regression.
Code
PROC REG DATA = SLID;
MODEL wages = age sex education language;
RUN;
We run the PROC (procedure) REG (linear regression) on the DATA set SLID. In the MODEL statement, the dependent variable wages goes to the left of the equals sign and the independent variables age, sex, education, and language go to the right, separated by spaces. We then end with the RUN command.
Output
A
In the first output section to the right, SAS provides a summary of the observations in our regression model. We can see that 7425 observations exist in the dataset. Of these, 3438 have a missing value on at least one variable in the model and are dropped through listwise deletion. The Number of Observations Used, 3987, is the number of observations that remain and are included in the regression model.
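To see which variables are responsible for the missing values, a check along the following lines can help. This is a sketch that assumes all five model variables are stored as numeric in SLID, since PROC MEANS only accepts numeric analysis variables.

```sas
/* Count non-missing (N) and missing (NMISS) values for each model variable */
PROC MEANS DATA = SLID N NMISS;
    VAR wages age sex education language;
RUN;
```

Because listwise deletion drops an observation when any one of its variables is missing, the per-variable missing counts do not have to add up to 3438.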
B
In the second table, “Analysis of Variance”, SAS shows us the results from the ANOVA test. The ANOVA tests the statistical significance of the overall regression model, telling us whether the independent variables, taken together, predict wages better than a model with no predictors. We will focus on the two rightmost columns in the table, F Value and Pr > F. The F Value column holds the F statistic of the ANOVA results, which is often reported in publications. The Pr > F column is the p-value of the ANOVA test, which we can see is <.0001, far below our .05 threshold. We can interpret this as our regression model being statistically significant: at least one of the independent variables is associated with wages.
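For reference, the overall F test can be written out as follows. The symbols $\beta_1,\ldots,\beta_4$ (the coefficients of the four independent variables), $p$ (the number of independent variables), and $n$ (the number of observations used) are notation assumed here for illustration.

$$H_0:\ \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$$

$$F = \frac{MS_{Model}}{MS_{Error}} = \frac{SS_{Model}/p}{SS_{Error}/(n-p-1)}$$

Assuming each independent variable enters the model as a single term, $p = 4$ and $n = 3987$, so the statistic is compared against an F distribution with $4$ and $3987 - 4 - 1 = 3982$ degrees of freedom. These values should match the DF column of the Analysis of Variance table.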
C
The third table shows us our model fit statistics, which we use to judge how well our independent variables explain the variance of wages. The two rows on the right, labeled R-Squared and Adj R-Sq, are used to judge our model fit. An R-Squared value of 0.297 is interpreted as the variables age, sex, education, and language explaining 29.7% of the variance of individuals' wages in this dataset. This is a high value! However, we also need to look at the Adj R-Sq, which accounts for the number of independent variables in our model. Adj R-Sq is important because R-Squared never decreases when we add independent variables, even ones that are unrelated to wages. The Adj R-Sq corrects for this inflation by penalizing the number of variables included. We can see in this case that the Adj R-Sq value is the same as the R-Squared, 0.297.
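The adjustment can be made concrete with the standard formula for adjusted R-squared. Here $n$ is the number of observations used and $p$ the number of independent variables; both symbols are assumed notation, and $p = 4$ assumes each independent variable enters the model as a single term.

$$R^2_{adj} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1} = R^2 - \left(1 - R^2\right)\frac{p}{n-p-1}$$

With $R^2 = 0.297$, $n = 3987$, and $p = 4$, the penalty term is approximately

$$\left(1 - 0.297\right)\times\frac{4}{3982} \approx 0.0007$$

A penalty this small is consistent with the two values looking the same when reported to three decimal places. With only four predictors and nearly four thousand observations, the adjustment barely moves the estimate.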
D
The fourth table, Parameter Estimates, shows us the results from our regression analysis for each independent variable included. There are five rows of results: Intercept, age, sex, education, and language. The Intercept corresponds to our $\beta_0$ in the regression formula, which can be thought of as our ‘starting’ point on the graph.
For the columns, the Parameter Estimate column gives the unstandardized beta coefficient for each variable, which is often reported in studies and publications. In a linear regression, we interpret each coefficient as the expected change in the dependent variable wages for a one-unit increase in that independent variable, holding the other independent variables constant. Lastly, at the right end of the table, the Pr > |t| column gives the significance of each independent variable, which indicates whether that variable is a significant predictor of wages. We can see that all the independent variables except language (p = .6887) are significant predictors of wages.
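If standardized coefficients or confidence intervals are also needed for reporting, the MODEL statement can request them. The sketch below is a variation on the code used in this guide, not part of the original analysis; STB and CLB are PROC REG MODEL options, and the QUIT statement is added here because PROC REG otherwise stays active after RUN.

```sas
PROC REG DATA = SLID;
    /* STB adds standardized estimates; CLB adds 95% confidence limits */
    MODEL wages = age sex education language / STB CLB;
RUN;
QUIT;   /* ends the interactive PROC REG session */
```

Standardized estimates put all independent variables on the same scale, which makes it easier to compare their relative strength as predictors of wages.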