
Stata

A brief introduction to Stata

Linear Regression

A linear regression is one type of regression test used to analyze the direct association between a continuous dependent variable and one or more independent variables, which can be at any level of measurement (nominal, ordinal, interval, or ratio). A linear regression tests how the mean of the dependent variable changes with the predictors included in our model, the independent variable(s).

In the example below, our research question is:

What are the predictors of individuals' wages in the dataset?

We are going to include the variables age, sex, education, and language in our model to test their direct association with wages. Let’s break the variables down a bit more to better understand our linear regression model.

Below is a breakdown of the variables included in our model to help us keep track of the types of variables we are working with.

Dependent variable

Wages of respondent (wages). This is a continuous variable that ranges from 2.30 to 49.92, which is a large range! If you would like to investigate this variable further, use the code for descriptive statistics to better understand its distribution, which is very important for a linear regression model.
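The descriptive statistics mentioned above could be obtained along the following lines. This is a minimal sketch, assuming the dataset is already open in Stata and that the dependent variable is stored under the name wages, as in the variable list.

```stata
* Assumes the dataset is already loaded and the variable is named wages
* Mean, standard deviation, percentiles, skewness, and kurtosis
summarize wages, detail

* Histogram with a normal curve overlaid to eyeball the distribution
histogram wages, normal
```

A strongly skewed histogram would be one sign that the distribution deserves a closer look before the regression is run.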

Independent variables

  1. Age of respondent in years (age). This is a continuous level variable measuring the age of each respondent.
  2. Sex of respondent (sex). This is a nominal level variable measuring the sex of each respondent and is coded as 1 = Female and 2 = Male.
  3. Education of respondent in years (education). This is a continuous level variable measuring the number of years of education each respondent has.
  4. Language of respondent (language). This is a nominal level variable measuring the language that each respondent speaks. Language is coded as 1 = English, 2 = French, and 3 = Other.
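Before building the model, it can help to confirm how the independent variables are actually coded. The commands below are a short sketch, assuming the variable names listed above and a dataset that is already loaded.

```stata
* Frequencies for the two nominal variables
tabulate sex
tabulate language

* Range and mean of the two continuous predictors
summarize age education

* Storage type, value labels, and missing counts for the nominal variables
codebook sex language
```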

Formula

There are two formulas below: a general linear regression formula and the specific formula for our example.

 

Formula 1 below is a general linear regression formula that does not specify our variables and is a good starting place for building a linear regression model.

  1. $Y_i=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_kx_k+\varepsilon$

 

Formula 2 is specific to our analysis that includes our dependent variable wages and our independent variables age, sex, education, and language.

  2. ${wages}_i=\beta_0+age(x_1)+sex(x_2)+education(x_3)+language(x_4)+\varepsilon$

 

In this guide we will focus on formula 2 to further break down our linear regression test. Here, ${wages}_i$ is the dependent variable of the model: the wages of a specific observation $i$, which we are predicting with four independent variables. This is equal to $\beta_0$, the intercept of the model, where our regression line intersects the y axis when every $x$ is zero. We can think of $\beta_0$ as the starting wage value for the observations in the dataset. Next, $age(x_1)$ is the variable age multiplied by its calculated regression coefficient, which is added to $\beta_0$. The same goes for $sex(x_2)$, $education(x_3)$, and $language(x_4)$: the remaining independent variables, sex, education, and language, are each multiplied by their calculated coefficients in the model. Lastly, $\varepsilon$ is the error term of the regression formula, which is the distance of each point ($i$) from the predicted regression line. We want to minimize this distance between our points and the regression line to obtain the best fit to our observed points.
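Written with explicit coefficients, formula 2 and the "best fit" idea can be sketched as follows. The symbols $\beta_1$ through $\beta_4$ and the hat notation do not appear in the formulas above; they are added here only for illustration. Ordinary least squares, the method behind a linear regression, chooses the coefficients that make the sum of squared distances between observed and predicted wages as small as possible.

```latex
\begin{aligned}
wages_i &= \beta_0 + \beta_1\,age_i + \beta_2\,sex_i + \beta_3\,education_i + \beta_4\,language_i + \varepsilon_i \\
\widehat{wages}_i &= \hat{\beta}_0 + \hat{\beta}_1\,age_i + \hat{\beta}_2\,sex_i + \hat{\beta}_3\,education_i + \hat{\beta}_4\,language_i \\
(\hat{\beta}_0,\ldots,\hat{\beta}_4) &= \arg\min_{\beta_0,\ldots,\beta_4} \sum_{i=1}^{n} \left( wages_i - \beta_0 - \beta_1\,age_i - \beta_2\,sex_i - \beta_3\,education_i - \beta_4\,language_i \right)^2
\end{aligned}
```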

Below is the Stata code for our linear regression. 

Code

regress wages age sex education language

 

We use the command regress to tell Stata we are building a linear regression model. We indicate that wages is our dependent variable by placing it first in the list of variables. All variables that follow wages are our independent variables, in the specified order of age, sex, education, and language.
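Because sex and language are nominal variables, one alternative worth knowing is Stata's factor-variable notation, which enters each category as its own indicator instead of treating the numeric codes as a scale. The sketch below is not the model reported in this guide, and its output would differ: language, for example, would appear as separate rows for French and Other, each compared with English.

```stata
* Alternative specification (NOT the model whose output is discussed here)
* The i. prefix tells Stata to treat sex and language as categorical
regress wages age i.sex education i.language
```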

Output


 

A

In the first output section to the right, Stata provides an overall summary of our regression model. We find the model fit statistics to judge how well our independent variables explain the variance of wages. Starting from the top row and moving down, we will go through each line of this section.

First, we have 3,987 observations included in this analysis after listwise deletion, meaning that any respondent missing a value on one or more of the variables in the model is dropped.

Next, we have the results from the ANOVA (F) test, which tests the statistical significance of the overall regression model. The F statistic (421.09) and its model degrees of freedom (4) are reported in the second row, with the significance (Prob > F = 0.0000) reported below that. A displayed value of 0.0000 means the p-value is smaller than .0001, far below our .05 cutoff point. We can interpret this as our regression model being statistically significant: taken together, our independent variables explain significantly more of the variance in wages than a model with no predictors, so what we are examining ‘matters’.
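To see where the number of observations comes from, the missing values on the model variables can be inspected directly. This is a sketch under the assumption that all five variables are numeric and named as in the variable list.

```stata
* Report how many values are missing on each model variable
misstable summarize wages age sex education language

* Count respondents with complete data on all five variables
count if !missing(wages, age, sex, education, language)

* The same number is stored by regress as e(N)
quietly regress wages age sex education language
display e(N)
```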
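The overall test can also be pulled from the results that regress stores after it runs. The sketch below, which assumes the variable names used in this guide, shows the F statistic, its degrees of freedom, and the unrounded p-value behind the displayed 0.0000.

```stata
quietly regress wages age sex education language

* F statistic with its model and residual degrees of freedom
display "F = " e(F)
display "model df = " e(df_m) ", residual df = " e(df_r)

* Upper-tail probability of the F distribution: the Prob > F value
display "Prob > F = " Ftail(e(df_m), e(df_r), e(F))
```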

The next two lines, labeled R-squared and Adj R-squared, are used to judge our model fit. An R-squared value of 0.297 is interpreted as the variables age, sex, education, and language explaining 29.70 percent of the variance of individuals' wages in this dataset. This is a high value! However, we also need to look at the Adj R-squared, which accounts for the number of independent variables in our model. Adj R-squared is important because R-squared never decreases when we add independent variables to the model, even ones that have little to do with wages. The Adj R-squared adjusts for this inflation from the number of variables included. We can see that in this case the Adj R-squared value rounds to the same value as the R-squared, 0.297.
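The adjustment can be reproduced by hand with the standard formula, Adj R-squared = 1 - (1 - R-squared)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The sketch below assumes the variable names used in this guide and relies on the results regress stores after estimation.

```stata
quietly regress wages age sex education language

display "R-squared     = " e(r2)
display "Adj R-squared = " e(r2_a)

* Same adjustment by hand; e(df_r) is the residual df, n - k - 1
display 1 - (1 - e(r2)) * (e(N) - 1) / e(df_r)
```

With 3,987 observations and 4 independent variables, the ratio (n - 1)/(n - k - 1) is 3986/3982, or about 1.001, which is why the two values are so close here.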

 

B

The second section of the output shows the calculated model fit measures, such as the SS (sum of squares) for the Model and for the Residual (the amount of error in the model). These metrics are helpful in understanding the regression line in comparison to the data points. Whether these are reported is discipline specific, and we will not go through them here as they are not always used.
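Although these figures are not always reported, they connect directly to the R-squared discussed earlier: R-squared is the Model sum of squares divided by the total (Model plus Residual). A brief sketch, assuming the variable names used in this guide:

```stata
quietly regress wages age sex education language

display "Model SS    = " e(mss)
display "Residual SS = " e(rss)

* Share of total variation explained by the model; matches e(r2)
display "R-squared   = " e(mss) / (e(mss) + e(rss))
```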

 

C

The bottom table shows us the results from our regression analysis for each independent variable included. There are five rows of results: age, sex, education, language, and _cons (the constant, which is the intercept in the formula above). The _cons corresponds to our $\beta_0$ in the regression formula, which can be thought of as our ‘starting’ point on the graph.

For the columns, the first is Coef., which holds our unstandardized beta coefficients, the values most often reported in studies and publications. In a linear regression, we interpret each coefficient as the change in the dependent variable, wages, associated with a one-unit increase in that independent variable, holding the other independent variables constant. Lastly, on the right end of the table, the column P>|t| gives the significance of each independent variable, which indicates whether it is a significant predictor of wages. We can see that all the independent variables except language (p = .689) are significant predictors of wages.
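To make the coefficients concrete, they can be combined into a predicted wage for a single respondent. The values below (age 30, sex coded 1, 12 years of education, language coded 1) are hypothetical and chosen only for illustration; the sketch assumes the variable names used in this guide.

```stata
quietly regress wages age sex education language

* Individual coefficients are available as _b[varname]
display "age coefficient = " _b[age]
display "intercept       = " _b[_cons]

* Predicted wage, with a confidence interval, for a hypothetical respondent
lincom _cons + 30*age + 1*sex + 12*education + 1*language

* Test that the language coefficient is zero; this F test should give the
* same p-value as the P>|t| entry for language in the regression table
test language
```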