
Stata

A brief introduction to Stata

Methods for Creating and Transforming Variables

Generating variables in Stata is simple, especially when the new variable is based on an existing one. Researchers often create a copy of a variable before changing or recoding it, so that the original data are preserved. No formula is needed to make a copy: the operation works like “copy” and “paste”.

Below is the code for generating the variable age1 from the existing variable age.

Code

gen age1 = age

 

The gen command (short for generate) creates the new variable age1 from the existing variable age.

Below is the output for generating a new variable that is a copy of already existing data. 

Output

gen age1 = age

 

The output simply echoes the command. When Stata echoes a command without an error message, the code ran without errors. Another way to check is to open the Data Editor (or Data Browser) window in Stata and look for the new variable age1 as an added column.
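The copy can also be checked from the command line. The sketch below is a minimal illustration that uses a small made-up dataset (the four age values are assumptions, not the data used in this guide) so that it runs on its own.

```stata
* Build a tiny illustrative dataset (hypothetical values, one of them missing).
clear
input age
23
35
.
61
end

* Copy age into a new variable, as in the example above.
gen age1 = age

* assert stops with an error if any observation fails the condition.
* In Stata a missing value equals another missing value, so the
* observation with missing age passes as well.
assert age1 == age

* Show both columns side by side.
list age age1
```

If value labels and display formats should be copied along with the values, Stata's clonevar command (clonevar age1 = age) is an alternative to gen.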

We can also generate new variables that are transformed from other variables in the dataset. This is helpful if we want to collapse a variable from a higher level of measurement to a lower level of measurement, such as continuous to categorical.

Below is the code for generating the new variable highschool, which recodes the continuous variable education into two categories.

Code

gen highschool = . 
replace highschool = 0 if education < 12
replace highschool = 1 if education >= 12
replace highschool = . if education == .

 

The gen command first creates highschool with every value set to missing (.). The first replace command sets highschool to 0 where education is less than 12, and the second sets it to 1 where education is 12 or greater.

The last line of code makes sure all missing values in education remain missing in highschool. It is needed because Stata treats a missing value as larger than any number, so the condition education >= 12 is also true when education is missing.

Output

The output for this example is similar to the previous one. Stata echoes each command and reports the number of missing values generated by the gen command and the number of changes made by each replace command. The absence of an error message tells us the code ran without errors. Another way to check is to open the Data Editor (or Data Browser) window in Stata and look for the new variable highschool as an added column.

 

. gen highschool = .
(7,425 missing values generated)

. replace highschool = 0 if education < 12
(2,539 real changes made)

. replace highschool = 1 if education >= 12
(4,886 real changes made)

. replace highschool = . if education == .
(249 real changes made, 249 to missing)
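The same recode can be written in one line, which avoids setting missing values to 1 and then changing them back. The sketch below is one possible alternative, not the method used above, and it runs on a small made-up dataset (the five education values are assumptions).

```stata
* Build a tiny illustrative dataset (hypothetical values, one of them missing).
clear
input education
8
11
12
16
.
end

* The logical expression evaluates to 1 when true and 0 when false.
* The if qualifier skips observations where education is missing,
* so highschool stays missing for them.
gen highschool = (education >= 12) if !missing(education)

* Cross-tabulate the original and recoded values, including missing,
* to confirm that each education value landed in the intended category.
tabulate education highschool, missing
```

The tabulate check works equally well after the four-line version of the recode.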

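Standardizing a variable converts its raw values to Z scores, which have a mean of 0 and a standard deviation of 1. This is often done to put variables measured in different units on a common scale. Standardizing does not change the shape of a variable's distribution, so it does not make a non-normal variable normal. In this case, we are standardizing the variable age in years to Z scores.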

Below is the code that creates a new variable called agestandard, which holds the standardized Z scores of age.

Code

egen agestandard = std(age)

 

The egen command (short for “extensions to generate”) creates the variable agestandard and sets it equal to the standardized values of age, computed by the std() function.

Note: we use egen rather than gen because std() is an egen function, and egen functions cannot be used with gen. The std() function subtracts the mean of age from each value and divides by the standard deviation, so agestandard has a mean of 0 and a standard deviation of 1.

Output

egen agestandard = std(age)

 

Stata echoes the command without an error message, which tells us the code ran without errors. Another way to check is to open the Data Editor (or Data Browser) window in Stata and look for the new variable agestandard as an added column.
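To see what std() does, the Z scores can also be computed by hand and compared with the egen result. The sketch below is a minimal illustration on a small made-up dataset; the four age values and the variable name agez are assumptions introduced for this example.

```stata
* Build a tiny illustrative dataset (hypothetical values).
clear
input age
20
30
40
50
end

* Standardize with egen, as in the example above.
egen agestandard = std(age)

* summarize stores the mean in r(mean) and the standard deviation in r(sd).
summarize age

* Compute the Z score by hand: (value - mean) / standard deviation.
gen agez = (age - r(mean)) / r(sd)

* Both variables should show a mean of about 0 and a standard deviation of about 1.
summarize agestandard agez
```

Tiny differences in the last decimal places are possible because both variables are stored as floats.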