
Stata: Stata 1 Workshop

A brief introduction to Stata

Data to Download

Throughout this research guide, GSS 2016 data are used for all analyses so that you can check each step of your own work against the results shown here.

The General Social Survey is a great set of social indicators for practicing analysis techniques while looking at topics of interest to scientists.

"Since 1972, the General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions." General Social Survey, 2018 Website

Website for more information:


Brief presentation about Stata's abilities, with links to places to go for Stata help.


The final output from the workshop today!


Do file of all code from the workshop. 
Note: You should download this file and save it to the desktop. You can only open it within Stata. Do not try to open it like a regular document as it will not work!

****** Stata 1: Introduction to Stata ******
***** Overview *****
* This do file will walk you through opening data, cleaning data, and basic 
* analysis within Stata. The methods shown here are my preference and there are 
* multiple approaches to accomplish the same goals. 
* To run code, highlight the code you wish to run and press Control+d
* NOTE: This do file should not be distributed without the written 
* permission of Raeda Anderson, Ph.D 
******** Getting to Know Your Data ********
***** Open Data File  *****
* Opening a data file is a rather simple process. You simply use the command 
* use with the data file location in " ". Be sure to include the .DTA extension.
* NOTE: the following code will not work until you update the path location
use "C:\Users\randerson39\Documents\Stata Crash Course\GSS2016.DTA"
* use "data path file"
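* A possible alternative, sketched under the assumption that the data file
* sits in a single working folder: set the working directory once with cd,
* then open the file by name. The folder path is a placeholder you must
* fill in, so both lines are left commented out. The clear option drops
* any data already in memory before the file is loaded.
* cd "folder path"
* use "GSS2016.DTA", clear
* Once the data are open, the following reports the number of observations
* and variables as a quick check that the file loaded.
describe, short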
***** Looking at the Codebook Information for a Variable *****
* If the data within Stata is complete, there will be information on the 
* variable properties (similar to the information that would be contained 
* within a codebook) 
codebook cappun degree 
*codebook VariableName1 VariableName2
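* Two related commands can give a quicker overview; this is a short sketch
* using the same two variables. describe shows the storage type and the
* labels, and the compact option of codebook prints one line per variable.
describe cappun degree
codebook cappun degree, compact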
** Freq. Distributions & Crosstabulations **
***** Running a Frequency Distribution *****
* Running a frequency distribution reports the frequency, percent, and 
* cumulative percent for each value. Missing values are left out by default, 
* so the percents shown are valid percents. 
tab cappun
* tab VariableName
tab1 cappun degree
* tab1 VariableName1 VariableName2 VariableNameN
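* As a sketch of two options that often help when checking a variable:
* missing adds the missing values to the table as their own rows, and
* nolabel shows the numeric codes in place of the value labels.
tab cappun, missing
tab cappun, nolabel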
***** Running a Crosstabulation *****
* Crosstabulation is a basic analysis generally conducted with two variables
* to roughly estimate the pattern between given variables 
tab cappun degree
* tab VariableName1 VariableName2
tab cappun degree, chi
*tab VariableName1 VariableName2, chi
tab cappun degree, col
*tab VariableName1 VariableName2, col 
tab cappun degree, chi col 
* tab VariableName1 VariableName2, chi col 
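* What the options do: chi adds Pearson's chi-squared test of independence
* below the table, and col adds column percentages so that each column
* sums to 100. A brief sketch of two further options: row gives row
* percentages instead, and nofreq hides the raw counts so that only the
* percentages are shown.
tab cappun degree, row
tab cappun degree, col nofreq chi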
****** Generating/Labeling Variables ******
***** Generating a Variable - No pre-existing variable *****
* In your analysis you may need to generate a variable that is constant across
* all respondents. I most commonly generate a variable like this to indicate
* which wave of data this data file contains before merging databases
gen wave1=1 
* gen VariableName=Value
***** Generating a Variable- Equal to a pre-existing variable *****
* If you need to make a copy of a variable that is an exact duplicate of an 
* existing variable use the following code. I often use this option to generate
* a variable that I can later manipulate (collapse, take the average, etc)
gen overallhappy = happy 
* gen NewVariableName = OldVariableName 
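* Note that gen copies the values only; the variable label and value labels
* of happy are not carried over. As a sketch of one alternative, clonevar
* copies the values together with the labels. The name overallhappy2 is
* hypothetical and is used only for this illustration.
clonevar overallhappy2 = happy
tab1 overallhappy overallhappy2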
***** Generating a Variable- Changing a pre-existing variable *****
* The form a variable takes in a survey/database is often different from the 
* form we need for our analysis. One of the easiest ways to make these changes 
* is with gen or egen. 
* CALCULATED VARIABLE- I use this most frequently with age. As someone 
* who studies older adults I find it important to discuss their age as 
* 'one year older' or something similar. So, I am going to walk you 
* through how to do just that. 
* First we need to find out the minimum age of respondents in the data.
tab age 
* Second we need to generate a new variable of age where the youngest
* person is 0 years old. 
gen newage = age-18
* gen NewVariableName = OldVariable - amount
* Note: Stata will allow you to use common mathematical symbols such 
* as the following 
* addition: gen NewVar1 = OldVar1+OldVar2
* subtraction: gen NewVar1 = OldVar1-OldVar2
* multiplication: gen NewVar1 = OldVar1*OldVar2
* division: gen NewVar1 = OldVar1/OldVar2
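* The minimum does not have to be read off the table by hand. A minimal
* sketch: summarize stores the smallest value in r(min), which can be used
* directly in gen. The name newage2 is hypothetical and is used here only
* so the result can be compared with newage.
summarize age
gen newage2 = age - r(min)
summarize newage newage2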
* VARIABLE USING SPECIFIC SUBGROUP- I use this most frequently when I 
* need to analyze a group of people within a study. For this example we 
* are going to generate variables that represent (1) females, (2) black 
* people, and (3) black females
* Generating a variable for female 
* First we need to look at the codebook for gender so we know how
* the variable is coded so it can be edited 
codebook sex
* From the codebook we see that females are '2' and males are '1'
* to generate the female variable we are going to use "if" coding. 
* We will say if sex is equal to 2, then we want female to be equal
* to 1. If sex is equal to 1, then we want female to be equal to 0. 
* The result will be a dummy variable where female=1 and male=0.
gen female=1 if sex==2
replace female=0 if sex==1
* gen NewVariableName = value if OldVariableName == value
* replace NewVariableName = value if OldVariableName == value
tab female 
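* The same dummy can be built in one line. This is a sketch of a common
* alternative, not a replacement for the code above. The expression
* (sex==2) evaluates to 1 when true and 0 when false, and the if condition
* keeps respondents with a missing value on sex as missing. The name
* female2 is hypothetical and is used only for this comparison.
gen female2 = (sex==2) if !missing(sex)
* Crossing the original and new variables confirms the coding
tab sex female2, missing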
* Generating a variable for black people
* First we need to look at the codebook for race so we know how
* the variable is coded so it can be edited 
codebook race
* From the codebook we see that black is equal to 2, white is 
* equal to 1, and other races are equal to 3. Thus we need to 
* generate a variable where 1= black and 0= all other races. 
gen black=1 if race==2
replace black=0 if race==1
replace black=0 if race==3
* gen NewVariableName = value if OldVariableName == value
* replace NewVariableName = value if OldVariableName == value
* replace NewVariableName = value if OldVariableName == value
tab black 
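* recode can do the three steps at once. A short sketch, where black2 is a
* hypothetical name: race 2 becomes 1, race 1 and 3 become 0, and missing
* values stay missing. 
recode race (2=1) (1 3=0), gen(black2)
* Crossing the original and new variables confirms the coding
tab race black2, missing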
* Generating a variable for black women 
* This is done by generating an interaction variable of women and 
* black respondents. 
gen blackfemale = black*female
* gen NewVariableName = OldVariableName1*OldVariableName2
tab blackfemale
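* Because a product is missing when either part is missing, blackfemale is
* missing for anyone who is missing on black or on female. One way to check
* the result is to compare it with the crosstab of the two dummies: the
* cell where black=1 and female=1 should match the count of blackfemale=1.
tab black female, missing
count if black==1 & female==1
count if blackfemale==1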
***** Labeling Variables *****
* We have generated the following new variables that need labels
* newage - respondent age 0=18, 1=19, 2=20, etc. 
* female - female=1, male=0
* black- black=1, other race=0
* blackfemale- black female=1, non black female=0
* new age
* for data where the response is a number we only need to 
* label the variable
label variable newage "age with 0=18 years old"
* label variable Variable1 "label for new variable"
tab newage
* female, black, black female
* for data where the responses are words we need to generate value
* labels so when we run analysis we know what each value represents
* this is a two step process (1) label the variable (2) label the values
label variable female "Female- Dummy Variable"
label variable black "Black- Dummy Variable"
label variable blackfemale "Black Female- Dummy Variable"
* label variable Variable1 "label for new variable"
label define female1 0 "male" 1 "female"
label values female female1
label define black1 0 "minority or white" 1 "black"
label values black black1
label define blackfemale1 0 "non-black minority, white, and/or male" 1 "black female"
label values blackfemale blackfemale1
* label define VariableForLabel 0 "what 0 represents" 1 "what 1 represents" 
* label values Variable1 VariableForLabel
tab1 female black blackfemale
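* To confirm the labels were attached as intended, a brief sketch: 
* label list prints the contents of each value label, and describe shows
* the variable label and which value label is attached to each variable.
label list female1 black1 blackfemale1
describe newage female black blackfemale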