Stata: Stata 2 Workshop

A brief introduction to Stata

Data to Download

Throughout this research guide, GSS 2016 data are used for all analyses so that you can check each step of your own analysis.

The General Social Survey is a great set of social indicators for practicing analysis techniques while looking at topics of interest to scientists.

"Since 1972, the General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions." General Social Survey, 2018 Website

Website for more information: http://gss.norc.org/

Output

The final output from the workshop today!

Code

Do file of all code from the workshop. 
Note: You should download this file and save it to the desktop. You can only open it within Stata. Do not try to open it like a regular document as it will not work!

*******************************************
*******************************************
*******************************************
********* Stata 2: Data Analysis **********
*******************************************
*******************************************
*******************************************
 
***** Overview *****
* This do file will walk you through a quick overview of commands that you 
* should know: frequency distributions and summary statistics, 
* generating variables (advanced), analysis (chi-squared, ANOVA, and 
* regression), graphs, and navigation of help features within Stata. 
* The methods shown here are my preference and there are multiple 
* approaches to accomplish the same goals. 
 
* To run code, highlight the code you wish to run and press Control+D
 
* NOTE: This do file should not be distributed without the written 
* permission of Raeda Anderson, Ph.D 
***** Summary Statistics *****
* Summary statistics are the basic descriptive statistics for each 
* individual variable. Default summary statistics vary across statistical 
* platforms (e.g., SPSS, Stata, and SAS do not report the same set). 
* Stata's sum command reports the number of observations, mean, standard 
* deviation, minimum, and maximum 
* Summary statistics for one variable
sum age
*sum VariableName
* Summary statistics for multiple variables 
sum age colrac dwelown income marhomo 
* sum VariableName1 VariableName2 VariableName3 VariableName4
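* If you need more than the default statistics, one option (a minimal 
* sketch using the same age variable) is to add the detail option, which 
* also reports percentiles (including the median), variance, skewness, 
* and kurtosis
sum age, detail
* sum VariableName, detail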
* When you're running summary statistics I strongly recommend also running
* frequency distributions so you know what the variable values are and 
* the wording of the questions
* Frequency Distribution 
tab1 age colrac dwelown income marhomo 
* tab1 VariableName1 VariableName2 VariableName3 VariableName4
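* By default, frequency tables leave out missing values. As a sketch, adding 
* the missing option shows how many respondents have no valid answer, and 
* the nolabel option shows the numeric codes behind the value labels
tab1 colrac dwelown, missing
tab1 colrac dwelown, nolabel
* tab1 VariableName1 VariableName2, missing
* tab1 VariableName1 VariableName2, nolabel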
 
 
*******************************************
***** Generating Variables (Advanced) *****
*******************************************
 
***** Standardizing Variables *****
* Standardizing is a common variable manipulation technique where you take a 
* value then subtract the mean and divide by the standard deviation 
* newvalue = (old value-mean)/standard deviation 
* There are two ways to generate this variable. I am going to show you both 
* commonly used methods
 
* Option 1: Manually Entering Data* 
* Step 1: find mean and standard deviation of a variable 
 
sum age 
 
* sum VariableName
 
* Step 2: generate new variable using the values you pulled from the 
* summary statistics in Step 1
 
gen newage1= (age- 49.15576)/17.69279
 
* gen NewVariableName = (OldVariableName - mean)/standard deviation 
 
* Step 3: run the new variable to make sure the coding was correct. 
* Note: if the mean is approximately 0 (Stata may show a very small number 
* in scientific notation because of rounding) and the standard deviation 
* is 1, you coded it correctly
 
sum newage1
 
* sum NewVariableName 
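* A sketch of an alternative to typing the numbers by hand: sum stores its 
* results in r(mean) and r(sd), so you can use them directly in gen. 
* (newage1b is a hypothetical variable name, not part of the workshop data)
quietly sum age
gen newage1b = (age - r(mean))/r(sd)
sum newage1b
* quietly sum OldVariableName
* gen NewVariableName = (OldVariableName - r(mean))/r(sd)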
* Option 2: Using Stata's built-in function to standardize 
* Step 1: generate a new variable using the egen std() function
egen newage2 = std(age)
* egen NewVariableName = std(OldVariableName)
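* One possible check (a sketch; stddiff is a hypothetical variable name): 
* the two options should give nearly identical values, differing only by 
* the rounding of the hand-typed mean and standard deviation 
sum newage1 newage2
gen stddiff = abs(newage1 - newage2)
sum stddiff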
***** Collapsing Variables *****
* Sometimes we want to collapse people into larger categories for our 
* analysis. As always, make sure you have a strong
* theoretical, analytical, and modeling reason to generate new variables
* For this example we are going to collapse political party affiliation into 
* three categories: democrats, independents, and republicans (making 
* 'other party' missing)
*Step 1: Look at the codebook for the variable so you know how to recode it
codebook partyid
* codebook VariableName
* Step 2: Generate a new variable - we are going to make this a political 
* identification with democrat = 0, independent = 1, republican = 2
gen political = 0 if partyid ==0
replace political = 0 if partyid ==1
replace political = 1 if partyid ==2
replace political = 1 if partyid ==3
replace political = 1 if partyid ==4
replace political = 2 if partyid ==5
replace political = 2 if partyid ==6
 
* check to make sure we did the recode correctly 
tab1 partyid political 
* NOTE: because we did not pull in the 'other party' our number of 
* respondents dropped! Make sure you make these decisions with theory and 
* methods in mind. 
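* The same recode can be sketched in one line with the recode command. 
* This assumes partyid is coded 0-6 plus 7 = other party, which you should 
* confirm in the codebook output from Step 1. political2 and partylbl are 
* hypothetical names used only for this illustration
recode partyid (0 1 = 0) (2 3 4 = 1) (5 6 = 2) (else = .), gen(political2)
label define partylbl 0 "Democrat" 1 "Independent" 2 "Republican"
label values political2 partylbl
* the crosstab should only have cases on the diagonal if both versions agree
tab political political2, missing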
***** Generating Variables from Multiple Variables *****
* There are times when you will want to generate variables from a 
* combination of two or more variables. This may be as simple as an 
* interaction or very complex. As always, make sure you have a strong
* theoretical, analytical, and modeling reason to generate new variables
* Generate an interaction variable 
* Step 1: Generate a dichotomous variable for each of the two (or more) 
* variables. For this example we are going to generate an interaction for 
* people who attended college and have children 
* Generating a variable for people who attended college 
* First we need to look at the codebook for education so we know how
* the variable is coded so it can be edited 
codebook educ
* From the codebook we see that education is by years of 
* education. Since these are US adults we can assume everyone
* who has 12 years of education or less did not go to college
* and everyone who has 13 years or more of education did go 
* to college. So we are going to generate a variable where 0-12
* is high school or less (equal to 0) and 13-20 is some college
* or more (equal to 1)
gen college =0 if educ==0
replace college =0 if educ ==1
replace college =0 if educ ==2
replace college =0 if educ ==3
replace college =0 if educ ==4
replace college =0 if educ ==5
replace college =0 if educ ==6
replace college =0 if educ ==7
replace college =0 if educ ==8
replace college =0 if educ ==9
replace college =0 if educ ==10
replace college =0 if educ ==11
replace college =0 if educ ==12
replace college=1 if educ ==13
replace college=1 if educ ==14
replace college=1 if educ ==15
replace college=1 if educ ==16
replace college=1 if educ ==17
replace college=1 if educ ==18
replace college=1 if educ ==19
replace college=1 if educ ==20
* gen NewVariableName = value if OldVariableName == value
* replace NewVariableName = value if OldVariableName == value
tab educ 
tab college  
*tab OldVariableName
*tab NewVariableName
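* The gen and replace lines above can be sketched more compactly. A logical 
* expression returns 1 when true and 0 when false, and the if qualifier 
* leaves anyone outside the 0-20 range (including missing) as missing. 
* (college2 is a hypothetical variable name)
gen college2 = (educ >= 13) if inrange(educ, 0, 20)
* assert stops with an error if the two versions ever disagree
assert college2 == college
* gen NewVariableName = (OldVariableName >= cutoff) if inrange(OldVariableName, low, high)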
* Generating a variable for people who have kids
* First we need to look at the codebook for number of children 
* so we know how the variable is coded so it can be edited 
codebook childs
* From the codebook we see that number of children ranges from 
* 0 to eight or more. We are going to generate a variable that
* indicates if someone has any children. So we need to create a 
* dichotomous variable with no children (equal to 0) and one or 
* more children (equal to 1) 
gen children =0 if childs==0
replace children =1 if childs==1
replace children =1 if childs==2
replace children =1 if childs==3
replace children =1 if childs==4
replace children =1 if childs==5
replace children =1 if childs==6
replace children =1 if childs==7
replace children =1 if childs==8
* gen NewVariableName = value if OldVariableName == value
* replace NewVariableName = value if OldVariableName == value
* replace NewVariableName = value if OldVariableName == value
 
tab childs
tab children 
*tab OldVariableName
*tab NewVariableName
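* As a sketch, the same variable can be built in two lines with inrange(), 
* which is true when a value falls between the two limits (inclusive). 
* (children2 is a hypothetical variable name)
gen children2 = 0 if childs == 0
replace children2 = 1 if inrange(childs, 1, 8)
assert children2 == children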
* Generating an interaction variable for people who are college 
* educated and have children 
gen collegechildren= college*children 
* gen NewVariableName = OldVariable1 * OldVariable2
tab college children 
* check that you recoded correctly by running a crosstab 
* of the original variables 
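* A possible extra check (a sketch, not part of the original workshop code): 
* the interaction should equal 1 only when both variables equal 1, and it 
* is missing whenever either variable is missing, because multiplying by 
* a missing value gives a missing value
tab collegechildren, missing
assert collegechildren == 1 if college == 1 & children == 1
assert collegechildren == 0 if (college == 0 | children == 0) & !missing(college, children)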
 
* Generating a variable that calculates the difference between two 
* variables: 
* You can use any standard mathematical formula in Stata
* when generating new variables. For this example we are going to 
* calculate the difference between how many hours people worked last 
* week and how many hours their partner worked last week. 
* Note: positive values mean the respondent works more than their 
* partner, 0 values mean they work the same amount of hours, and a 
* negative number means their partner works more than the respondent
tab1 hrs1 cohrs1
* tab1 OldVariable1 OldVariable2
gen workdiff = hrs1- cohrs1
*gen NewVariableName = OldVariable1 - OldVariable2
tab workdiff 
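* A caution when using the new variable (a sketch; partnermore is a 
* hypothetical variable name): Stata treats missing values as larger than 
* any number, so a condition like workdiff > 0 also picks up respondents 
* with missing hours unless you exclude them
sum workdiff, detail
count if workdiff > 0 & !missing(workdiff)
gen partnermore = (workdiff < 0) if !missing(workdiff)
tab partnermore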
*******************************************
**************** Analysis *****************
*******************************************
 
***** Chi-Squared (used with a Crosstabulation) *****
* Crosstabulation is a basic analysis generally conducted with two variables
* to roughly estimate the pattern between given variables 
* JUST A CROSSTABULATION: 
tab colath degree
 
* tab VariableName1 VariableName2
* ADDING A CHI SQUARE TO THE CROSSTABULATION
tab colath degree, chi
*tab VariableName1 VariableName2, chi
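* Counts alone can be hard to read, so one option (a sketch) is to add 
* column percentages and expected counts, then display the chi-squared 
* statistic and p-value that tab stores in r(chi2) and r(p)
tab colath degree, chi2 column expected
display "chi-squared = " r(chi2) "   p-value = " r(p)
* tab VariableName1 VariableName2, chi2 column expected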
***** ANOVA *****
* I am going to walk you through a ton of ANOVA options- be sure to consult
* your statistical model assumptions to determine which ANOVA works best
* for your analysis. 
* Variables used for analysis 
* degree- Respondent's highest level of education
* colath - Should we allow an atheist to teach? 
* colhomo - Should we allow a homosexual person to teach?
* colmil - Should we allow a militarist to teach?
* colrac - Should we allow a racist to teach?
   
***** One-Factor ANOVA ***** 
* one factor * 
anova colath degree 
 
* anova DependentVariable IndependentVariable
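* anova tells you whether the group means differ overall but not which 
* groups differ. One possible follow-up (a sketch, not part of the original 
* workshop) is the oneway command, which can print group means and 
* Bonferroni-adjusted pairwise comparisons 
oneway colath degree, tabulate bonferroni
* oneway DependentVariable IndependentVariable, tabulate bonferroni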
 
***** Two-way ANOVA ***** 
* two factors *
anova colath degree colhomo 
 
* anova DependentVariable IndependentVariable1 IndependentVariable2
* two factors plus interaction *
anova colath degree colhomo degree#colhomo
* anova DependentVariable IndependentVariable1 IndependentVariable2 
* IndependentVariable1#IndependentVariable2 (note: code should be on
* one line)
 
***** or more simply ***** 
anova colath degree##colhomo
* anova DependentVariable IndependentVariable1##IndependentVariable2 
 
***** Three-way factorial ANOVA ***** 
* three way anova * 
anova colath degree colhomo colmil
* anova DependentVariable IndependentVariable1 IndependentVariable2 
* IndependentVariable3
* three way anova with two-way interactions *
anova colath degree colhomo colmil degree#colhomo colhomo#colmil degree#colmil
* anova DependentVariable IndependentVariable1 IndependentVariable2 
* IndependentVariable3 IndependentVariable1#IndependentVariable2
* IndependentVariable2#IndependentVariable3 
* IndependentVariable1#IndependentVariable3
*three way anova with all interactions 
anova colath degree##colhomo##colmil
 
***** Scalars stored by ANOVA *****
* CODE                EXPLANATION 
* e(N)                number of observations
* e(mss)              model sum of squares
* e(df_m)             model degrees of freedom
* e(rss)              residual sum of squares
* e(df_r)             residual degrees of freedom
* e(r2)               R-squared
* e(r2_a)             adjusted R-squared
* e(F)                F statistic
* e(rmse)             root mean squared error
* e(ll)               log likelihood
* e(ll_0)             log likelihood, constant-only model
* e(ss_#)             sum of squares for term #
* e(df_#)             numerator degrees of freedom for term #
* e(ssdenom_#)        denominator sum of squares for term # (when using
*                     nonresidual error)
* e(dfdenom_#)        denominator degrees of freedom for term # (when
*                     using nonresidual error)
* e(F_#)              F statistic for term # (if computed)
* e(N_bse)            number of levels of the between-subjects error term
* e(df_bse)           degrees of freedom for the between-subjects error
*                     term
* e(box#)             Box's conservative epsilon for a particular
*                     combination of repeated variables (repeated()
*                     only)
* e(gg#)              Greenhouse-Geisser epsilon for a particular
*                     combination of repeated variables (repeated()
*                     only)
* e(hf#)              Huynh-Feldt epsilon for a particular combination of
*                     repeated variables (repeated() only)
* e(rank)             rank of e(V)
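 
* A sketch of how these stored results can be used: after running an anova, 
* ereturn list shows everything that was saved, and display prints 
* individual values. The p-value line uses Stata's Ftail() function with 
* the model and residual degrees of freedom
anova colath degree
ereturn list
display "R-squared = " e(r2)
display "F = " e(F) "   p-value = " Ftail(e(df_m), e(df_r), e(F))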
 
***** Regression *****
* Simple regression is quick in Stata. List the variables in the 
* following order: dependent variable, then independent variables, 
* then control variables 
regress wordsum educ 
*regress DependentVariable IndependentVariable
regress wordsum educ maeduc paeduc speduc  
*regress DependentVariable IndependentVariable ControlVariable1 
*ControlVariable2 ControlVariable3 (note: code should be on one line)
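* Categorical predictors need the i. prefix so that Stata enters them as a 
* set of indicator variables rather than as a single number. A sketch using 
* the political variable generated earlier in this do file (democrat is 
* the omitted reference category because it has the lowest value)
regress wordsum educ i.political
display "Coefficient on educ = " _b[educ]
display "R-squared = " e(r2)
* regress DependentVariable IndependentVariable i.CategoricalVariable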
 
*******************************************
***************** Graphs ******************
*******************************************
 
***** Pie chart *****
graph pie, over(degree)
* graph pie, over(VariableName)
***** Histogram *****
histogram age
*histogram VariableName
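* A sketch of commonly used histogram options: percent puts percentages on 
* the y axis, normal overlays a normal curve, and title() adds a title 
* (the title text here is only an example)
histogram age, percent normal title("Age of respondents")
* histogram VariableName, percent normal title("Your title")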
***** Bar chart *****
graph bar, over(degree)
*graph bar, over(VariableName)
***** Box Plot *****
* one variable 
graph box childs
*graph box VariableName
* two or more variables 
graph box childs chldidel
* graph box VariableName1 VariableName2
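* To compare groups, one option (a sketch) is to add over() to the box plot 
* and then save the graph to a file with graph export. The file name 
* childs_by_degree.png is hypothetical, and the file will be saved in your 
* current working directory 
graph box childs, over(degree)
graph export "childs_by_degree.png", replace
* graph box VariableName, over(GroupVariableName)
* graph export "FileName.png", replace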