Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3




Housekeeping

• Lab 1 cleanup

• Computer and software issues

• Change final session from 11/29 to 12/1 (Thursday instead of Tuesday)

• Change schedule – Excel NEXT session


Today...

• What we did last week, and why it was unrealistic
• What does “data cleaning” mean?
• How to generate a variable
• How to manipulate the data in your new variable
• How to label variables and otherwise document your work
• Examples

Last time…

• What was unrealistic?
  – The dataset came as a Stata .dta file
  – The variables were ready to analyze
  – Most variables were labeled

• I.e., the data was “clean”

How your data will arrive

• On paper forms

• In a text file (comma or tab delimited)

• In Excel

• In Access

• In another data format (SAS, etc.)


Importing into Stata

• Options:
  – Cut and paste
  – insheet, infile, fdause, and other flexible Stata commands
  – A convenience program like “Stat/Transfer”


Importing into Stata

• Make sure it worked
  – Look at the data
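A minimal sketch of importing and then checking a comma-delimited file. The file name example_raw.csv and its three columns are made up for illustration; the block writes the file itself so that it runs on its own. insheet is the command named on the slide; in Stata 13 and later, import delimited is the newer equivalent.

```stata
* Write a tiny comma-delimited file so the example is self-contained
* (example_raw.csv and its columns are hypothetical)
capture file close fh
file open fh using "example_raw.csv", write text replace
file write fh "id,sex,gestage" _n
file write fh "1,m,38" _n
file write fh "2,f,0" _n
file write fh "3,f,41" _n
file close fh

* Import it: comma names the delimiter, clear drops any data already in memory
insheet using "example_raw.csv", comma clear

* Make sure it worked: look at the data
describe
list
```

If the variable names, types, or number of observations look wrong at this point, fix the import before doing anything else.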


Importing into Stata

• Example – neonatal opiate withdrawal data


Exploring your data

• Figure out what all those variables mean

• Options
  – browse, describe, summarize, list in Stata
  – Refer to a data dictionary
  – Refer to a data collection form
  – Guess, or ask the person who gave it to you
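The Stata commands listed above might be used like this on a freshly imported file. The data, the variable names, and the 20 to 45 week "plausible" range are assumptions made up for the sketch. browse opens the Data Editor window, so it is left out of the script.

```stata
* Hypothetical data standing in for a freshly imported file
clear
input id str1 sex gestage
1 "m" 38
2 "f" 0
3 "f" 60
4 "m" .
end

describe                  // variable names, storage types, labels
summarize gestage         // N, mean, min, max: nonsense values show up in min/max
tabulate sex, missing     // frequency of each value, including missing
list if !inrange(gestage, 20, 45)   // rows outside the assumed range (missing is listed too)
```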


Exploring your data

• Example: Neonatal opiate withdrawal data


• Problems arise…
  – Sex is m/f, not 1/0
  – Gestational age has nonsense values (0, 60)
  – Breastfeeding has a bunch of weird text values
  – Drug variables coded y or blank
  – Many variable names are obscure


Cleaning your data

• You must “clean” your data so it is ready to analyze.


Cleaning your data

• Cleaning tasks
  – Check for consistency and clean up nonsense data and outliers
  – Deal with missing values
  – Code all dichotomous variables 1/0
  – Categorize variables meaningfully (for Table 1, etc.)
  – Derive new variables
  – Rename variables
    • With common sense, or with a consistent scheme
  – Label variables
  – Label the VALUES of coded variables


Cleaning your data

• The importance of documentation
  – Retracing your steps

• Document every step using a “do” file

Data cleaning
Basic skill 1 – make a new variable

• Creating new variables

generate newvar = expression

• An expression can be:
  – A number (constant): generate allzeros = 0
  – A variable: generate ageclone = age
  – A function: generate agesqrt = sqrt(age)
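The three kinds of expression above, run on a small made-up age variable (the ages are assumptions chosen only so the output is easy to check):

```stata
* Hypothetical data: four observations, one with missing age
clear
input age
25
49
64
.
end

generate allzeros = 0          // a constant
generate ageclone = age        // a copy of another variable
generate agesqrt  = sqrt(age)  // a function; missing age gives missing agesqrt
list
```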

Data cleaning
Basic skill 1 – make a new variable

• Getting rid of a variable

drop var

• Getting rid of observations

drop if boolean exp
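A short sketch of both forms of drop, on made-up data (variable names and the age cutoff are assumptions). drop only changes the data in memory; the file on disk is untouched unless you save over it.

```stata
* Hypothetical data
clear
input id age str1 sex
1 25 "m"
2 16 "f"
3 70 "f"
4  . "m"
end

generate scratch = age * 2   // a throwaway variable
drop scratch                 // remove a variable (a column)

drop if age < 18             // remove observations (rows) where the condition is true
drop if missing(age)         // remove rows with missing age
list
```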

Data cleaning
Basic skill 2 – manipulating the values

• Changing the values of a variable

replace var = exp [if boolean exp]

• A boolean expression evaluates to true or false for each observation

Data cleaning
Basic skill 2 – manipulating the values

• Examples

generate male = 0
replace male = 1 if sex=="male"

generate ageover50 = 0
replace ageover50 = 1 if age>50

generate complexvar = age
replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)
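One thing to watch with the generate-then-replace pattern: Stata treats a missing value (.) as larger than any number, so age>50 is true when age is missing. The sketch below uses made-up data, and the name ageover50_b is only for illustration; it shows one possible way to keep the indicator missing when age is missing.

```stata
* Hypothetical data: the third observation has missing age
clear
input age str6 sex
34 "male"
67 "female"
 . "male"
end

* Pattern from the slide: every row starts at 0, then is set to 1 if age>50
generate ageover50 = 0
replace  ageover50 = 1 if age > 50
* Row 3 is now coded 1, because missing compares as larger than 50

* One alternative: evaluate the condition only where age is known
generate ageover50_b = (age > 50) if !missing(age)
list
```

Whether a missing age should become 0, 1, or missing is a decision about the data; the point is to make it on purpose.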

Data cleaning
Basic skill 2 – manipulating the values

• Logical operators for boolean expressions:

English                  Stata
Equal to                 ==
Not equal to             !=, ~=
Greater than             >
Greater than/equal to    >=
Less than                <
Less than/equal to       <=
And                      &
Or                       |
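When & and | appear in the same condition, Stata evaluates & first, so parentheses matter. A small sketch on made-up data (the variables and cutoffs are assumptions) that counts how many observations satisfy each version:

```stata
* Hypothetical data
clear
input age male
25 1
25 0
40 0
end

* Without parentheses this is read as: age>30 | (male==1 & age<20)
count if age > 30 | male == 1 & age < 20

* With parentheses the "or" is evaluated first
count if (age > 30 | male == 1) & age < 20
```

On these three rows the first condition matches one observation and the second matches none.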

Data cleaning
Basic skill 2 – manipulating the values

• Mathematical operators:

English            Stata
Add                +
Subtract           -
Multiply           *
Divide             /
To the power of    ^
Natural log of     ln(expression)
Base 10 log of     log10(expression)
Etcetera…
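A brief sketch combining several of these operators to derive new variables. The variables weightkg and heightm and their values are made up for illustration.

```stata
* Hypothetical data: weight in kg, height in m
clear
input weightkg heightm
70 1.75
90 1.80
end

generate bmi      = weightkg / heightm^2   // divide, to the power of
generate lnweight = ln(weightkg)           // natural log
generate lgweight = log10(weightkg)        // base 10 log
list
```

ln() and log10() return missing for zero or negative values, so it is worth checking for newly created missing values afterwards.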

Data cleaning
Basic skill 2 – manipulating the values

• Another way to manipulate data

recode var (oldvalue1=newvalue1) [(oldvalue2=newvalue2) ...] [if boolean exp]

• A more complicated, but more flexible, command than replace

Data cleaning
Basic skill 2 – manipulating the values

• Examples

generate male = 0
recode male (0=1) if sex=="male"

generate raceethnic = race
recode raceethnic (1=6) if ethnic=="hispanic"
(equivalent to: replace raceethnic = 6 if ethnic=="hispanic" & race==1)

generate tertilescac = cac
recode tertilescac (min/54=1) (55/82=2) (83/max=3)
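The tertile example can also be written in one step with recode's generate() option, which leaves the original variable untouched. A sketch on made-up cac values, using the cutpoints from the slide:

```stata
* Hypothetical data, including values at the cutpoints and one missing
clear
input cac
0
54
55
82
83
400
.
end

* generate() creates tertilescac and leaves cac as it was
recode cac (min/54 = 1) (55/82 = 2) (83/max = 3), generate(tertilescac)
tabulate tertilescac, missing
```

In recode, min and max refer to the smallest and largest nonmissing values, so the missing observation stays missing.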

Data cleaning
Basic skill 3 – labeling variables

• You can label:
  – A dataset
        label data "label"
  – A variable
        label var varname "label"
  – Values of a variable (2-step process)
        label define labelname value1 "label1" [value2 "label2" …]
        label values varname labelname
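All three kinds of label applied to a made-up 1/0 variable. The label text and the label name malelbl are assumptions for illustration.

```stata
* Hypothetical data
clear
input male
0
1
1
end

label data "Hypothetical example dataset"
label variable male "Male sex"

* Step 1: define a named set of value labels
label define malelbl 0 "Female" 1 "Male"
* Step 2: attach that set to the variable
label values male malelbl

tabulate male
```

The stored values are still 0 and 1; tabulate and other output show the labels instead.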


Cleaning your data

• Cleaning tasks
  – Check for consistency and clean up nonsense data
  – Deal with missing values
  – Code all dichotomous variables 1/0
  – Categorize variables meaningfully (for Table 1, etc.)
  – Derive new variables
  – Rename variables
    • With common sense, or with a consistent scheme
  – Label variables
  – Label the VALUES of coded variables


Data cleaning

• Example: Neonatal opiate withdrawal data
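The real neonatal opiate withdrawal file is not reproduced here. The sketch below uses a made-up four-row stand-in with the same kinds of problems listed earlier (sex as m/f, gestational age of 0 and 60, messy breastfeeding text, a drug variable coded y or blank). All variable names, the 20 to 45 week plausible range, and the decision to read a blank drug value as "no" are assumptions.

```stata
* Hypothetical stand-in for the raw data
clear
input id str1 sex ga str8 bf str1 meth
1 "m" 38 "yes"  "y"
2 "f"  0 "Y"    ""
3 "f" 60 "no"   "y"
4 "m" 40 "some" ""
end

* Sex is m/f: code a 1/0 indicator, left missing if sex is neither
generate male = 1 if sex == "m"
replace  male = 0 if sex == "f"

* Gestational age: set implausible values to missing (range is an assumption)
generate gestage = ga
replace  gestage = . if ga < 20 | ga > 45

* Breastfeeding: collapse text variants into 1/0; anything else stays missing
generate breastfed = .
replace  breastfed = 1 if inlist(lower(trim(bf)), "yes", "y")
replace  breastfed = 0 if inlist(lower(trim(bf)), "no", "n")

* Drug variable coded y or blank: blank is taken here to mean "not exposed"
generate methadone = (meth == "y")

* Rename the raw variable, then label the new ones and their values
rename ga ga_raw
label variable gestage "Gestational age (weeks), implausible values set to missing"
label define yesno 0 "No" 1 "Yes"
label values breastfed yesno
label values methadone yesno
list
```

Each cleaned variable is a new variable, so the raw values stay available for checking the work.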


Data cleaning

• At the end of the day you have:
  – 1 raw data file, original format
  – 1 raw data file, Stata format
  – 1 do file that cleans it up
  – 1 log file that documents the cleaning
  – 1 clean data file, Stata format
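A skeleton of a cleaning do file that produces the log and the clean file from the raw Stata file. All file names are hypothetical, and the first few lines only create a made-up raw file so the sketch runs on its own.

```stata
* --- setup so the sketch runs on its own: a made-up raw Stata file ---
clear
input id age
1 34
2 67
end
save "example_raw.dta", replace

* --- the cleaning do file proper would start here ---
capture log close
log using "clean_example.log", replace text   // the log documents the cleaning

use "example_raw.dta", clear                  // start from the raw file every time

* cleaning steps go here (generate, replace, recode, label ...)
generate ageover50 = (age > 50) if !missing(age)
label variable ageover50 "Age over 50"

save "example_clean.dta", replace             // save under a NEW name, never over the raw file
log close
```

Because the do file always starts from the raw file, rerunning it reproduces the clean file exactly.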


Summary

• Data cleaning
  – ALWAYS necessary to some extent
  – ALWAYS use a do file, don’t overwrite original data
  – Check your work
  – Watch out for missing values
  – Label as much as you can


Lab this week

• It’s long
• It’s important
• It’s hard

• But this year, we have 2 sessions for it!

• Email lab to [email protected]
• Due 10/11 at midnight


Preview of next week…

• Using Excel
  – What is it good for?
  – Formulas
  – Designing a good spreadsheet
  – Formatting