Transcript
Page 1: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

A Gentle Introduction to Stata

Page 2: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University
Page 3: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

A Gentle Introduction to Stata

Alan AcockOregon State University

A Stata Press PublicationSTATA CORPORATIONCollege Station, Texas

Page 4: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

Stata Press, 4905 Lakeway Drive, College Station, Texas 77845

Copyright c© 2005 by StataCorp LPAll rights reservedTypeset in LATEX2εPrinted in the United States of America

10 9 8 7 6 5 4 3 2 1

ISBN !!

This book is protected by copyright. All rights are reserved. No part of this book may be repro-duced, stored in a retrieval system, or transcribed, in any form or by any means—electronic,mechanical, photocopying, recording, or otherwise—without the prior written permission ofStataCorp LP.

Stata is a registered trademark of StataCorp LP. LATEX2ε is a trademark of the AmericanMathematical Society.

Page 5: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

AcknowledgementsI would like to acknowledge the support of the Stata staff who have worked with

me on this project. Special thanks goes to Lisa Gilmore, the Production Manager,xxxxx, the Copy Editor, and xxx for verifying all the commands used in this volume. Ialso want to thank my students who have tested my ideas for the book. They are toonumerous to mention, but special thanks goes to Patricia Meierdiercks and ShannonWanless.

Bennet Fauber, during the time he was affiliated with Stata Corporation, providedhours and hours of support on all aspects of this project. He taught me the LATEX2εdocument preparation system used by Stata Press and his patience with many of myproblems and mistakes has inspired me to have more patience with my own students.Bennet also had a major input on the topical coverage and organization of this volume.He provided the initial draft of chapter 4 and his superior expertise on Stata commands,data management, and do-files was critical. Bennet also provided extensive editorialsuggestions and substantive editing for the first three chapters of the book. Whateverquality this book has owes an enormous debt to Bennet’s conceptual and technicalcontributions. The books completion owes a lot to his encouragement.

Finally, I would like to thank my wife, Toni Acock, for her support and for hertolerance for my endless excuses for why I could not do things. She had to pick upmany tasks I should have done and she usually smiled when told it was because I hadto finish this book.

Page 6: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University
Page 7: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

Contents

Preface xiii

Notation and Typography xv

1 Getting Started 1

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 The Stata Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Using an existing dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.4 An example of a short Stata session . . . . . . . . . . . . . . . . . . . . 9

1.5 Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Entering Data 19

2.1 Creating a dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 An example questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3 Develop a coding system . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.4 Entering data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4.1 Labeling values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.5 Saving your dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.6 Checking the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3 Preparing Data for Analysis 37

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Page 8: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

viii Contents

3.2 Plan your work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.3 Create value labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.4 Reverse code variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.5 Create and modify variables . . . . . . . . . . . . . . . . . . . . . . . . 46

3.6 Create scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.7 Save some of your data . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Author index 59

Subject index 61

Page 9: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

List of Tables

2.1 Example questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2 Example codebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.3 Example coding sheet . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.1 Sample project task list . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2 NLSY97 sample codebook entries . . . . . . . . . . . . . . . . . . . . . 40

3.3 Reverse coding plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4 Arithmetic symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

Page 10: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University
Page 11: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

List of Figures

1.1 Stata’s opening screen . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 The . Prefs . Save Windowing Preferences menu . . . . . . . . . . . . . 4

1.3 The . Prefs . Stata compact setting appearance menu . . . . . . . . . . 5

1.4 The Stata screen layout used in this book . . . . . . . . . . . . . . . . 6

1.5 Stata’s tool bar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.6 Stata command to open cancer dataset . . . . . . . . . . . . . . . . . . 8

1.7 Histogram of age . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.8 Histogram dialog box . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.9 The tabs on the histogram dialog box . . . . . . . . . . . . . . . . . . 12

1.10 The Title tab of the histogram dialog box . . . . . . . . . . . . . . . . 13

1.11 The Options tab of the histogram dialog box . . . . . . . . . . . . . . 13

1.12 First attempt at an improved histogram . . . . . . . . . . . . . . . . . 14

1.13 Final histogram of age . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1 Data editor window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.2 Variable name and variable label . . . . . . . . . . . . . . . . . . . . . 27

2.3 Define schemes dialog box . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.4 Define schemes dialog box . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.5 Describe dataset dialog box . . . . . . . . . . . . . . . . . . . . . . . . 34

3.1 Create new variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.2 Recode: specifying recode rules on the main tab . . . . . . . . . . . . . 44

3.3 Recode: specifying new variable name on the options tab . . . . . . . 44

3.4 Create new variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.5 Two-way tabulation dialog . . . . . . . . . . . . . . . . . . . . . . . . . 50

Page 12: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

xii List of Figures

3.6 The extended generate dialog . . . . . . . . . . . . . . . . . . . . . . . 52

3.7 The extended generate dialog . . . . . . . . . . . . . . . . . . . . . . . 53

3.8 Selecting variables to drop . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.9 Selecting observations with an expression . . . . . . . . . . . . . . . . 55

Page 13: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

Preface

This book was written with a particular reader in mind. This reader needs to learnStata, but has no prior experience with other statistical software packages and is learn-ing social statistics. When I learned Stata myself I found no books that I felt werewritten explicitly for this reader. There are certainly excellent books on Stata, but theyassumed prior experience with other packages such as SAS or SPSS, they assumed afairly advanced working knowledge of statistics, or both of these. These books are ableto move more quickly to more advanced topics, but they left my intended reader in thedust. Readers who have more background in statistical software and statistics than Iam assuming here will be able to read chapters quickly and even skip sections. The goalis to move the true beginner to a level of competence using Stata.

With our target reader in mind, I make far more use of the Stata menu system thanany other books about Stata. Advanced users may not see the value to using the menusand the more people learn about Stata the less they will rely on the menus. Also, evenwhen using the menu system it is still important to save a record of the sequence ofcommands you ran. Even though I rely on the commands much more than the menusin my own work, I still find value in the menus. They include many options that I mightnot have known or might have forgotten. This is most evident with graphs where thevisual quality of graphs can be greatly enhanced using the menu system.

To illustrate the menu system as well as graphics, I have included over 80 figures,many of which show menus. There are numerous tables and extensive Stata “results”that are presented as they appear on the screen and are given a substantive interpreta-tion. This is done in the belief that beginning Stata users need to learn more than justhow to produce the results. It is also necessary to go through the results and interpretthem.

I have tried to use “real” data. There are a few examples where it is just too mucheasier to illustrate a point with hypothetical data, but for the most part, I use datathat is in the public domain. The General Social Survey for 2002 is used in manychapters as is the National Survey of Youth, 1997. I’ve simplified the files by droppingmany of the variables in the original datasets, but I’ve kept all of the observations. Ihave tried to pick examples from a variety of social science fields and I have includedadditional variables so that instructors as well as readers can make additional examplesand exercises that are tailored to their discipline. People who are use to working withstatistics books that have contrived data with just a few observations, presumably sowork can be done by hand, may be surprised to see over 1,000 observations in ourdatasets. Working with these files provides better experience for other real world data

Page 14: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

xiv Preface

analysis.

The exercises use the same datasets that are used in the rest of the book. A numberof the exercises require some data management prior to estimating a model. This isdone in the belief that learning data management requires a lot of practice and cannotbe isolated in a single chapter or single set of exercises.

This book takes the student through much of what is done in introductory andintermediate statistics courses. We cover descriptive statistics, charts, graphs, tests ofsignificance for simple tables, tests for one and two variables, correlation and regression,analysis of variance, multiple regression, and logistic regression. By combining this withan introduction to creating and managing a dataset, students are well prepared to goeven further. More advanced statistical analysis using Stata is often even simpler from aprogramming point of view than what we cover. If an intermediate course goes beyondwhat we do with logistic regression to multinomial logistic regression, for example,the programming is simple enough. The command logit is simply replaced with thecommand mlogit. The added complexity of these advanced statistics is the statisticsthemselves and not the Stata commands that implement them. Therefore, althoughmore advanced statistics are not included in this book, the reader who learns thesestatistics will be more than able to learn the corresponding Stata commands from theStata documentation and help system.

I assume the reader is running Stata on Windows based PC. Stata works as wellor better on Macs and Unix systems. Readers who are running Stata on one of thosesystems will have to make minor adjustments.

Alan C. Acock Corvallis, Oregon

Page 15: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

Notation and Typography

We designed this book for you to learn by doing, so we expect you to read this bookwhile sitting at a computer so you can try using the sequences of commands containedin the book to replicate our results. In this way, you will be able to generalize thesesequences to suit your own needs.

Generally, we use the typewriter font command to refer to Stata commands, syntax,and variables. We show commands as they will appear in Stata output. That includesthe . prompt, which is not part of the command and should not be typed.

Except for some very small expository datasets, all the data we use in this book arefreely available for you to download. For example,

. use http://www.stata-press.com/data/!!!/gss 2002.dta, clear

Try it.

This text complements the material in the Stata manuals but does not replace it.For example, how to generate tables and graphs are shown in Chapters 5 and 6, butthese are only a few of the possibilities described in the Stata Reference Manual. Ourhope is to give you sufficient background that you can use the manuals effectively.

Page 16: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University
Page 17: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

1 Getting Started

1.1 Introduction1.2 The Stata Screen1.3 Using an existing dataset1.4 An example of a short Stata session1.5 Conventions1.6 Chapter Summary1.7 Exercises

1.1 Introduction

This book was written with the belief that the best way to learn data analysis is toactually do it with real data. These days, doing statistics means doing statistics with acomputer.

Work along with the book

Although it isn’t absolutely necessary, you’ll probably find it very helpful to haveStata running while you read this book so you can follow along and experimentfor yourself when you have a question about something. Having your hands on akeyboard and replicating the instructions in this book will make the lessons thatmuch more effective, but more importantly, you’ll get in the habit of just tryingsomething new when you think of it and seeing what happens. In the end, that ishow you will really learn how Stata works. The other great advantage to followingalong is that you can save the examples we do for future use.

Stata is a powerful tool for analyzing data. Stata can make statistics and dataanalysis fun because it does so much of the tedious work for you. A new user of Statashould start by using the menus. As you learn more about Stata you will be able todo more sophisticated analyses with Stata commands. Commands can be saved in files,that Stata calls do-files so a series of commands can be run all at once, which can save agreat deal of time. Learning Stata well is an investment that will pay off in saved time

1

Page 18: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

2 Chapter 1. Getting Started

later. Stata is constantly being extended with new capabilities, which can be installedby Stata itself from the internet. Stata is a program that grows with you.

Stata is a command driven program. It has a remarkably simple command structurethat you use to tell it want you want it to do. You can use a menu to generate thecommands (this is a great way to learn the commands or prompt yourself if you don’tremember one exactly), or you can enter them directly. If we enter the summarizecommand, we’ll get a summary of all the variables in our dataset (mean, standarddeviation, number of observations, minimum value, and maximum value). Enter thecommand tabulate gender and Stata will make a frequency distribution of the variablecalled gender, showing you the number and percentage of men and women in yourdataset.

After you have used Stata for awhile, you may want to skip the menu and enter thesecommands directly. When you are just beginning, however, it is easy to be overwhelmedby all of the commands available in Stata. If you were learning a foreign language youwould have no choice but to memorize hundreds of common words right away. This isnot necessary when you are learning Stata because the menus are so easy to use. Youwill learn how to use Stata’s menus to do much of your work. Even experienced userscan forget commands and they find the menus useful, especially for the more complexprocedures, such as complex graphs.

Stata has done a lot to make the menus as friendly as possible so you feel confidentusing them. The menus often show many options, which control the results that areshown and how they are displayed. Even experienced Stata programmers will discoverpossibilities in the menus that they did not know existed. The menus have defaultvalues that are often appropriate, so you may be able to do a great deal of work withoutspecifying any options.

As we progress, you will be doing more and more complex analyses. It is possibleto do this with the menus, however Stata lets you create files that contain a series ofcommands that can be run all at once. These files, called do-files, are essential once youhave a large number of commands to run. You can reopen the do-file a week or evenmonths later and repeat exactly what you did. Keeping a record of what you did thisway is an obligation of scholars. The ability to replicate results of elaborate analysesis impossible if you fail in this obligation. Fortunately, Stata makes this easy. You willlearn more about this in Chapter 4. The do-files that reproduce most of the tables,graphs, and statistics for each chapter are available on the web page for this book.

We will cover many of the graphical and statistical ways of analyzing data that areincluded in standard statistics textbooks. Because Stata is so powerful and easy to use,we may include some analyses that are not covered in your textbook. If you come toa procedure that you have not already learned in your statistics text, give it a try. Ifit seems too daunting, you can skip that section and move on. On the other hand, ifyour statistics textbook covers a procedure that we omit, you might search the menusyourself. Chances are you will find it there.

Depending on your needs you might want to skip around in the book. We tend to

Page 19: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

1.2 The Stata Screen 3

learn best when we need to know something, so skipping around to the things you knowwill be useful immediately may be the best use of the book and your time. Some things,though, require prior knowledge of other things, so if you are new to Stata, you mayfind it best to work through the first four chapters carefully in and in order. After that,you will be able to skip around more freely as your needs or interests take you.

1.2 The Stata Screen

When you open Stata you will see a screen that looks something like Figure 1.1.

Figure 1.1: Stata’s opening screen

You can rearrange the windows to look the way you want. You rearrange the windowsjust as you would in other programs. A lot of experienced Stata users have particularways of arranging these screens and you should feel free to experiment with the layout.There is a web page at UCLA that has movies showing you how to manipulate thescreen to your suiting (http://www.ats.ucla.edu/stat/stata/seminars/stata9/gui.htm) .

If you have Stata installed on your own computer, use the . Prefs menu to workwith your windowing preferences, as shown in Figure 1.2.

You can click on the . Prefs menu, . Manage Preferences, . Load Preferences, . Com-pact Windows settings option. This produces a screen like the one in Figure 1.3. Click

Page 20: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

4 Chapter 1. Getting Started

Figure 1.2: The . Prefs . Save Windowing Preferences menu

on the tab that says Review (right side of Figure 1.3). You will see a small stick pinwith the point going to the left. Click on this pin one time and the pin will turn upright.Next, click on the tab that says Variables and click on the stick pend for that window.Now you should have two panels on the right side, with Review on the top and Variablesbelow it. The final thing we want to do is to get the Command window to go all the wayacross the bottom. If you put your cursor in the blue bar at the top of the commandwindow, click and drag it you will see a diamond appear near the bottom of the screen(ignore the diamonds on the other edges and in the middle of the screen). Continue todrag your cursor until your cursor is over the diamond on the bottom of the screen andlet go. If you do this right, the screen will look like the one in Figure 1.4. This is thescreen layout we use in this book.

If you want to save this screen configuration click on . Prefs menu, . Manage Pref-erences, . Save Preferences, . New Preference Set. This opens a dialog window in whichyou can enter a name. This will save the configuration under this name. In the future,anytime you want this configuration you can load it. This flexibility is an extendedfeature in Stata 9. You eventually will probably enjoy this flexibility, but it is a bitconfusing the first time you use this feature. You can check out the UCLA movie for anextensive explanation of how to do this. If you can’t get it to work exactly as we havehere, it is always easy to select the compact setting option.

If you are in a computer lab, the computers may be set up to erase all your changeswhen the computer is restarted. If this is the case, there is nothing you can do topreserve your preferences. Luckily, it is easy to rearrange the screen the way you like itwhen you return to Stata.

Stata’s main window contains four other windows, each of which has a name thatwe will use in the rest of the book. The window name appears in the title bar of each,and they are Review, Variables, Results, and Command.

When you open a file that contains Stata data, which we’ll call a Stata dataset, alist of the variables will appear in the Variables window.

When Stata executes a command, it prints any output into the Results window.The command itself will also be printed to the Results window, preceded by a . (dot)prompt, and the command will be followed immediately by the output the commandgenerates. The commands you run are also kept in the Review window. If you clickon one of the commands listed in the Review window, it will appear in the Command

Page 21: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

1.2 The Stata Screen 5

Figure 1.3: The . Prefs . Stata compact setting appearance menu

Page 22: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

6 Chapter 1. Getting Started

Figure 1.4: The Stata screen layout used in this book

Page 23: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

1.3 Using an existing dataset 7

window.

When not using the menu system, the Command window is where commands youwant to execute get entered. You can recall commands from the Review window or typethem in yourself. You can also edit commands that appear in the Command window.You can also use the PgUp and PgDn keys to recall commands from the Review window.We’ll illustrate all of these methods in the coming chapters.

The grey bar at the bottom is called the status bar, and it will display the currentworking directory (folder). This will most likely be either C:\Stata8 or C:\Data. Thisis the active folder where Stata will save data. It is possible to change this and to storedata wherever you wish.

Stata has the usual Windows title bar, on the right of which appear the threebuttons (in order from left to right) to minimize, to expand to full-screen, and to closethe program.

Immediately below the title bar is the menu bar, where the names of the menusappear, as shown in Figure 1.2 with the . Prefs menu pulled down. Some of the menuitems will look familiar because they are used in other programs such as . File, . Window,and . Edit. The . Data, . Graphics, and . Statistics are more likely to be new, but theirnames provide a good idea of what you will find under them.

Below the title bar is the toolbar, shown in Figure 1.5, that contains icons (or tool

Figure 1.5: Stata’s tool bar

buttons) that provide alternate ways to do some of the things you would normally dowith the menus. The most useful of these are the four leftmost icons. From left to right,they are icons to open a dataset, to save a dataset, to print the contents of the Resultswindow, and to start or stop a log of your session. A complete list of the toolbar iconsand their functions appears in the Getting Started with Stata for Windows manual.

1.3 Using an existing dataset

In Chapter 2 we will learn how to create our own dataset, save it, and use it again.We will also learn how to use datasets that are on the internet. For now we will use asimple dataset that came with Stata. Although we could use the menu to do this, fornow we will enter a simple command. Click once in the Command window to put thecursor there, then type the command sysuse cancer, clear so the window looks likethe one in Figure 1.6.

Note that we used a typewriter font to indicate the text to type into the Commandwindow. Stata commands do not have any special characters at the end, so if you seea command that seems to end with a punctuation mark, the punctuation is not part of

Page 24: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

8 Chapter 1. Getting Started

Figure 1.6: Stata command to open cancer dataset

the command. Otherwise, anything that should be entered as a command will appearin a typewriter font. There may be times when we put a command onto a line by itself.When we do this we will have the dot preceding it to be consistent with Stata manuals,as in

. sysuse cancer, clear

All of Stata’s menus generate commands, which will be printed in the Review windowand also in the Results window after the dot prompt, which we mentioned earlier. Ifyou make a point of looking at the command Stata prints each time you use the menus,you will quickly learn the commands. We may also include the equivalent command inour text after explaining how to navigate to it through the menus.

Stata commands should always be typed into the Command window using the Returnkey. The command may wrap onto more than one line, but if you use the Return keyin the middle of a command, Stata will interpret that as the end of the command, andan error will likely be generated. The rule is that you should just keep typing whenentering a command, no matter how long the command is. You only press the Returnwhen you want to execute the command.

In addition to using the typewriter font for commands, we will also use it for variablenames, for the names of datasets, and to show Stata’s output. In general, we will usethe typewriter font whenever the text is something that we can type into Stata or whenthe text is something that Stata might print as output. This may seem cumbersomenow, but you’ll catch on quickly. That said, let’s move on.

The sysuse command we just used will find the sample datasets on your computerby name alone, without the extension, in this case the dataset name is cancer and thefile that is actually found is called cancer.dta. The cancer dataset was installed withStata. This particular dataset has 48 observations, and 4 variables related to a cancertreatment.

What if you forget the command sysuse? You could open a file that comes withStata using the menus: . File, . Example datasets. This opens up a new window inwhich you can click on Example datasets installed with Stata. This takes you toa window that lists all the datasets that come with Stata. You can click on . use toopen the dataset.

Now that we have some data read into Stata, enter the describe command in theCommand window. That’s it, just describe followed by the Enter key, which will yielda brief description of the contents of the dataset.

Page 25: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

1.4 An example of a short Stata session 9

. describe

Contains data from C:\Stata8\ado\base/c/cancer.dtaobs: 48 Patient Survival in Drug Trialvars: 4 3 Sep 2002 12:25size: 576 (99.9% of memory free)

storage display valuevariable name type format label variable label

studytime int %8.0g Months to death or end of exp.died int %8.0g 1 if patient dieddrug int %8.0g Drug type (1=placebo)age int %8.0g Patient’s age at start of exp.

Sorted by:

The description includes a lot of information: The full name of the file, cancer.dta(including the path entered to read the file), the number of observations (48), the numberof variables (4), the size of the file (576 bytes)and how much of Stata’s memory is stillavailable, a brief description of the dataset (Patient Survival in Drug Trial), andthe date the file was last saved are displayed at the top of the table displayed. The bodyof the table displayed shows the names of the variables on the far left and the labelsattached to them on the far right. We’ll discuss the middle columns later.

Now that you have opened the cancer dataset, note that the Variables window liststhe four variables studytime, died, drug, and age.

Box 1.1 Internet Access to Our Data

Stata can use data that is stored on the internet just as easily as data stored on yourown computer. If you did not have the cancer dataset installed on your machine,you could read it by using the command webuse cancer, clear which will readthe cancer dataset from Stata’s website. This is a particularly easy way to use thesample datasets. You are not limited to data stored at the Stata site, though. Thecommanduse http://www.ats.ucla.edu/stat/stata/notes/hsb2, clearwill open a dataset stored at UCLA. Stata can use any URL (web address) thatpoints to a Stata dataset.

1.4 An example of a short Stata session

Since you have loaded the cancer dataset, we will execute a basic Stata analysis com-mand. Type summarize into the Command window, then press Enter.

Rather than typing in the command directly, you could use the . Data . Describedata . Summary Statistics menus to open the corresponding dialog box. Simply clickingthe OK button will produce the summarize command we just entered.

Page 26: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

10 Chapter 1. Getting Started

You might want to select specific variables to summarize instead of summarizingthem all. On the dialog box, click on the pulldown menu button on the Variables box,located on the right of the box, to display a list of variables. Clicking on a variablename will add it to the list in the box. Note that the menus will allow you to entera variable more than once, in which case it will appear in the output more than once.You can also type variable names into the Variables box. If you enter the summarizecommand into the Command window, simply follow it with the names of the variablesfor which you want summary statistics. For example, the summarize studytime agewill only display statistics for the two variables named. Otherwise, summarize will giveyou a summary of all the variables.

The summarize command will display the number of observations (also called casesor N ), the mean, the standard deviation, the minimum value, and the maximum valuefor each variable in the Results window.

. summarize

Variable Obs Mean Std. Dev. Min Max

studytime 48 15.5 10.25629 1 39died 48 .6458333 .4833211 0 1drug 48 1.875 .8410986 1 3age 48 55.875 5.659205 47 67

The first line of output is the dot prompt followed by the command. After that theoutput appears as a table. As you can see there were 48 observations in this dataset,which is abbreviated as Obs in the output. Observations is a generic term. These couldbe called participants, patients, or subjects, depending on your field of study. Theaverage, or mean, age is 55.875 years with a standard deviation of 5.659, and subjectsare all between 47 (the minimum) and 67 (the maximum) years old. We may roundnumbers in the text to fewer digits than shown in the output unless it would makefinding the corresponding number in the output difficult.

If you have computed means and standard deviations by hand you know how long thiscan take. Stata’s virtually instant statistical analysis is what makes Stata so valuable.It takes time and skill to set up a dataset so Stata can be used to analyze it, but onceyou learn how to set up a dataset in Chapter 2, you will be able to compute a widevariety of statistics in very little time.

We’ll do one more thing in this Stata session: We’ll make the histogram for thevariable age shown in Figure 1.7. In case you haven’t gotten to histograms yet, ahistogram is just a graph that shows the distribution of a variable, like age, that takeson many numeric values.

Simple graphs are simple to create. Just enter the command histogram age into theCommand window, and Stata will produce a histogram using reasonable assumptions.We’ll see how to use the menus for more complicated graphics shortly.

At first glance you may be happy with this graph. Stata used a formula to determine

Page 27: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

1.4 An example of a short Stata session 11

0.0

2.0

4.0

6.0

8D

ensi

ty

45 50 55 60 65 70Patient’s age at start of exp.

Figure 1.7: Histogram of age

that seven bars should be displayed, and this is reasonable. However, Stata starts thelowest bar (called a bin) at 47 years old and each bin is 3.33 years wide (this informationis displayed in the Results window), whereas we are not accustomed to measuring yearsin thirds of a year. Also, notice that the vertical axis measures density, whereas wemight prefer that it measure the frequency, that is, the number of people representedby each bar, instead.

This is a case where using the menus can help you enormously. Let’s go to the. Graphics . Easy graphs . Histogram menu and customize our histogram

The histogram dialog box is shown in Figure 1.8. It’s worth noting a few thingsabout this dialog box, and dialog boxes in general.

Let’s go over the parts of the dialog box quickly. There is a text box labeled Variablewith a pull-down button. As we saw on the summarize dialog, you can pull down thelist of variables, and click on a variable name to enter it in the box, or you can type thevariable’s name yourself. Only one variable can be used for a histogram, and here wewant it to be age. If we stop here and click OK, we will have recreated the histogramshown in Figure 1.7.

There are two radio buttons visible on this dialog box, one labeled Continuous data(which is shown selected in Figure 1.8), and one labeled Discrete data. Radio buttonsindicate mutually exclusive items—you can choose only one of them. In this case, weare treating age as if it were continuous, so make sure that’s the radio button that isselected.

Note that the dialog box shows a sequence of tabs just under its title bar, as shownin Figure 1.9. Different categories of options will be grouped together, and you makea different set of options visible by clicking on its tab. The options you’ve set on the

Page 28: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

12 Chapter 1. Getting Started

Figure 1.8: Histogram dialog box

current tab will not be canceled by clicking on another tab.

Figure 1.9: The tabs on the histogram dialog box

Graphs are usually clearer when there is a title of some sort, so let’s click the Titlestab and add one. Here we’ll enter Age Distribution of Participants in CancerStudy in the Title box. Let’s add the text Data: Sample cancer dataset to theNote box so we know which dataset we used for this graph.

Now click the Options tab. We’ll only use a few of the options available on thistab. Let’s select s1 monochrome from pull-down list on the Scheme box. Schemes arebasically templates that determine what the standard attributes of a graph will be,things like colors, fonts, size, which elements will be shown, and more.

In the Legend box, let’s select Yes. Whether a legend will be displayed is determinedby the scheme that is being used, and if we were to leave Default in this box, ourhistogram might have a legend or it might not, depending on the scheme. If we chooseYes or No, we override the scheme, and our selection will always be honored.

Under the section labeled Y-Axis (bottom left of the Options tab), check the Fre-quency button, since we want the bars on our histogram to represent the number ofpeople.

Page 29: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

1.4 An example of a short Stata session 13

Figure 1.10: The Title tab of the histogram dialog box

Figure 1.11: The Options tab of the histogram dialog box

Page 30: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

14 Chapter 1. Getting Started

In the section labeled Bins, check the box labeled Width of bins, and enter 2.5 intothe text box that becomes active (since the variable is age, that indicates 2.5 years).Also check the box labeled Lower limit of first bin, and enter 45, which will be thesmallest age represented by the bar on the left.

We made these selections to try to improve the appearance of the graph. Thedefault graph shown above started at age 47 and each bin or category was 3.33 yearswide. Starting at 45 and having a width of 2.5 years seems more natural. Now thatwe’ve made these changes, click Submit instead of OK to generate the histogram shownin Figure 1.12. Note that the dialog box doesn’t close. To close the dialog box, click onthe close button on its upper-right corner.

02

46

810

Fre

quen

cy

45 50 55 60 65Patient’s age at start of exp.

Frequency

Data: Stata sample cancer dataset

Age Distribution of Participants in Cancer Study

Figure 1.12: First attempt at an improved histogram

If you look at the complex command that the menu generated, you’ll see why evenexperienced Stata programmers will often rely on the menu system to create graphcommands.

. histogram age, frequency width(2.5) start(45) title(Age Distribution of Parti> cipants in Cancer Study) note(Data: Stata sample cancer dataset) scheme(s1mo> no) legend(on)(bin=9, start=45, width=2.5)

It is much more convenient to use the menu system to generate that command than totry to remember all its parts and the rules of their use. If you do want to enter a longcommand into the Command window, remember that Stata commands must be typedas one line. Whenever you press Enter, Stata assumes you have finished the commandand are ready to submit it for processing.

Page 31: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

1.4 An example of a short Stata session 15

Box 1.2 When to use Submit and when to use OK

Stata’s dialogs give you two ways to run a command: by clicking OK or by clickingSubmit. If you click OK, Stata creates the command from your menu selections,runs the command, and closes the dialog box. This is just what you want for mosttasks. There will be times, though, when you know you will want to make minoradjustments to get things just right, so Stata provides the Submit button, whichruns the command but leaves the dialog open. This way, you can go back to themenu and make changes without having to reopening the menu.

The resulting histogram in Figure 1.12 is an improvement, but we might want fewerbins. Here we are making small changes to a Stata command, then looking at the results,then trying again. The submit button useful very useful for this kind of interactive,iterative work. If the dialog box is hidden, the Ctrl-Tab key combination can be used tomove through Stata’s windows until the one you want is on top again.

Instead of a width of 2.5 years, let’s use 5 years, which is a common way to groupages. If you clicked OK instead of Submit, note that when you return to a dialog thatyou’ve already used in the current Stata session, the menu reappears with the lastvalues still there. So, all you need to do is change the 2.5 to 5 in the Bin width boxon the Options tab. The result is shown in Figure 1.13. Notice that these graphs arequite different and a researcher needs to use professional judgment to pick the bestcombination. It is easy to misrepresent a distribution using a graph and we shouldguard against this.

05

1015

Fre

quen

cy

45 50 55 60 65 70Patient’s age at start of exp.

Frequency

cancer.dta

Age Distribution of Participants in Cancer Study

Figure 1.13: Final histogram of age

To finish our first Stata session, we need to close Stata. Do this with . File . Exit.

Page 32: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

16 Chapter 1. Getting Started

1.5 Conventions

We have already mentioned some of the conventions that we’ll use throughout the restof this book. We thought it might be convenient to list them all in one place, shouldyou want to refer to them quickly.

First, we will indicate some things by a change of font.

Typewriter font We use this font when something would be meaningful to Stata asinput. We also use it to indicate Stata output.

• Stata commands, as inType the describe command into the Command window.

• Variable names, as inWe want to see a histogram of age before we decide.

• Folder and file names, as inThe file survey.dta is in the C:\DATA directory (or folder). Stata assumesthat .dta will be the extension, so you can use just survey if you prefer.

Sans serif font We use this font to indicate menu items (in conjunction with the . sym-bol), button names, and particular keys.

• Menu items, as inGo to the . Data . Variable utilities . Rename variable menu.

• Buttons that can be clicked, as inRemember, if you are working on a dialog box, it will now be up to you toclick OK or Submit, whichever you prefer.

• Keys, as inThe PgUp and PgDn keys will move you backward and forward through thecommands in the Review window.Some functions require the use of the Shift, Ctrl, or Alt keys, which are helddown while the second key is pressed. For example, Alt-F will open the . Filemenu.

Quotes We will use double quotes when we are talking about labels in a general way,but as with single quotes, we’ll use the typewriter font to indicate a specific labelin a dataset. For example,

If you decide to label this variable age“Age at first birth”, then you would enterAge at first birth into the text box.

Capitalization Stata is case sensitive. So, summarize is a Stata command, whereasSummarize is not and will generate an error if you use it. Stata also recognizescapitalization in variable names, so agegroup, Agegroup, and AgeGroup will bethree different variables. Although you can certainly use capital letters in variable

Page 33: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

1.6 Chapter Summary 17

names, you will probably find yourself making more typographical errors if youdo. We have found using all lower-case letters combined with good variable andvalue labeling to be the best practice.

If you remember from the text, we’ll also capitalize the names of the variousStata windows, but we do not set them off by using a different font. So, we willtype commands into the Command window and look at the output in the Resultswindow.

When we talk about labeled elements of a dialog box, we won’t use a special font,but we’ll capitalize the label as it is capitalized on the dialog box.

1.6 Chapter Summary

We covered the following topics in this chapter.

• The Stata interface, and how you can customize it.

• How to open a sample Stata dataset.

• The parts of a dialog box and the use of the OK and Submit buttons.

• How to summarize the variables

• How to create a simple histogram and how to modify it.

• What the font and punctuation conventions we’ll use throughout the book are.

1.7 Exercises

Some of these exercises involve little or no writing. They are simply things you cando to make sure you understand the material. Other exercises may require a writtenanswer.

1. Open Stata, maximize the screen, rearrange the windows, then save these win-dowing preferences using the . Prefs menu. Close Stata and reopen it. Did youhave any trouble doing this? If you did, view the movies at(http://www.ats.ucla.edu/stat/stata/seminars/stata9/gui.htm).

2. You can copy and paste text to and from Stata as you wish. You should tryhighlighting some text in Stata’s Results window, copying it to the clipboard, andpasting it into another program, such as your word processor. To copy highlightedtext, you can use the . Edit . Copy menu or, as indicated on the menu, Ctrl-C.You will probably need to change the font to a monospaced font (e.g., Courier),and you may need to reduce its size of the font (e.g., 9 point) after pasting itto prevent the lines from wrapping. You may wish to experiment copying Stataoutput into your word processor now, so you know which font size and typeface

Page 34: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

18 Chapter 1. Getting Started

work best. It may help to use a wider margin such as 1 inch on each side. Afteryou highlight material in the Results window, press the right mouse button. Youcan save this output in several formats including HTML. Press the right mousebutton, then select HTML. Switch to your word processor. Press Ctrl-v to pastewhat you copied. In your word processor you should use a monospaced font suchas Courier. Otherwise, columns will not line up correctly.

3. Stata has posted all the datasets used to illustrate how to do procedures in itsmanuals. You can access the manual datasets from within Stata by going to the. File . Example datasets menu, which will open Stata’s Viewer. Click on Stata9 manual datasets and then click on User’s Guide [U].

The Viewer works much like a web browser, so you can click on any of the links tothe list of datasets. Scroll down to Chapter 25 and select states3.dta. This opensa dataset that is used for Chapter 25 of the User’s Guide. Run two commands,describe and summarize. What is the variable divorce rate and what is themean (average) divorce rate for the 50 states?

4. Open the cancer dataset. Create histograms for age using a bin widths of 1, 3,and 5. Use the right mouse button and copy each graph to the clipboard and thenpaste it into your word processor. Does the overall shape of the histogram changeas the bins get wider? How?

5. UCLA has a Stata portal that has a great deal of helpful material about Stata.You might want to browse the collection now just to get an idea of what topicsthere are. The URL for the main UCLA Stata page is

http://www.ats.ucla.edu/stat/stata/

In particular, you might want to look at the links listed under Learning Stata. Onthe Stata Starter Kit page, you’ll find a link to Class notes with movies. Thesedemonstrate using Stata’s commands rather than the menu system. The topicswe will cover in the first few chapters of this book are also covered on the UCLAweb page using the commands. Each movie is about 25 minutes long. If you relyon a modem for internet connectivity, you might want not to view the movies

Page 35: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

2 Entering Data

2.1 Creating a dataset2.2 An example questionnaire2.3 Develop a coding system2.4 Entering data

2.4.1 Labeling values

2.5 Saving your dataset2.6 Checking the data2.7 Chapter Summary2.8 Exercises

2.1 Creating a dataset

In this chapter, we’ll learn how to create a dataset. Data entry and verification can betedious, but it is essential to get accurate data. More than one analysis has turned outto be wrong because of errors in the data.

In the first chapter we worked with some sample datasets. Most of the analyses inthis book use datasets that have already been created for you and and are available onour web page or they may already be installed on your computer. If you must, you cancome back to this chapter when you need to create a dataset. You should look throughthe sections in this chapter that deal with documenting and labeling your data, at least,as not all prepared datasets will be well documented or labeled.

The ability to create and manage a dataset is valuable. People who know how tocreate a dataset are valued members of any research team. For small projects, collecting,entering, and managing the data can be quite straightforward. For large, complexstudies that extend over many years, managing the data and its documentation can beas complex as many of the analyses that may be conducted.

Stata, and most other statistical software almost always requires data that is laid outin a grid like a table, with rows and columns. Each row contains data for an observation(which is often a subject or a participant in your study), and each column contains the

19

Page 36: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

20 Chapter 2. Entering Data

measurement of something of interest about the observation such as age. This will befamiliar to you if you’ve used a spreadsheet.

Variables and Items

When working with questionnaire data, an item from the questionnaire almostalways corresponds to a variable in the dataset. So, if you ask for a respondent’sage, you’ll have a variable called age. Many questionnaire items are designed tobe combined together to make scales or composite measures of some sort, and anew variables will be created to contain those scales, but there is no correspondingitem on the questionnaire. A questionnaire may also have a single item where therespondent is asked to “mark all that apply”, and it is fairly common for there tobe a variable for each category that could be marked under that item. The termsitem and variable are often used interchangeably, but they are not synonyms.

There is more to a dataset than just a table of numbers. Datasets usually containlabels that help the researcher use the data more easily and efficiently. In a dataset,the columns correspond to variables, and variables must be named. We can also attacha descriptive label to each variable, which will often appear in the output of statisticalprocedures. Since most data will be numbers, we can also attach value labels to thenumbers, so it’s clear what the numbers mean.

It is extremely helpful to pick descriptive names for each item. We call thesevariable names. For example, we might call a variable that contains people’s re-sponses to a question about their state’s schools q23, but it is often better to use amore descriptive name, such as schools. If there were two questions about schools, itmight make sense to name them schools1 and schools2 than q23a and q23b. If manyitems are to be combined into scales and are not intended to be used alone, then itmay be clearer to use names that correspond to the questionnaire items for the originalvariables and reserve the descriptive names for the composite scores. No matter whatthe logic of your naming, try to keep names short. You will probably have to type themoften, and short names offer fewer opportunities for typographical errors.

Even with relatively descriptive variable names, it is usually helpful to attach alonger, and hopefully more descriptive, label to each variable. We call these variablelabels to distinugish them from variable names. For example, you might want tolabel the variable schools with the label “Public school rating”. The variable labelgives us a clearer understanding of the data stored in the variable.

For some variables, the meaning of the data is obvious. If you measure people’s heightin inches, when you see the values, you won’t wonder a lot about what they mean. Mostof us have run across questions that ask us if we “Strongly agree”, “Agree”, “Disagree”,or “Strongly disagree”, or that ask us to answer some question with a “Yes” or “No”.This sort of data is usually coded as numbers such as 1 for “Yes” and 2 for “No”, andit can make understanding tables of data much easier if the value labels are printed inthe output in addition to (or instead of) the numbers. When we label the values we call

Page 37: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

2.2 An example questionnaire 21

these value labels.

If your project is just a short homework assignment, you could consider skipping thelabeling (though your instructor would no doubt appreciate anything that makes yourwork easier to understand). For any serious work, though, clear labeling will make yourwork much easier in the long run.

2.2 An example questionnaire

We’ve talked about datasets in general, now let’s move on to creating one. Suppose weconducted a survey of 20 people and asked each of them six questions, which are shownin the example questionnaire shown in Table 2.1.

What is your gender?Male Female

How many years of education have you completed?0–8 910 1112 1314 1516 1718 1920 or more

How would you rate public schools in your state?very poor poorokay goodvery good

How would you rate public schools in the community you lived in as a teenager?very poor poorokay goodvery good

How would you rate the severity of sentences of criminals to prison?much too lenient somewhat too lenientabout right somewhat too harshmuch too harsh

How liberal or conservative are you?very liberal somewhat liberalmoderate somewhat conservativevery conservative

Table 2.1: Example questionnaire

Our task is to convert the answers to the questions on this questionnaire into adataset we can use with Stata.

Page 38: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

22 Chapter 2. Entering Data

2.3 Develop a coding system

Statistics is most often done with numbers, so we need to have a numeric coding systemfor the answers to our questions. Stata can use numbers or words. For example, wecould just enter Female if a respondent checked that box. However, it is usually betterto use some sort of numeric coding, so you might type 1 if the respondent checked theMale box on the questionnaire and 2 if Female was checked. This means we need toassign a number to enter for each possible response for each of the items on the survey.

We will also need a short variable name for the variable that will contain thedata for each item. Variable names can only contain upper- and lower-case letters,numerals, and the underscore character, and they can be up to 32 characters long. Itis generally better to keep your variable names 10 characters or shorter, and 8 or feweris best. Variable names cannot contain a space and should start with a letter. Alongwith the Variable name, compose a Variable label that will help relate the variableback to the original questionnaire.

If it is appropriate, you should explain the relationship between any numeric codesand the responses as they appeared on the questionnaire. For an example of all this puttogether, see the example codebook for our questionnaire that appears in Table 2.2.

The codebook is a translation tool. It translates the numeric codes in your datasetback into the questions you asked your participants and the choices you gave them foranswers. Regardless of whether you gather data with a computer aided interviewingsystem or with paper questionnaires, the codebook is an essential part of being able tomake sense of your data and your analyzes. If you do not have a codebook, you mightnot realize that everyone with eight or fewer years of education is coded the same way,with an 8. That may have an impact on how you use that variable in later analyzes.

We have added an id variable to identify each respondent. In simple cases, wejust number the questionnaires sequentially. If we have a sample of 5,000 people wewill number the questionnaires from 1 to 5000. Write the identification number on theoriginal questionnaire, and record it in the dataset. If we discover a problem in ourdataset—somebody with a coded value of 3 for gender, say—we can go back to thequestionnaire and determine what the correct value should be. Some researchers willeventually destroy the link between the ID and the questionnaire as a way to protecthuman participants. It is good to keep the original questionnaires in a safe place thatis locked and accessible only by members of the research team.

Some data will be missing—people refuse to answer certain questions, interviewersforget to ask questions, equipment fails—the reasons are many, and we need a code toindicate that data is missing. If we know why the answer is missing, we will want torecord the reason, too, so we may want different codes that correspond to the differentreasons the data might be missing.

On surveys, respondents may refuse to answer, may not express an opinion, or maynot have been asked a particular question because of the answer to an earlier question.In this case, we might code “invalid skip” (interviewer error) as −5, “valid skip” (not

Page 39: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

2.3 Develop a coding system 23

Question Variable Name Value Labels CodeIdentification number

id Record in order 1 to 20What is your gender?

gender Male 1Female 2No Answer −9

How many years of education have you completed?education 0-8 8

9–20 9 to 20No Answer −9

Rate public schools in your state.sch st Very poor 1

Poor 2Okay 3Good 4Very good 5No Answer −9

Rate public schools in the community you lived in as a teenager.sch com Very poor 1

Poor 2Okay 3Good 4Very good 5No Answer −9

Rate severity of sentences of criminals to prison. Are they?prison Much too lenient 1

Too lenient 2About right 3Too harsh 4Much too harsh 5No Answer −9

How liberal or conservative are you?conserv Very liberal 1

Liberal 2Moderate 3Conservative 4Very conservative 5No Answer −9

Table 2.2: Example codebook

Page 40: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

24 Chapter 2. Entering Data

applicable to this respondent) as −4, “refused to answer” as −3, −2 as “don’t know”,and −1 as “missing” (any other reason). For example, adolescent boys should not beasked when they had their first menstrual period, so we would enter a −4 for thatquestion if the respondent is male. It is important that we pick values that can never bea valid response. Stata has multiple values that are treated as missing, and we’ll showyou how to use them, but for our example questionnaire, we’ll only use one missingvalue code, −9.

We’ll be entering the data ourselves, so after we administered the questionnaire toour sample of 20 people, we prepared a coding sheet that will be used to enter thedata. The coding sheet originates from the days when data was entered by professionalkeypunch operators, but it can still be very useful or necessary. When you create acoding sheet, you are converting the data from the format used on the questionnaire tothe format that will actually be stored in the computer (a table of numbers). The morethe format of the questionnaire differs from a table of numbers, the more likely that acoding sheet will help prevent errors.

There are other reasons you may want to use a coding sheet. For example, theremay be confidential information on your questionnaires that should be shown to as fewpeople as possible; if there are many open-ended questions for which extended answerswere collected but that will not be entered, working from the original questionnairescan be unwieldy.

In general, if you transcribe the data from the questionnaires to the coding sheet,the decision about which responses go in which columns is yours. This will reduce errorsthat might arise from uncertainty by the people who do the data entry, who may nothave the information needed to make those decisions properly.

Whether you enter directly from the questionnaires or create a coding sheet will bedependent largely on the study and on the resources that are available to you. Someexperience with data entry is valuable because it will give you a better sense of theproblems you may encounter, whether it is you or someone else who enters the data, sowe have created a coding sheet for our example questionnaire, shown in Table 2.3.

Since we are not reproducing the 20 questionnaires in this book, it may be helpful toexamine how we entered the data from one of them. We will use the ninth questionnaire.We’ve assigned an id of 9, as can be seen in the first column. Reading from left toright, in the second column, for gender, we’ve recorded a 2 to indicate a woman andfor education we’ve recorded a response of 16 years. This woman rates schools in herstate and in the community in which she lived as a teenager as very good, which we cansee from the 5s in the fourth and fifth columns. She indicated that she thinks prisonsentences are too long and that she considers herself very liberal (4 and 1 in the sixthand seventh columns).

Page 41: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

2.3 Develop a coding system 25

id gender education sch st sch com prison conserv

1 2 15 4 5 4 22 1 12 2 3 1 53 1 16 3 4 3 24 1 8 −9 1 −9 55 2 12 3 3 3 36 2 18 4 5 5 17 1 17 3 4 2 48 2 14 2 3 1 59 2 16 5 5 4 110 1 20 4 4 5 211 1 12 2 2 1 512 2 11 −9 1 1 313 2 18 5 5 −9 −914 1 16 5 5 5 115 2 16 5 5 4 216 1 17 4 3 4 317 1 12 2 2 −9 118 2 12 2 2 2 219 2 14 4 5 4 420 1 13 3 3 5 5

Table 2.3: Example coding sheet

Page 42: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

26 Chapter 2. Entering Data

2.4 Entering data

We’ll use Stata’s Data Editor to enter our data from the coding sheet. The Data Editorprovides an interface that is similar to a spreadsheet, but it has some features that areparticularly suited to creating Stata datasets. Before opening the Data Editor, youmight save any open file and then enter the command clear in the Command window.This will give you a fresh Data Editor in which to enter data. Only enter the clearcommand if you want to start with a new dataset that has nothing in it. To open theData Editor, click on the . Data . Data Editor menu. There is also a shortcut on thetoolbar for opening the Data Editor (it looks like a spreadsheet), or you can use Ctrl-7, ifyou prefer, or you can type edit into the Command window. The Data Editor windowis shown in Figure 2.1.

Figure 2.1: Data editor window

Data is entered in the white columns, just as in other spreadsheets. We’ll enter thedata for the first respondent to our example survey, then stop. In the first white cellunder the first column, enter the identification number of the first respondent. We entera 1. Press the Tab key to move across to the next cell, where we will enter the valuefrom the second column of the coding sheet. This value is 2 because the first case is awoman. You press the Tab and continue until you have entered all the values for thefirst participant. After you’ve entered the number from the last number in the first rowof the coding sheet, press Return/Enter, which will move you down to the second row ofthe Data Editor. Press the Home key to return to the first column.

Let’s interrupt our data entry here, after we have the data entered for the firstrespondent. When we create a new dataset by entering data into the Data Editor,Stata assigns each column a default Variable name. As you’ve no doubt noticed, thecolumns are called var1, var2, var3, . . . , var7. We want to change these to the nameswe recorded in our codebook. We also composed a label for each variable as part of ourcodebook, and we should enter those, too.

Double click on the grey cell at the top of the first column. This cell now containsthe Variable name var1. Your double clicking on this cell opens the dialog box shownin Figure 2.2, in which we have already entered the variable name, id, and the label,Identification number. Enter these values in the boxes.

Page 43: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

2.4 Entering data 27

Figure 2.2: Variable name and variable label

In addition to the variable name and variable label, the dialog box includes a boxwhere you specify the format and it shows %8.0g. This is the default format for numericdata, and it instructs Stata to try to print numbers within 8 columns and to roundnumbers to the nearest whole number, that is, with no numbers to the right of thedecimal place. Stata will adjust the format if you enter a number wider than 8 columnsin the Data Editor.

If you have string data (that is, data that can contain letters, numbers, or symbols),you’ll see a format that looks like %9s, which indicates a variable with 9 characters thatare strings (the s). More detailed information about formats can be found at the book’sweb site.

I entered the letter l for the number 1

A common mistake is to enter a letter instated of a numeric value. This can happenif you accidently type the letter l rather than the number 1 or you might simplytype the wrong key. When this happens Stata will make the variable a “string”variable meaning that is is no longer a numeric variable. If you find you mistake andcorrect it with the correct numeric value, Stata will still think it is a string variable.There is a simple command that will force Stata to change the variable from astring variable back to a numeric variable. The command is destring varlist,replace. For example, if to change the variable id from string to numeric youwould type destring id, replace.

All of our data are numeric, none of our data contain decimals, and none is widerthan eight digits, so we can leave the format alone. Go ahead and click OK, thenrename and label the other variables in the dataset. Once you have defined a variableas numeric, Stata will warn you if you try to enter data that isn’t a number, which canhelp reduce errors. Be careful to always enter the number 1 and not the lowercase letter

Page 44: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

28 Chapter 2. Entering Data

l.

When you learn how to write Stata programs in do-files, you might want to enterthe labels by direct commands without using the menu. If we wanted to do this withthe variable prison we would enter the following commands.

. label variable prison "Harshness of prison sentences"

2.4.1 Labeling values

It is a good idea to enter the value labels we described in Table 2.2. Some people donot do this and try to remember what the values mean. You can remember that 1is the code for a male and 2 the code for a female, but if the questionnaire containshundreds of items, remembering all the values would be impossible. You could relyon the codebook, but checking the codebook each time you come across a variable forwhich you’re not sure of the coding will take valuable time, often when you are facinga deadline.

Creating value labels pays off if you are going to use the dataset more than a fewtimes, or if you share a dataset with others. Output will be nicely labeled, which willmake it much easier for you to read and make sense of your results. Value labels areespecially helpful for those on the team who may not work with the data regularly, orwho work with some parts of the dataset irregularly.

Value labels are used when the numbers have some language equivalent, such as ascale from “strongly disagree” to “strongly agree”. If we record people’s ages, there isno label we could sensibly use for 16. There can be times when the number of valuesto be labeled is daunting (diagnostic codes, for example). In many cases, the labelsare available in usable form, and it’s just a matter of finding them. Ask colleagues,librarians, and data providers.

Labeling values in Stata is done in two stages. In the first stage, we create thelabeling scheme where the text and the value to which it is attached are saved. In thesecond stage, we assign the labeling scheme to the variable. Labeling schemes are givennames, which can be the same as variable names or different. When it makes sense todo so, you may want to give the variable and the labeling scheme the same name justto keep things simple. At other times you may choose a name that is descriptive of thetype of scale or measurement being labeled. The first step involves creating a schemefor each set of labels in your data set.

We will create a scheme called sex using 1 for male and 2 for female. We will createa scheme called rating using 1 for very poor, 2 for poor, 3 for okay, 4 for good, 5 forvery good, and -9 for missing. We will create a scheme called harsh using 1 for muchtoo lenient, 2 for too lenient, 3 for about right, 4 for too harsh, 5 for much too harsh,and -9 for missing. Finally, we will create a scheme called conserv using 1 for veryliberal, 2 for liberal, 3 for moderate, 4 for conservative, 5 for very conservative, and -9for missing.

Page 45: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

2.4.1 Labeling values 29

At the bottom of Figure 2.2, just about OK you see Define/Modify. Clicking on thisopens a new dialog in which we define schemes. This new dialog appears in Figure 2.3along with how we define the scheme sex.

Figure 2.3: Define schemes dialog box

Click Define in Figure 2.3, which opens the Define label dialog box in which weenter the name for the new labeling scheme, sex. The name can contain only letters,numerals, and the underscore—the same characters that can be used for variable names.We’ll call it sex, because we will use it for our gender variables. As soon as we click onOK a new dialog box opens.

We enter the first value, 1 and text for the label, male. When we click on OK weenter the second value, 2 and the text for its label, female. After clicking OK we enterthe third value, -9 and its label, missing and click OK. Since this is the last value forthis scheme, we are back in the menu in Figure 2.3 and here we click Add. This addsthese value labels to the scheme sex. Using this process we define the other schemes.When we are finished Figure 2.3 becomes Figure 2.4.

Figure 2.4: Define schemes dialog box

Close this dialog box and return we go back to the origianl dialog in Figure 2.2. Thistime click OK and then click at the top of the second column where it says var2. This

Page 46: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

30 Chapter 2. Entering Data

opens a dialog where we enter the variable name, gender, the variable label, Gender ofparticipant, and then we can click the arrow to get a list of the value label schemeswhere we click on sex to assign that scheme to the variable gender. This is the secondset in which we assign the value label schemes to the variables. The menu for doing thisfor gender is in Figure 2.3. We do this for each variable to assign the scheme ratingto both sch st and sch com, the scheme harsh to the variable prison, and the schemeconserv to the variable conserv.

Going through this series of menus can be confusing. This is one case where it maybe easier to simply enter the commands. The key thing to remember is that it is a twostep process. The first step defines a different scheme for each set of labels. The secondstep assigns these to the appropriate variables. Here is how we would enter the firststep:

. label define sex 1 "male" 2 "female" -9 "missing"

. label define harsh 1 "much too lenient" 2 "too lenient"

. label define harsh 3 "about right", 4 "too harsh", add

. label define harsh 5 "much too harsh" -9 "missing", add

. label define rating 1 "very favorable" 2 "Favorable"

. label define rating 3 "unfavorable" 4 "very unfavorable", add

. label define rating -9 "missing", add

. label define conserv 1 "very liberal" 2 "liberal" 3 "moderate"

. label define conserv 4 "conservative" 5 "very conservative", add

Why do we show the dot prompt with these commands?

When we show a listing of Stata commands we place a dot and a space in frontof each command. When you enter these in the command window, you enter thecommand itself and not the dot prompt or space. We include these because Stataalways shows commands this way in the Results window. Also, Stata manuals andmany other books about Stata follow this convention.

Note that we added the add option to the end of the second label define command,which indicates that we are adding labels to the con labeling scheme and not tryingto redefine it. What happens if we make a mistake in how we entered a label? In thedefinitions we just entered we defined the rating scheme value of 2 the text ”Favorable”.Sine we have not capitalized any of the other labels we might want to change this tolower case, ”favorable.” To do this, we enter the following command with the option ofmodify. Stata tries to protect us against ourselves and won’t let us change the definitionwithout explicitly including the modify option.

. label define rating 2 "favorable", modify

The second step is to assign these value labels to the appropriate variables. If we donot want to go through the menus we can enter these commands directly:

. label values gender sex

. label values sch_st rating

. label values sch_com rating

Page 47: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

2.5 Saving your dataset 31

. label values prison harsh

. label values conserv conserv

You may have experience with other statistical packages that do not use this twostep process. Assigning value labels to each variable, one at a time, would be reasonablefor a dataset that just has a few variables such as our example. When you are dealingwith complex datasets there may be dozens of variables that have the same scheme suchas strongly agree to strongly disagree or yes/no options. A complex data set might havejust 10 schemes, but 200 or more items. Once you have defined each scheme you canassign them to all the variables for which they apply. This is an efficient approach forall but the simplest datasets.

2.5 Saving your dataset

If you look at the Results window in Stata you can see that the menu has done a lotof work for you. The Results window shows that a lot of commands have been run tolabel the variables. These also appear in the Review window. The Variables windowlists all of your variables. We’ve created our first dataset.

Up till now, Stata has been working with our data in memory, and we still need tosave the dataset as a file, which we can do from the . File . Save As menu. This willopen the familiar Windows dialog box to choose where you would like to save your file.Once the data has been saved to a file, you can exit Stata or use another dataset andnot lose your changes.

Notice that Stata will give the file the .dta extension if you don’t specify one. It isimportant that you use this extension, as that is what identifies it as a Stata dataset.You can give it any first name you want, but don’t change the extension. Let’s call thisfile firstsurvey and let Stata add the .dta extension for us. If you look in the Resultswindow, you’ll see that Stata has printed something like

. save "C:\DATA\firstsurvey.dta"

It is a good idea to always look to make sure that you spelled the name as youthought, and that it got saved into the directory you intended. Note, also, that Stataadded quotation marks around the filename. If you type a save command into theCommand window yourself, you must use the quotes around the file name only if thereis a space in it, however, if you always use the quotes around the filename, you’ll neverget an error by forgetting to use them.

Stata works with one dataset at a time. If you wish to work with a different dataset,the new one will replace the current one in memory. When you open the new dataset,Stata will clear the current one from memory first. Stata knows when you have changedyour data, and you will be prompted if you attempt to replace unsaved data with newdata. It is, however, a very good idea to get in the habit of saving before reading newdata because Stata will not prompt you if the only things that have changed are labelsand notes.

Page 48: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

32 Chapter 2. Entering Data

A Stata dataset contains more than just the numbers, as we’ve seen in the workwe’ve done so far. A Stata dataset contains

• The numerical data

• The variable names

• The variable labels

• Missing values

• Formats for printing data

• Dataset label

• Notes

• Sort order and the name of the sort variables

In addition to the labeling we’ve already seen, you can store notes in the dataset,you can attach notes to particular variables, which can be a very convenient way todocument changes, particularly if more than one person is working on the dataset. Tosee all the things you can insert in a dataset, enter ubject)command, help labelthecommand help label. If the dataset has been sorted, that information is also stored.

We have now created our first dataset and saved it to disk, so let’s close Stata. Thisis done with the . File . Exit menu item.

2.6 Checking the data

We’ve created a dataset, now we want to check the work we did defining the dataset.Checking for the accuracy of our data entry is also our first statistical look at the data.To open the dataset, use the . File . Open menu. Locate the firstsurvey dataset thatwe saved in the preceding section, and click OK.

Let’s run a couple of commands that will characterize our dataset and the datastored in it in slightly different ways. We created a codebook to use when creating ourdataset. Stata provides a convenient way to reproduce much of that information, whichis useful if you want to check to make sure you entered the information correctly. It isalso very useful if you share the dataset with someone who doesn’t have access to theoriginal codebook. Use the . Data . Describe Data . Describe Data Contents (Codebook)menu to generate Stata’s codebook from the current dataset. Note that you can specifyvariables for which you want a codebook entry, or you can get a complete codebook bynot specifying any variables. Try it first without specifying any variables.

Page 49: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

2.6 Checking the data 33

Scrolling the Results

When Stata displays your result, it does so one screen at a time. When there ismore output than will fit in the Results window, Stata prints what will fit anddisplays more in the lower-left corner of the Results window. Stata calls thisa more condition in its documentation. To continue, you can click on the morein the Results window, but most of us find it easier to just press the space bar. Ifyou don’t want to see all of the output, that is, you want to cancel displaying therest of it, you can press the q key. If you do not like having the output pause thisway, you can turn this paging off, by typing the command set more off into theCommand window. All of the output generated will print without pause until youeither change this setting back or quit and restart Stata. To restore the paging,use the set more on command. To see what else you can set, enter the commandhelp set and you will get a long list of things you can customize.

Let’s take a look at the codebook entry for gender as Stata produced it. For thisexample, you can either scroll back in the Results window, or you can return to the. Data . Describe Data . Describe Data Contents (Codebook) menu and put gender intothe Variables box, as we’ve done in the examples that follow.

. codebook gender

gender participant’s gender

type: numeric (byte)label: sex

range: [1,2] units: 1unique values: 2 missing .: 0/20

tabulation: Freq. Numeric Label10 1 male10 2 female

Let’s go over the display for gender. The first line lists the variable name, gender,and the variable label, participant’s gender. Next, the type of the variable, which isnumeric (byte) is shown. The value label scheme associated with this variable is sex.The range of this variable (shown as the lowest value, then the highest value) is from 1to 2, there are two unique values, and we have 0 missing values in our 20 cases. This isfollowed by a table showing the frequencies, values, and labels. We have 10 cases witha value of 1, and these are labeled male, and 10 cases with a value of 2 and these arelabeled female.

This is an excellent way to check your categorical variables, like gender and prison.If, in the tabulation, you saw three values for gender or six values for prison, youwould know that there are errors in the data. Looking at these kinds of summaries ofyour variables will often help you detect data-entry errors, which is why it is crucial to

Page 50: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

34 Chapter 2. Entering Data

review your data after entry.

Let’s also look at the display for education.

. codebook education

education Years of education

type: numeric (byte)

range: [8,20] units: 1unique values: 10 missing .: 0/20

mean: 14.45std. dev: 2.94645

percentiles: 10% 25% 50% 75% 90%11.5 12 14.5 16.5 18

Because this variable takes on a larger number of values, Stata does not give us atabulation of the individual values. Instead, Stata gives us the range (8 to 20) alongwith the mean, which is 14.45, and the standard deviation, which is 2.95. It also givesthe percentiles. The 50th percentile is usually called the median and in this case themedian is 14.5.

With a large dataset the complete codebook may be more information than youneed. For a quicker, condensed overview there is an alternate command, describe.Enter the command into the Command window, or use the . Data . Describe Data. Describe Variables in Memory menu, which will display the dialog box in Figure 2.5.

Figure 2.5: Describe dataset dialog box

If you do not list any variables in the Variable box, then all the variables will beincluded in the output, as shown in the following output:

Page 51: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

2.6 Checking the data 35

. describe

Contains data from C:\DATA\firstsurvey.dtaobs: 20vars: 7 28 May 2004 09:39size: 240 (99.9% of memory free)

storage display valuevariable name type format label variable label

id int %8.0g Identification numbergender byte %8.0g sex participant’s gendereducation byte %8.0g Years of educationsch_st byte %9.0g approve Rating of schools in your statesch_com byte %9.0g approve Ratings of schools in your

community of originprison byte %16.0g length Rating of prison sentencesconserv byte %17.0g con Conservativism/liberalism

Sorted by:

As we can see by examining the output, the path to the data file is shown (directoryand file name), as is the number of observations, the number of variables, the size ofthe file, and the last date and time the file was written (also called a time stamp. Itis possible to attach a label to a dataset, too, and if we had done so, it would havebeen shown above the time stamp. Below the information that summarizes the datasetis information on each variable: its name, its storage type (there are several types ofnumbers, and string), the format Stata uses when printing values for the variable, thevalue label scheme associated with it, and the variable label. This is clearly not ascomplete a description of each variable as provided by codebook, but it is often justwhat you want.

One last thing we will do to check the dataset is to get a table of the standardsummary, or descriptive, statistics we usually want to see with numeric data. Use the. Data . Describe Data . Summary Statistics menu. If you do not specify any variables,we’ll get summary statistics for all the variables.

. summarize

Variable Obs Mean Std. Dev. Min Max

id 20 10.5 5.91608 1 20gender 20 1.5 .5129892 1 2

education 20 14.45 2.946452 8 20sch_st 18 3.444444 1.149026 2 5

sch_com 20 3.5 1.395481 1 5

prison 17 3.176471 1.550617 1 5conserv 19 2.947368 1.544657 1 5

Each variable appears as a row, with its name, the number of observations (which variesbecause we have missing values for some of the participants), the mean, the standarddeviation, the minimum value, and maximum value. Once again, we want to examine

Page 52: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

36 Chapter 2. Entering Data

these values to see if there is any indication that we may have errors in the data. Forexample, if the maximum value shown for education is 30, you might want to verifythat value. Another good check to perform here is to make sure that we have properlyconverted all of the codes we used for missing values when we entered the data intoStata missing values. Had we not done so, you would see minimum values listed thatwere negative numbers. We chose negative numbers to make it obvious when we lookat summaries like these whether we have converted them into missing values or not.

How long would it take you to compute the means and standard deviations for allof these variables by hand, even for just the 20 observations we have? Here we got theresult almost instantly. Using the computer really does save time, and a lot of it, whendoing statistics. The more proficient you become with the computer and Stata, theeasier it will be to generate useful statistics quickly and painlessly.

2.7 Chapter Summary

We covered the following topics in the chapter.

• How to create a codebook for a questionnaire.

• How to develop a system for coding the data.

• Entering the data into a Stata dataset.

• Adding variable names and variable labels.

• Creating value label schemes and associating them with variables.

• Converting our missing value codes to Stata’s missing values.

• Summarizing our dataset and our variables to check for errors and to get our firststatistical look at the data.

2.8 Exercises1. From the coding sheet sheet shown in Table 2.3 translate the numeric codes back

into the answers given on the questionnaire by using the coding sheet in Table 2.2.

2. Using the menu, attach a descriptive label to the dataset we created in this chapter.What command is printed into the Results window by the menu? You find thismenu under . Data . Labels.

3. Using the menu, experiment with Stata’s notes features. You can attach notes tothe dataset and to specific variables. See if you can attach notes, display them, andremove them. What might the advantage be to keeping notes inside the dataset?

4. Administer your own small questionnaire to students in your class. Enter the datainto a Stata Dataset, label the variables and values.

Page 53: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3 Preparing Data for Analysis

3.1 Introduction3.2 Plan your work3.3 Create value labels3.4 Reverse code variables3.5 Create and modify variables3.6 Create scales3.7 Save some of your data3.8 Summary3.9 Exercises

3.1 Introduction

A substantial portion of any research project is spent preparing for analysis. Exactlyhow much will depend on the data. If you collect and enter your own data, a majority ofthe actual time you spend on the project will not be analyzing the data, it will be gettingit ready to analyze. If you are doing secondary analysis and are using well-prepareddata, substantially less time may be spent.

Stata has an entire manual on data management that is nearly 500 pages long, DataManagement Reference Manual . If your data is always going to be prepared for youand you will only do statistics and graphs with Stata, you might skip this chapter. Wewill only cover a few of Stata’s capabilities for managing data. Some, perhaps many, ofthe data management tasks you perform you will be able to anticipate and plan, butothers will just arise as you progress through a project and you’ll need to deal withthem on the spot.

3.2 Plan your work

The data we will be using is from the U.S. Department of Labor, Bureau of La-bor Statistics, National Longitudinal Survey of Youth, 1997, which we will abbre-viate as NLSY97. We selected a set of four items to measure how adolescents feel

37

Page 54: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

38 Chapter 3. Preparing Data for Analysis

about their mother and another set of four parallel items to measure how adolescentsfeel about their father. We used an extraction program available from the web page(http://www.bls.gov/nls/quex/y97quexcbks.htm) and extracted a Stata dictionarythat contained a series of commands for Constructing the Stata dataset we will usealong with the raw data in an ASCII format. We created a dataset called relate.dta.In this chapter we will learn how to work with this dataset to create two variables,namely, Youth Perception of Mother and Youth Perception of Father.

What is a Stata dictionary file?

A Stata dictionary file is a special file that contains Stata commands and rawdata in plain text called ASCII format. We will not cover this type of file here,but you use this to import the raw data into a Stata dataset. Using the menu. File . Import opens a menu where we select ASCII data in fixed format witha dictionary. This takes us to a window in which we Browse for where we storedthe dictionary file. In this case, we find relate.dct. The extension .dct is alwaysused for a dictionary file. Sometimes the dictionary will be in one file and the rawASCII data will be in a separate file. Here they are both in the same file so wedo not need to point to an ASCII data file. Clicking OK creates the Stata dataset.We can then save it using the name relate.dta. The dictionary we used providesthe variable names and variable labels. Had the variable labels not been includedin the dictionary, we would have created them prior to saving the dataset. If youneed to create a dictionary file yourself you can learn how to do that in the DataManagement Reference Manual.

It is a very good idea to make an outline of the steps you will need to take goingfrom data collection to analysis. The outline should include what needs to be doneto collect or obtain the data, enter (or read) and label the data, make any necessarychanges to the data prior to analysis, create any composite variables, and finally, createan analysis-ready version (or versions) of the dataset.

An outline serves two vital functions: It provides direction so you don’t lose yourway in what can often be a complicated process with many details, and it provides youwith a set of benchmarks for measuring progress. For a large project, this latter functionshould not be underestimated, as it is the key to successful project management.

Our project outline, which is for a simple project on which only one person is working,is shown in Table 3.1.

This is a pretty skeletal outline, but it is a fine start for a small project like this.The larger the project, the more detail needed for the plan. Now that we have theNLSY-supplied codebook to look at, we can fill in some more details of our plan.

First, we will run a describe command on the dataset we created. Let’s just showthe results for the four items measuring the adolescents’ perception of their mother.From the documentation on the dataset we can determine that these items are called

Page 55: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3.2 Plan your work 39

◦ Consult NLSY97 documentation, Appendix 9, to determine which vari-ables are needed (see page 23 in this Appendix that is a PDF file).http://www.bls.gov/nls/quex/y97quexcbks.htm

◦ Download the data and codebook (done for you as relate.dct and relate.cdb.This was done for you.

◦ Create a basic Stata dataset. This was done for you, see relate.dta.

◦ Create variable and value labels.

◦ Generate tables for variables to compare against the codebook to check for errors.

◦ Convert missing-value codes to Stata missing values.

◦ Reverse code those variables that need it; verify.

◦ Copy variables not reversed to named variables; verify.

◦ Create the scale variable.

◦ Save the analysis-ready copy of the dataset.

Table 3.1: Sample project task list

R3483600, R3483700, R3483800, and R3483900. Thus our command is: describeR3483600 R3483700 R3483800 R3483900. This produces the following description ofthe data:

. describe R3483600 R3483700 R3483800 R3483900

storage display valuevariable name type format label variable label

R3483600 float %9.0g MOTH PRAISES R DOING WELL 1999R3483700 float %9.0g MOTH CRITICIZE RS IDEAS 1999R3483800 float %9.0g MOTH HELPS R WITH WHAT IMPT TO

R 1999R3483900 float %9.0g MOTH BLAMES R FOR PROBS 1999

These variable names are not very helpful and we might want to change them later.For example, we my rename R3483600 to mopraise. The variable labels make sense,but notice that there are no value labels. We will read the codebook we downloadedwhen we downloaded this dataset. It is in a file called relate.cbk. Table 3.2 showswhat the downloaded codebook looks like for the R3483600 variable:

This is very helpful. We see that a code of 0 means Never, a code of 1 meansRarely, etc. We need to make value labels so our data will have nice value labels likethese. Also, notice that there are different missing values. They use a -1 for Refusal,-2 for Don’t know, -4 for Valid skip, and -5 for Non-interview. We are puzzled

Page 56: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

40 Chapter 3. Preparing Data for Analysis

--------------------------------------------------------------------------------R34836.00 [YSAQ-022]Survey Year: 1999

MOTHER PRAISES R FOR DOING WELL

How often does she praise you for doing well?

118 0 NEVER235 1 RARELY917 2 SOMETIMES1546 3 USUALLY1701 4 ALWAYS

-------4517

Refusal(-1) 20Don’t Know(-2) 3TOTAL =========>4540VALID SKIP(-4) 3669NON-INTERVIEW(-5) 775

Table 3.2: NLSY97 sample codebook entries

by the absence of a -3 code. Searching the documentation we learn that this was usedwhen there was an interviewer mistake or an invalid skip. We didn’t have any ofthese for this item, but might for others. We need to distinguish between these typesof missing values. The refusals and the don’t know responses are people who shouldhave answered the question but didn’t answer or didn’t know what they thought. Bycontrast, notice there are 3,669 valid skips. These are people who were not asked thequestion for some reason. We need to check this out and if we read the documentationwe learn that these people were not asked because they were over 14 years old and onlyyouth 12-14 were asked this series of questions. Finally, there were 775 youth who werenot interviewed. We need to check this out as well. This is 1999 data, the third yearthe data was collected and the researchers were unable to locate these youth or theyrefused to participate in the third year of data collection. In any event, we need todefine missing values in a way that keeps track of these distinctions.

How does our Stata dataset look compared to the codebook we downloaded. Wecan run the command codebook. If you are following this you might do this now. Wewill restrict it to just the single variable R3483600, by entering the command codebookR3483600 in the Command window.The problem with our data is apparent when youlook at this codebook, the absence of value labels makes this for our actual data unin-terpretable.

. codebook R3483600

R3483600 MOTH PRAISES R DOING WELL 1999

type: numeric (float)

range: [-5,4] units: 1

Page 57: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3.3 Create value labels 41

unique values: 9 missing .: 0/8984

tabulation: Freq. Value775 -53669 -4

3 -220 -1118 0235 1917 21546 31701 4

Before we add value labels so our codebook looks more like the codebook we down-loaded, we need to define the missing values in our dataset. Because all of the variableshave the same negative values coded for the different types of missing values, we can as-sign missing values in a single command, mvdecode all,mv(-5=.a\-4=.b\-3=.c\-2=.d\-1=.e)

. mvdecode _all, mv(-5=.a\-4=.b\-3=.c\-2=.d\-1=.e)R3483600: 4467 missing values generatedR3483700: 4469 missing values generatedR3483800: 4468 missing values generatedR3483900: 4470 missing values generatedR3485200: 5608 missing values generatedR3485300: 5611 missing values generatedR3485400: 5610 missing values generatedR3485500: 5611 missing values generatedR3828100: 775 missing values generatedR3828700: 775 missing values generated

3.3 Create value labels

We will add value labels including labels for the missing values. We know from thedataset documentation that we will be reverse coding some of the variables becausesome items are stated positively (praises) and some are stated negatively (blames), sowhen we create value labels, we will want labels for the original coding and labels forthe reversed coding. We’ve chosen to call the labels for the parental relations itemsoften and those for the reversed items often r. The text of the labels is taken fromthe codebook file. Here are the commands to add the labels. You can, if you wish, usethe menus to do this.

. label define often 0 "Never" 1 "Rarely" 2 "Sometimes" 3 "Usually" 4 "Always"

. label define often .e "Non-interview" .d "Valid skip" .c "Invalid skip", add

. label define often .b "Don’t know" .a "Refusal", add

. label define often_r 4 "Never" 3 "Rarely" 2 "Sometimes" 1 "Usually" 0 "Always"

. label define often_r .e "Non-interview" .d "Valid skip" .c "Invalid skip", add

. label define often_r .b "Don’t know" .a "Refusal", add

We’ve entered value labels, but we haven’t yet attached them to any variables. Thenext task is to assign the value labels to the variables. This can be done in several

Page 58: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

42 Chapter 3. Preparing Data for Analysis

ways. You can open the Data Editor, then right click in any cell in each of the columns,to bring up the menu, then select the Assign Value Label to Variable, and choose theappropriate value label from the list. We’ve done this for R3483600 and the resultsappear in Figure 3.1. Notice that after we do this for R3483600 the value labels appearin the Data Editor for this variable.

Figure 3.1: Create new variable

Sometimes these value labels are what you want to see in the Data editor and some-times you need to see the actual numbers the value labels represent. If you want to seethe numbers instead of the labels, there is a simple command you enter in the Commandwindow, edit, nolabel. This opens the Data editor and shows the numerical valuesand the missing codes.

Another way you can assign the value labels is through the menu system. You coulduse the . Data menu, . Labels, . Label values, . Assign value labels to variable, as we didin Chapter 2. If you wanted to enter the commands directly, you could type commandsinto the Command Window:

. label values R3483600 often

. label values R3483800 often

. label values R3485200 often

. label values R3485400 often

We have used the value label called often for the positively stated items. We willassign the often r scheme to the other items after we reverse code them.

Page 59: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3.4 Reverse code variables 43

3.4 Reverse code variables

We have two items for the mother and two for the father that are stated negativelyand we should reverse code these. The R3483700 item refers to the mother criticizingthe adolescent and the R3485300 item refers to the father criticizing the adolescent.The R3483900 and R3485500 refer to the mother and father blaming the adolescent forproblems. For these items, an adolescent that reports this always happens would meanthe adolescent had a low rating of his or her parent. We always want a higher score tosignify more of the variable. A score of 0 for never on this pair of items should be thethe highest score on these items, i.e. 4, and a response of always should be the worseresponse and would have the lowest score, i.e., 0.

It is good to organize all the variables that need reverse coding into groups based ontheir coding scheme. It is also a good idea to create new variables instead of reversingthe originals. This insures that we have the original information if any question arisesabout what we have done. We recommend that you again write things out before youstart working, as we’ve done in Table 3.3.

Old value New value

0 41 32 23 14 0

Table 3.3: Reverse coding plan

For a small dataset like the one we are using as our example, this may seem likeoverkill. When you are involved in a large-scale project, the experience gained fromworking out how to organize these matters on small datasets will stand you in verygood stead, particularly if you need to assign tasks to several people and try to keepthem all straight.

There are several ways that this recoding can be done, but we’ll use recode rightnow. From the . Data menu, choose . Create or change variables, . Other variabletransformation commands, . Recode categorical variable.

The recode command, and its dialog box, are straightforward. You provide thename of the variable to be recoded, then you give at least one rule for how it shouldbe recoded. Recoding rules are enclosed in parentheses, with an old value (or values)listed on the left, then an equals sign, then the new value. The simplest kind of rulelists one value on either side of the equals sign. So, for example, the first recoding rulewe would use to reverse code our variables would be (0=4).

Page 60: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

44 Chapter 3. Preparing Data for Analysis

Figure 3.2: Recode: specifying recode rules on the main tab

Figure 3.3: Recode: specifying new variable name on the options tab

Page 61: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3.4 Reverse code variables 45

More on recoding rules

More than one value can be listed on the left side of the equals, as in (1 2 3=0),which would recode values of 1, 2, and 3 as 0. Usually more than one rule will bewanted, but occasionally you may just want to collapse one end of a scale, as with(5/max=5) which recodes everything from 5 up to but not including the missingvalues as 5, which might be used to collapse the highest income categories into one,for example. There is also a corresponding version to recode from the smallest valueup to some value. For example, (min/8=8) might be used to recode the values forhighest attained grade in school if you wanted the same score for everybody withfewer than 9 years of education coded the same way.

When you use recode, you can choose to recode the existing variables or createnew variables that have the new coding scheme. We strongly recommend creating newvariables. The dialog boxes shown in Figures 3.2 and 3.3 show how this is done for allfour variables being recoded. The first list is the original variables on the Main tab, asshown in Figure 3.2. The second list is the names of the new variables to be created onthe Options tab, as shown in 3.3. We started the started the variables for mothers withan mo and the variables for fathers with fa. We added an r at the end of each variableto remind us that they are reverse coded. Stata will pair up the names, starting withthe first name in each list, and recode according to the rules stated. You must have thesame number of variable names in each list. The output from the selections shown inFigures 3.2 and 3.3 is:

. recode R3483700 R3483900 R3485300 R3485500 (4=0) (3=1) (2=2) (1=3) (0=4), genera> te(mocritr moblamer facritr facrtm)(3230 differences between R3483700 and mocritr)(4037 differences between R3483900 and moblamer)(2562 differences between R3485300 and facritr)(3070 differences between R3485500 and facrtm)

The output above shows how Stata presents output that is too long to fit on oneline: it moves down a line and inserts a > character at the beginning of the new line.This differentiates command continuation text from the output of the command. If youare trying to use the command in output as a model for typing a new command, youshould not type the > characters. Notice the first line ends with genera and you woulddelete the > and spaces so the word would be generate

The recode command labels the variables it creates. The text is generic, and youmay wish to replace it. For example, the mocritr variable was labeled with the text:

RECODE of R3483700 (MOTH CRITICIZE RS IDEAS 1999)

That tells you the original variable name (and its label in parentheses), but it doesnot tell you how it was recoded. It may be worth the additional effort to create moreexplicit variable labels, as in

Page 62: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

46 Chapter 3. Preparing Data for Analysis

. label variable mocritr "Mother criticizes R, R3483700 reversed"

. label variable moblamer "Mother blames R, R3483900 reversed"

. label variable facritr "Father criticizes R, R3485300 reversed"

. label variable fablamer "Father blames R, R3485500 reversed"

The recode command is quite useful. Missing values can be specified either asoriginal values or as new values, and a value label can be created as part of the recodecommand. You can also recode a range (say from 1 to 5) easily, which is useful forrecoding things like ages into age groups. Anytime the resulting values are easily listed(most often integers, but they could be noninteger values), it is worth thinking aboutusing the recode command.

3.5 Create and modify variables

There are three primary commands used to create and modify variables: generate andegen are used to create new variables, and replace is used to modify the values ofexisting variables. We’ll illustrate the use of each of these commands in this section,but they can do much more than we are able to show.

The simplest kind of variable creation is where you simply assign the same valueto every observation in the new variable, which is called creating a constant variable.The next simplest kind of variable creation is where you simply copy the values fromone variable into a new variable. One of the tasks in Table 3.1 was to create copiesof some of the original variables that have more descriptive names than those suppliedwith NLSY97, and we’ll do that now.

To create a new variable using the menus, go to . Data menu, . Create or changevariables, . Create new variable, which will bring up the dialog box shown in Figure 3.4.The dialog is pretty simple. There’s a place for you to enter the name of the newvariable, and there’s a place for you to enter what the new variable should equal. Wewant to create a variable called id and it should contain whatever is in the variableR0000100.

You could enter the generate command directly in the Command window to do thesame thing, as in, as shown by the command that the dialog box generates:

. generate id = R0000100

To create a constant variable, simply replace the variable name on the left side of theequals sign with the value you would like the variable to have.

. generate one = 1

Stated generally, the way to assign values to a new variable with the generatecommand is simple: you put the command itself, then the name of the new variable,then an equals sign, and finally you type out some expression (or rule, or formula) thattells generate the value to put in the new variable. Here is what the commands torename the remaining items that did not need to have their coding reversed look like.

Page 63: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3.5 Create and modify variables 47

Figure 3.4: Create new variable

. generate id = R0000100

. generate sex = R0346300

. generate age = R0346600

. generate mapraise = R34836000

. generate mahelp = R3483800

. generate fapraise = R3485200

. generate fahelp = R3485400

The generate expression can be as simple as a single number or the name of anothervariable, as shown above, or it can be quite complicated. You can put variable names,arithmetic signs, and many kinds of functions into expressions. Table 3.4 shows thearithmetic symbols that can be used. You should take note that attempts to do arith-metic with missing values will lead to missing values. So, in the addition example inTable 3.4, if sibscore were missing (say it was a single-child household), then the wholesum in the example would be set to missing for that observation. From the commandwindow you can type help generate and you will see several additional examples ofwhat can be done. .

Symbol Operation Example

+ addition mscore + fscore + sibscore- subtraction balance - expenses - penalty* multiplication income * .75/ division expenses/income^ exponentiation (x2) x^2

Table 3.4: Arithmetic symbols

For more complicated expressions, the order in which things happen can be impor-tant, and parentheses are used to control the order in which things are done. Parenthe-

Page 64: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

48 Chapter 3. Preparing Data for Analysis

ses contain expressions, too, and those are calculated before the expressions outside ofparentheses. Parentheses are never wrong. They might be unnecessary to get Stata tocalculate the correct value, but they aren’t wrong. If you think they make an expressioneasier to read or understand, use as many as you need.

Fortunately, the rules can be made pretty simple. Things are done reading from leftto right, and the order in which things inside expressions are calculated is:

1. Do everything in parentheses. If one set of parentheses contains another set, dothe inside set first.

2. Exponentiate (raise to a power)

3. Multiply and divide

4. Add and subtract

Let’s step through an example:

. generate example = weight/.45*(5+1/age^2)

When Stata looks at the expression to the right of the equals sign, it notices the paren-theses, so it looks inside those. There it sees that it first has to square age, then divide1 by the the result, then add 5 to that result. Once it is done with all the stuff inparentheses, it starts reading from left to right, so it divides weight by .45 and thenmultiplies that by the final thing it got from calculating what was inside parentheses.That final value is put into a variable called example.

Stata doesn’t care about spaces in expressions, but they can help readability. So,for example, instead of writing something like we just did, we can use some spaces tomake the meaning clearer, as in,

. generate example = weight/.45 * (5 + 1/age^2)

If you wanted to be even more explicit, it would not be wrong to write

. generate example = (weight/.45) * (5 + 1/(age^2))

Let’s take another look at reverse coding. We reverse coded variables by using a setof explicit rules like (0=4). But we could accomplish the same thing using arithmetic.Since this is a relatively simple problem, we’ll use it to introduce some of the things youmay need to be concerned about with more complex problems.

Reversing a scale is swapping the ends of the scale around. Our scale is 0 to 4, sowe can swap the ends around by subtracting the current value from 4. If the originalvalue is 4, and we subtract it from 4, we’ve made it a 0, which is the rule we specifiedwith recode. If the original value is 0, and we subtract it from 4, we’ve made it a 4,which is the rule we specified with recode.

This scale starts at 0, so to reverse it, you just subtract each value from the largestvalue in the scale, in this case 4. So, if our scale was 0 to 6, we’d subtract each value

Page 65: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3.5 Create and modify variables 49

from 6; if it was 0 to 3, we’d subtract from 3. If the scale starts at 1 instead of 0,then you need to add one to the largest value before subtracting. So, for a 1 to 5 scale(6− 1 = 5 and 6− 5 = 1), you would subtract from 6; for a 1 to 3 scale, you’d subtractfrom 4 (4− 3 = 1 and 4− 1 = 3).

What we’ve said so far is correct, as far as it goes, but we aren’t taking into accountmissing values and/or their codes. Our missing value codes are −1 to −5, and if wesubtract those from 4 along with the item responses that aren’t missing value codes,we’ll end up with 4 − (−1) = 5 to 4 − (−5) = 9. So, we would need to add a secondmvdecode command for the reversed variables. Or, we could first convert our missingvalues, then do the arithmetic to reverse the scale.

Let’s just work with one variable, since this is an example. We’ve already appliedthe mvdecode command, so let’s reverse code R3485300 and call it Facritr (rememberthat Stata is case-sensitive, so facritr and facritr are different names). We use thesame dialog that is shown in Figure 3.4, except enter 4 - R3485300 into the Contentsof new variable box instead of just R3485300. The output is

. generate float facritr = 4 - R3485300(5611 missing values generated)

Whenever you see that missing values are generated (there are 5,611 of them in thisexample!), it is a good idea to make sure you know why. The variables we are workingwith only have a small set of values they can take, so we can compare the originalvariable with the new variable in a table and see what got turned into what. Fromthe . Statistics menu, choose . Summaries, tables, & tests, . Tables, . Two-way tableswith measures of association, which will bring up the dialog shown in Figure 3.5. We’veselected the two variables, and it is worth noting that we’ve checked two of the optionsboxes. We are interested in the actual values the variables take, so we selected theoption to suppress the value labels, and we are interested in the missing values, so wecheck the box to have them included in the table.

. tabulate facritr R3485300, miss

RECODE ofR3485300

(FATHCRITICIZE

IDEAS FATH CRITICIZE IDEAS 19991999) 0 1 2 3 4 .a .b Total

0 0 0 0 0 117 0 0 1171 0 0 0 247 0 0 0 2472 0 0 811 0 0 0 0 8113 0 1,078 0 0 0 0 0 1,0784 1,120 0 0 0 0 0 0 1,120.a 0 0 0 0 0 775 0 775.b 0 0 0 0 0 0 4,816 4,816.d 0 0 0 0 0 0 0 4.e 0 0 0 0 0 0 0 16

Total 1,120 1,078 811 247 117 775 4,816 8,984

Page 66: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

50 Chapter 3. Preparing Data for Analysis

Figure 3.5: Two-way tabulation dialog

RECODE ofR3485300

(FATHCRITICIZE FATH CRITICIZE IDEAS

IDEAS 19991999) .d .e Total

HLI100 0 0 1171 0 0 2472 0 0 8113 0 0 1,0784 0 0 1,120.a 0 0 775.b 0 0 4,816.d 4 0 4.e 0 16 16

Total 4 16 8,984

From the table generated, we quickly see that everything happened as we anticipated.Those adolescents who had a score of 0 on the original variable (column with a 0 at thetop, now have a score of 4 on the new variable. There are 1,120 of these observations.We need to check the other combinations as well. Also note that the missing value codeswere all transferred to the new variable. Our new variable facitr is a reverse codingof our old variable R3485300.

Page 67: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3.6 Create scales 51

Beyond egen

We have seen how generate and egen cover a wide variety of functions. You canalways enter the command help generate or help egen to see a description ofall the functions that are available. Sometimes this is not enough. Nicholas J.Cox has written and continues to update an command called egenmore that addsmany additional functions that are helpful for data management. You can enterssc install egenmore, replace in your Command window and this will installthese added capabilities. The option replace will check for updates he has addedif egenmore is already installed on your machine. If you then enter help egenmoreyou will get a description of all the capabilities Cox has added.

3.6 Create scales

We are finally ready to calculate our scales. There are two scale variables that willbe constructed, one for the adolescent’s perception of his or her mother and one theadolescent’s perception of his or her father. With four items, each scored from 0 to4, the scores could range from 0 to 16 points. Higher scores indicate a more positiverelationship with the parent.

At first glance, this is straightforward. We just add the variables together, whichgives us

. generate ymorelate = mopraise + mocritr + mohelp + moblamer

. generate yfarelate = fapraise + facritr + fahelp + fablamer

Before we settle on this, we should understand what will happen in the case of missingvalues, and we should check through the documentation to find out what they did aboutmissing values. The first thing to do is to determine how many observations answered,one, two, three, or all four of the items.This introduces a new command and dialog box:egen shown in Figure 3.6.

From the . Data menu, select . Create or change variables, . Create new variable(extended) to display the egen dialog. As you can see, there is a box into which you putthe name of the variable to be created which we will call mamissing because this willtell us how many of the items about mothers that each adolescent missed answering.The next step is to select the function from the list. In this case, we want to look amongthe row functions for one that will count missing values, which appears as . Row numberof missing values. Last, we enter the items about the mother in the box provided. Theseselections will generate the command

. egen mamissing = rowmiss(mapraise macritr mahelp mablamer)

If we then look at a tabulation of the mamissing variable, the rows will be numbered 0to 4, and each entry will indicate how many observations have that many missing values.

Page 68: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

52 Chapter 3. Preparing Data for Analysis

Figure 3.6: The extended generate dialog

For example, there are 4,510 observations for which all four items were answered (nomissing values), 6 for which there are three items answered (one missing value), 2 forwhich there are 2 answered items (two missing values), and 4,466 for which there areno answered items.

. tab mamissing

mamissing Freq. Percent Cum.

0 4,510 50.20 50.201 6 0.07 50.272 2 0.02 50.294 4,466 49.71 100.00

Total 8,984 100.00

From this, we can see that there are a total of 8 observations for which there is some,but not complete, data, There should be little doubt what to do for the 4,510 cases withcomplete information, nor for the 4,466 with no information (all missing). The problemwith using the sum of the items as our score on ymorelate is that the 8 observationswith partial data will be dropped, i.e., given a value of missing.

There are several solutions. One thing we might decide to do is to compute themean of the items that are answered. We can do this with the egen command. We canget this from the egen menu using the . Data menu, select . Create or change variables,. Create new variable (extended) to display the egen dialog as we displayed in Figure 3.5.However, this time we pick will call our scale mameana and pick row mean from the listof egen functions. This generates a mean of the items that were answered, regardless

Page 69: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3.6 Create scales 53

of how many were answered, as long as at least one item was answered. If we do atabulation for mameana we have 4,518 observations. This mean that everybody whoanswered at least one item has a mean of the items they answered.

Another solution is to have some minimum number of items, say 75% or 80% ofthem. We might only include people who answered at least 3 of the items. Thesepeople would have fewer than 2 missing items. We can go back to our menu for egenand rename the variable we are computing mameanb. Clicking on the tab for by/if/inwe can stipulate the condition that mamissing < 2 as shown in Figure 3.7.

Figure 3.7: The extended generate dialog

The Stata command this generates is egen float mameanb = rowmean(mopraisemocritr mohelp moblamer) if mamissing < 2. This creats a variable, mameanb, thatis the mean of the items for each observation which has at least 3 of the 4 items answered.Doing a tabluation on this shows that there are 4,516 observations with a valid score.We were able to keep the 6 cases that answered all but one of the item and we droppedthe 2 cases that were missing more than one item.

Page 70: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

54 Chapter 3. Preparing Data for Analysis

Setting how much output is in the Results window

If you are working in a class and doing homework is the extent of your Statausage, you might not run out of space in the Results window. If you are workingon a research project, or need to generate more than a few models and see theirresults, you will almost certainly want to increase the number of lines that you canscroll back in the Results window. You do this from the . Prefs menu, . GeneralPreferences. Click on the Windowing tab. The default size for the scrollback bufferis 32000 characters. Make it at least 200000, which is about .2 MB. You may alsowant to turn the More feature off using the command set more off. If you enterthe command help set you will find many other things you can personalize aboutthe Stata interface.

Deciding among different ways to do something

When you need to decide among two or more different ways of performing a datamanagement task—which is what we’ve been doing this whole chapter—it is usuallybetter to choose based on how easy it will be for someone to read and understandthe record rather than based on the number of commands it takes or based onsome notion of computer efficiency. The computing of the row mean has a majoradvantage over computing the sum of items. The mean is on the same 0-4 scale thatthe items have and this often makes more sense than a scale that has a differentrange. For example, if you had 9 items that ranged from 1 for strongly agree to5 for strongly disagree, a mean of 4 would tell you that the person has averagedagree over the set of items. This might be easier for users to understand than atotal or sum score of 36 on a scale that ranges from 9 (1 on each item) to 45 (5 oneach item).

3.7 Save some of your data

There will be many occasions when you wish to save only part of your data. You maywish to save only some variables, or you may wish to save only some observations.Perhaps you have created variables that were needed as part of a calculation but donot need to be saved, or perhaps you wish to work only with the created scales not theoriginal items. You might want to drop some observations based on some characteristic,such as sex or parent’s educational level or geographic area.

To drop variables from your dataset use . Data menu, choose . Variable utilities,. Eliminate variables, which will display the dialog box shown in Figure 3.8. To dropobservations from your dataset use . Data, . Variable utilities, . Eliminate observations,which will display the dialog box shown in Figure 3.9.

Page 71: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3.7 Save some of your data 55

Figure 3.8: Selecting variables to drop

Figure 3.9: Selecting observations with an expression

Page 72: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

56 Chapter 3. Preparing Data for Analysis

Sometimes it is easier to list the variables you wish to keep instead of those you wishto drop. As you can see from Figure 3.8, that’s done from the same dialog box. Clickthe Keep button, then check the box to keep variables, and fill in the list of variablesyou want to keep.

Often the criterion for selecting what to save to a file is not the variable, but theobservation. For example, there might be a large, master dataset, and you wish towork only with those 14 and younger. Again, you have the choice to specify whichobservations by specifying which to keep or which to drop. To select by observation,select the Keep or Drop button, then fill in the expression. Figure 3.9 show the dialogfilled in to keep respondents age 14 and younger.

We’re going to save two versions of our dataset. We’ll save one that contains all thevariables, and we’ll save one from which we’ve dropped the original R0 variables. Thecommands to do this are:

. * Save the master dataset before dropping variables

. save nlsy97_mstr, replace .

. * Drop the original items because we no longer need them

. drop R0000100-R0538200

. * Reorder the variables when saving to make them easier to read

. order id sex age ymorelate yfarelate m* f*

. save nsly97_fp, replace

Note two things about the save command: We’ve left off the .dta extension, andwe’ve used the replace option. Stata automatically appends .dta to the file name, sowe can leave it off. We use the replace option so that if we rerun these command, wewon’t get an error because nsly97 fp.dta already exists.

3.8 Summary

This chapter has covered a lot of ground. You may need to refer back to this chapterwhen you are creating your own datasets. We’ve covered how to label variables andcreate value labels for each possible response to an item. In learning how to create ascale we covered reverse coding of items that were stated negatively and how to createand modify items. We also covered how to create a scale and work with missing values,especially where some people who have missing values are included in the scale andsome are excluded. Finally, we covered how to save parts of a file whether the partswere selected items or selected observations.

In the next chapter we’ll look at how to create a file containing Stata commands thatcan be run as a group. Our webpage has such a command file for the current chapter(c3.do. By recording all the commands into a program we have a record of what wedid and we can re-run it or modify it in the future. We’ll also see how to record bothoutput and commands in files.

Page 73: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

3.9 Exercises 57

3.9 Exercises

1. Open the relate.dta Stata dataset. The variable R3828700 is the sex of theadolescent. It is coded as a 1 for males and 2 for females. There are a numberof people who are missing because they dropped out of the study after a previouswave of data collection. These people have a missing value on R3828700 of -5.Modify the variable so that the code of -5 will be recognized by Stata as a missingvalue. Then do a tabulation of R3828700.

2. Using the result of the first exercise, generate a new variable called male that hasa value of 1 for males and a value of 0 for females. Do a two way table of maleand R3828700 that shows you did this correctly.

3. Using the result of the second exercise, assign value labels to male so a 1 says Maleand a 0 says female. Do a tabulation of the variable male that shows the labeling.Next, enter the command set numlabel, add in the Command window. Thenrepeat the tabulation of the variable male. What is the difference?

4. The relate.dta dataset is data from the third year of a long term study. Becauseof this, some of the youth who were adolescents the first year (1997) are over18. Assign a missing value on R3828100 (age) for those with a code of -5.Dropobservations that are 18 or over. Keep only R0000100, R3483600, R3483800,R3485200, R3485400, R3828700. Save this as a dataset called positive.dta.

5. Using the positive.dta dataset, assign missing values to the four items R3483600,R3483800, R3485200, R3485400. Copy these four items to four new items calledmompraise, momhelp, dadpraise, dadhelp.

6. Create a scale called parents of how youth relate to their parents using the fouritems (R3483600, R3483800, R3485200, R3485400). Do this separately for boys(if R3828700 == 1) and girls (if R3828700 == 2). Use the row mean functionto create your scale. Do a tabulation of parents.

Page 74: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University
Page 75: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

Author index

BBureau of Labor Statistics, Appendix 9

. . . . . . . . . . . . . 39

CCox, N. J., egenmore . . . . . . . . . . . . . . . 51

MMaker, trouble . . . . . . . . . . . . . . . . . . . . . . . 1

SStataCorp,Data Management reference

Manual . . . . . . . . . . . . . . . . . . . . 37

Page 76: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University
Page 77: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

Subject index

Ccodebook . . . . . . . . . . . . . . . . . . . . . . . 22, 40coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22Command Window . . . . . . . . . . . . . . . . . . 4command, clear . . . . . . . . . . . . . . . . . . . . . 26command, codebook . . . . . . . . . . . . . . . . 32command, describe . . . . . . . . . . . . . . . . . 34command, destring . . . . . . . . . . . . . . . . . 27command, driven . . . . . . . . . . . . . . . . . . . . 2command, drop observations . . . . . . . . 54command, drop variables . . . . . . . . . . . 54command, edit, nolabel . . . . . . . . . . . . . 42command, egen . . . . . . . . . . . . . . . . . . . . . 46command, egen rowmiss . . . . . . . . . . . . 51command, egenmore . . . . . . . . . . . . . . . . 51command, generate . . . . . . . . . . . . . . . . . 46command, generate math functions. .47command, generate rowmean . . . . . . . 53command, histogram. . . . . . . . . . . . . . . .10command, keep observations . . . . . . . . 54command, keep variables. . . . . . . . . . . .54command, label define . . . . . . . . . . 30, 41command, label values . . . . . . . . . . 30, 42command, more . . . . . . . . . . . . . . . . . . . . 33command, mvdecode . . . . . . . . . . . . 41, 49command, replace . . . . . . . . . . . . . . . . . . 46command, saving . . . . . . . . . . . . . . . . . . . 31command, set more off . . . . . . . . . . . . . . 54command, variable labels . . . . . . . . . . . 28conventions used in the book . . . . . . . 16

DData Editor . . . . . . . . . . . . . . . . . . . . . . . . 26Data Editor, nolabels . . . . . . . . . . . . . . . 42dataset, contents. . . . . . . . . . . . . . . . . . . .32dataset, create . . . . . . . . . . . . . . . . . . . . . . 19Datasets used, cancer.dta . . . . . . . . . . . . 8

dialog box, submit vs ok . . . . . . . . . . . . 15Dialog box, summarize . . . . . . . . . . . . . . . 9dictionary file . . . . . . . . . . . . . . . . . . . . . . .38Do-files, introduction . . . . . . . . . . . . . . . . 2

Eentering data . . . . . . . . . . . . . . . . . . . . . . . 26exit Stata. . . . . . . . . . . . . . . . . . . . . . . . . . .15

Fformat, numeric . . . . . . . . . . . . . . . . . . . . 27format, string. . . . . . . . . . . . . . . . . . . . . . .27

Ggenerate, math functions . . . . . . . . . . . . 47GUI interface, Preferences . . . . . . . . . . . 3

Hhelp, videos . . . . . . . . . . . . . . . . . . . . . . . . . 18help, web base . . . . . . . . . . . . . . . . . . . . . . 18

Mmissing values . . . . . . . . . . . 22, 24, 41, 49missing values, types. . . . . . . . . . . . . . . .40more . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

NNLSY97 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

OOpen, Stata installed dataset, sysuse. .7Open,existing dataset . . . . . . . . . . . . . . . . 7options, add . . . . . . . . . . . . . . . . . . . . . . . . 30options, modify . . . . . . . . . . . . . . . . . . . . . 30

PPreferences . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Page 78: A Gentle Introduction to Stata - Oregon State Universitypeople.oregonstate.edu/~acock/hdfs361/stata1_3.pdf · A Gentle Introduction to Stata Alan Acock Oregon State University

62 Subject index

project outline . . . . . . . . . . . . . . . . . . . . . . 38

Rrecoding, ranges . . . . . . . . . . . . . . . . . . . . 45Results Window, scroll size . . . . . . . . . 54Results, more . . . . . . . . . . . . . . . . . . . . . . . 54Results, Window. . . . . . . . . . . . . . . . . . . . .4reverse code variables . . . . . . . . . . . . . . . 43Review Window . . . . . . . . . . . . . . . . . . . . . 4

Sscale creation . . . . . . . . . . . . . . . . . . . . . . . 51schemes, s1monochrome . . . . . . . . . . . . 12Stata Screen . . . . . . . . . . . . . . . . . . . . . . . . . 3summarize . . . . . . . . . . . . . . . . . . . . . . . . . . . 9sysuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

UUCLA Stata portal . . . . . . . . . . . . . . . . . 18

Vvalue labels. . . . . . . . . . . . . . . . . .20, 28, 41value labels, assign. . . . . . . . . . . . . . . . . .42variable label . . . . . . . . . . . . . . . . . . . . . . . 20variable name . . . . . . . . . . . . . . . . . . . 20, 22variable names. . . . . . . . . . . . . . . . . . . . . .20Variables Window . . . . . . . . . . . . . . . . . . . 4

WWorking directory . . . . . . . . . . . . . . . . . . . 7