Class 1
Introduction to SAS
Reading data
1. Inline
2. External file
a. Text file
b. Excel file
Invoking SAS procedures
Using SAS Software
About this Class
There might not be an exam.
My evaluation of your achievement is based on
your homework and periodic quizzes in class.
After each class, you should save your SAS files
and upload them to BB.
The same is for class quizzes.
A note of warning: If I find anyone copying
another student's homework, both will get a zero grade.
Repeated offenders will fail this class!
Introduction to SAS
SAS can be used interactively, but most of its
power is in batch mode, with which I’m
most familiar.
A SAS program is made up of two kinds of steps:
Data step
Procedure step
Everything in SAS is accomplished by one
of these steps.
This is the most fundamental concept in
SAS.
SAS's approach to data analysis
Before we conduct any statistical analysis in SAS,
we first need to get data into SAS.
We might need to do something with the data: modify
it or create new variables, for example.
We might need to combine data from different places,
in some ways.
We either need to stack them on top of each other, or
we need to merge them together.
Once data is organized, we can analyze.
In the real world, most of our time is spent
on getting data organized.
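The two ways of combining data mentioned above, stacking and merging, can be sketched as follows. This is only a minimal illustration: the dataset names (class_a, class_b, ages), the variables, and the values are all made up for this example.

```sas
/* Two small datasets with the same variables (hypothetical names and values) */
data class_a;
   input id score;
   datalines;
1 80
2 75
;
run;

data class_b;
   input id score;
   datalines;
3 90
4 65
;
run;

/* Stack: the rows of class_b are placed below the rows of class_a */
data stacked;
   set class_a class_b;
run;

/* A dataset holding a different variable for the same ids */
data ages;
   input id age;
   datalines;
1 20
2 21
3 19
4 22
;
run;

/* Merge: combine columns side by side, matching rows on id.
   Both inputs must already be sorted by id (they are here). */
data merged;
   merge stacked ages;
   by id;
run;

proc print data=merged;
run;
```

SET with several dataset names stacks them; MERGE with a BY statement matches rows on the BY variable. If the inputs were not already in id order, a PROC SORT step would be needed before the merge.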
SAS’s Data Step
In the first step, you use SAS to create a dataset
for SAS procedures to work with.
SAS’s power is most demonstrated here.
There seems to be no data that SAS can’t read.
It all depends on the programmer writing code
that tells SAS how to read the data.
SAS data is viewed as variables in columns and
observations in rows.
It does not normally treat data as a matrix!
So you can't normally access an arbitrary element of
a dataset, only one row at a time.
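To see the row-at-a-time processing in action, here is a small sketch. The dataset name rows and the variable total are made up for illustration; _N_ is SAS's automatic counter of DATA step iterations.

```sas
data rows;
   input a b;
   total = a + b;           /* computed from the current row only */
   put _n_= a= b= total=;   /* writes one line to the log for each row */
   datalines;
23 45
34 56
;
run;
```

After running this, the log should contain one PUT line per observation, showing that the statements in the DATA step are executed once for every row of data.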
SAS’s Procedure Step
Once your data has been created, using SAS
procedures is truly a simple matter.
There are countless predefined SAS
procedures that can take care of just about
every statistical analysis that you can think of.
The richness of those procedures makes SAS
stand out as the king of statistical analysis!
Let's start with some very simple examples.
Please open and run ex2-3-1.sas.
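A Textbook Example (ex2-3-1.sas)
DATA ex1; /* start of the DATA step; creates a dataset named ex1 */
INPUT A B; /* read the values of numeric variables A and B */
DATALINES ; /* the data lines follow */
23 45
34 56
;
RUN; /* end of the DATA step (optional) */
PROC PRINT; /* start of the PROC step */
RUN; /* end of the PROC step; run the program */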
Data step
In our last example, the first line:
“DATA ex1;”
“DATA” is the SAS keyword (shown in blue in the SAS editor),
ex1 is the name of the new dataset, and
“;” ends the statement.
The 2nd line:
“INPUT A B;”
“INPUT” is the SAS keyword, and
A and B are the names of the two variables to read.
The 3rd line:
“DATALINES ;”
tells SAS that the lines that follow are the data.
It must be the last statement within the data step!
SAS outputs
Once a program segment has been
submitted, there are two main outputs from
SAS:
Log (日志)
Lists information about how the program ran
Output (输出)
Lists the actual output of each procedure
Let's all first open and run the sample code
“ex2-3-1” from within SAS and see the
actual log output (see next).
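SAS log file
1 *ex2-3-1; /* this is a comment statement */
2 DATA ex1; /* start of the DATA step; creates a dataset named ex1 */
3 INPUT A B; /* read the values of numeric variables A and B */
4 DATALINES ;
NOTE: The data set WORK.EX1 has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time 0.28 seconds
      cpu time 0.00 seconds
4 ! /* the data lines follow */
7 ;
8 RUN; /* end of the DATA step */
9 PROC PRINT; /* start of the PROC step */
10 RUN;
NOTE: There were 2 observations read from the data set WORK.EX1.
NOTE: PROCEDURE PRINT used (Total process time):
      real time 0.42 seconds
      cpu time 0.03 seconds
10 ! /* end of the PROC step; run the program */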
As we can see, the log file keeps track of program running information, as a record of SAS’s work.
Let's see another example next.
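2nd Textbook Example (ex2-3-2.sas)
*ex2-3-2; /* a statement starting with an asterisk is a one-line comment */
DATA score;
INPUT num $ name $ English computer ; /* read 4 variables */
DATALINES; /* starts the data lines; the CARDS statement can be used instead */
081 ZHANGLIN 88 90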
082 ZHAOHUA 99 89
083 WANGQANG 78 96
084 LIULI 84 79
085 SHIDONG 69 88
086 KONGYING 77 79
087 LILING 82 67
088 GUANFEN 80 91
091 MAQIANG 66 78
092 NEWHUA 88 99
; /* the semicolon ends the data lines */
PROC MEANS; /* invoke the MEANS procedure */
RUN;
Let's all open and run this sample SAS code, and I will try to explain
what it does in greater detail.
Input character variable
In this example, the 2nd line:
“INPUT num $ name $ English computer ;”
Again, “Input” is the SAS keyword
This time, we are inputting 4 variables, for example:
081 ZHANGLIN 88 90
Here we want to take “081” as text, not a number
We need a way to tell SAS, so we do it by adding a “$”
after the variable’s name in the input statement
Since there is a “$” after the first 2 variable names,
they are considered character variables in SAS
The last two variables are numerical variables
Let's see the SAS code actually run next.
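The effect of the “$” can be checked with a small sketch like the one below. The dataset name ids and the variable names num_c and num_n are made up for this illustration; the same value is read once as text and once as a number.

```sas
data ids;
   input num_c $ num_n;   /* num_c is read as text, num_n as a number */
   datalines;
081 081
092 092
;
run;

proc print data=ids;      /* num_c keeps its leading zero; num_n does not */
run;

proc contents data=ids;   /* lists each variable with its type (Char or Num) */
run;
```

PROC CONTENTS is a convenient way to confirm which variables SAS stored as character and which as numeric.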
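Two SAS outputs: Log and List
SAS log file output:
11 *ex2-3-2;
12 DATA score;
13 INPUT num $ name $ English computer ; /* read 4 variables */
14 DATALINES;
NOTE: The data set WORK.SCORE has 10 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time 0.01 seconds
      cpu time 0.00 seconds
14 ! /* starts the data lines; the CARDS statement can be used instead */
25 ; /* the semicolon ends the data lines */
26 PROC MEANS; /* invoke the MEANS procedure */
27 RUN;
NOTE: There were 10 observations read from the data set WORK.SCORE.
NOTE: PROCEDURE MEANS used (Total process time):
      real time 0.62 seconds
      cpu time 0.04 seconds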
SAS list output:
The SAS System        04:48:01 PM Thursday, April 22, 2010    2
The MEANS Procedure
Variable     N          Mean       Std Dev       Minimum       Maximum
---------------------------------------------------------------------------
English     10    81.1000000     9.5852897    66.0000000    99.0000000
computer    10    85.6000000     9.6861872    67.0000000    99.0000000
Data Storage in SAS
So far, the SAS datasets we’ve created are temporary:
once we quit SAS, they are gone.
If we want to save a SAS dataset for later use,
we need to store it in a permanent location.
This is done in SAS by assigning a name (a libref) to
a folder or directory with the “Libname”
statement.
A dataset saved there has a two-level name:
libref.SAS-data-set.
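As a minimal sketch of the idea, the code below saves the earlier ex1 data permanently. The libref mylib and the folder 'C:\sasdata' are placeholders chosen for this example; the folder must already exist on your machine.

```sas
/* 'C:\sasdata' is a placeholder path; change it to an existing folder */
libname mylib 'C:\sasdata';

data mylib.ex1;       /* two-level name: saved as a file in the mylib folder */
   input A B;
   datalines;
23 45
34 56
;
run;

data ex1_temp;        /* one-level name: stored in the temporary WORK library */
   set mylib.ex1;
run;

proc print data=mylib.ex1;
run;
```

In a later SAS session, running the same libname statement again makes mylib.ex1 available without re-reading the raw data, while ex1_temp would be gone.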
Reading data from a text file
Reading External Data
Most of the time, we need to read data that is
stored outside the program code.
SAS has a rich set of approaches to read
external data.
Here we will show a simple way to read external
data.
We will provide SAS with the location of the
external file with a “filename” statement.
Please open the file “Attend.sas”.
The following is what you will see.
An Example of Regression with SAS
options linesize=120;
filename datain 'D:\teaching\Data\textfiles\ATTEND.raw'; /* location of the external raw input file */
libname dataout 'D:\teaching\sas\sasfile'; /* location of the permanent sas dataset */
data dataout.attend;
infile datain ;
input attend termGPA priGPA ACT final atndrte hwrte frosh soph skipped stndfnl
;
prigpa2 =prigpa**2; /* creating new variables within data step */
act2 =act**2;
attendprigpa=attend * prigpa;
label attend ="classes attended out of 32"
termGPA ="GPA for term"
priGPA ="cumulative GPA prior to term"
ACT ="ACT score"
final ="final exam score"
atndrte ="percent classes attended"
hwrte ="percent homework turned in"
frosh ="=1 if freshman"
soph ="=1 if sophomore"
skipped ="number of classes skipped"
stndfnl ="(final - mean)/sd"
prigpa2 ="PriGPA^2"
act2 ="Act ^2"
attendprigpa="attend * PriGPA"
;
proc means; /* just be sure we did read the data correctly */
proc reg ;
eq_6_18: model stndfnl=attend prigpa act prigpa2 act2 attendprigpa;
run;
I will explain all of this later
Reading data from a text file
In a DATA step,
data can be read from a text file, and
output can be saved in a permanent location.
In our last example, we had:
filename datain 'D:\teaching\Data\textfiles\ATTEND.raw';
/* please modify the location of your data file here */
libname dataout 'D:\teaching\sas\sasfile';
/* you need to modify this location also */
data dataout.attend;
infile datain ;
input attend termGPA priGPA ACT final
atndrte hwrte frosh soph skipped stndfnl
;
Please make the changes first
before running the program!
Matching labels
In a DATA step,
data can be read from a text file, and
output can be saved in a permanent location.
Here is our last example again, with the fileref and libref renamed:
filename mydatain 'D:\teaching\Data\textfiles\ATTEND.raw';
/* please modify the location of your data file here */
libname myout 'D:\teaching\sas\sasfile';
/* you need to modify this location also; a libref can be at most 8 characters long */
data myout.attend;
infile mydatain ;
input attend termGPA priGPA ACT final
atndrte hwrte frosh soph skipped stndfnl
;
We can give the fileref and the libref any
names we want, as long as each name matches
where it is used later, is at most 8 characters
long, and is not a keyword.
Reading data from a text file
After we have defined the filename as “datain” and given a
location for it, we can use it in the DATA step with
the statement “Infile datain;”
We have also defined a location for a permanent SAS
dataset with the “Libname” statement, and called it
“dataout”, so that we can save the dataset we are going to
create in the data step as “dataout.attend”.
In this case, we can think of the libname “dataout” as a nickname for a folder.
This time, we don’t have to put the data inside our SAS
program.
Instead, we read the input variables directly from the
external text file.
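The filename and infile mechanism can be tried without any data file on disk, using the sketch below. It first writes a tiny text file and then reads it back. The names rawtmp and fromfile are made up for this example, and it assumes the TEMP device of the filename statement is available in your SAS installation.

```sas
filename rawtmp temp;   /* fileref pointing to a temporary file (assumes the TEMP device) */

data _null_;            /* write a small text file so the example is self-contained */
   file rawtmp;
   put '23 45';
   put '34 56';
run;

data fromfile;
   infile rawtmp;       /* read from the external file instead of DATALINES */
   input A B;
run;

proc print data=fromfile;
run;
```

With a real data file, the first DATA step is not needed; the filename statement would simply point to the file's location, as in the ATTEND.raw example.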
Alternative Location Specification
We can specify a folder rather than a single file in the
filename statement, as in:
filename datain 'D:\teaching\Data\textfiles';
/* here we are showing the folder’s name only */
libname dataout 'D:\teaching\sas\sasfile';
/* you need to modify this location also */
data dataout.attend;
infile datain(ATTEND.raw) ;/* we list the file here. */
input attend termGPA priGPA ACT final
atndrte hwrte frosh soph skipped stndfnl
;
Location of external file
There are many ways to tell SAS where the external file
is located in the DATA step, for example:
1. Reading directly:
infile 'D:\teaching\Data\textfiles\ATTEND.raw';
2. Telling SAS where the file is located first with a filename
statement:
filename datain 'D:\teaching\Data\textfiles\ATTEND.raw';
…
infile datain ;
3. Telling SAS the folder where the data is, as in:
filename datain 'D:\teaching\Data\textfiles';
…
infile datain(ATTEND.raw);
What are the advantages and disadvantages of each?
Method 1. Adv: simple.
Dis-adv: the path is buried inside the code.
Method 2. Adv: stated at the top of the code, easy to modify.
Dis-adv: each file needs its own filename statement.
Method 3. Adv: stated at the top of the code, easy to modify.
Dis-adv: more complex.
Creating new variables in data step
We can create new variables in the data
step, as we’ve done, for example:
act2 =act**2; /* the SAS way of writing act squared */
Since SAS variable names used to be limited to
8 characters, we might want to attach a
more descriptive label to each variable, as in:
label attend ="classes attended out of 32"
termGPA="GPA for term"…;
We need to end our label statement with “;”.
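A small self-contained sketch of creating and labeling variables follows. The dataset name scores, the variable actgpa, and the two data rows are made up for this illustration.

```sas
data scores;
   input act priGPA;
   act2   = act**2;          /* ** is the exponent operator */
   actgpa = act * priGPA;    /* product of two existing variables */
   label act    = "ACT score"
         act2   = "ACT score squared"
         actgpa = "ACT score times prior GPA";
   datalines;
20 3.0
25 3.5
;
run;

proc print data=scores label;   /* the LABEL option prints labels as column headers */
run;
```

Note that the whole label statement is a single statement: the variable=label pairs are listed one after another and only the last one is followed by a semicolon.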
Invoking SAS regression procedure
After we have created a SAS dataset, we might want
to check it with:
proc means; /* just be sure we did read the data correctly */
Next, we can run an OLS regression as:
proc reg ;
eq_6_18: model stndfnl=attend prigpa act prigpa2 act2
attendprigpa;
run;
To run a regression with the “Reg” procedure, we need
a model statement of the form
dependent variable = a list of independent variables;
Be sure to end the statement with “;”.
More on OLS Regression
In this example, we also gave a label for the model (it is
the equation 6.18 of Wooldridge’s textbook)
If we have multiple model statements in a single “Reg”,
we will need a label on each one to identify its output.
A label is a (max 8 letters) word ending with “:”, as in
“eq_6_18:”.
By default, the “proc reg” procedure uses the
dataset we’ve most recently created.
Otherwise, we can use the data= option to indicate which
dataset to use.
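The data= option and labeled model statements can be sketched together as follows. The dataset demo, its variables, and its values are invented purely for illustration; they are not from the textbook data.

```sas
data demo;   /* made-up numbers, just enough rows to fit the models */
   input y x1 x2;
   datalines;
1.2 1 3
2.3 2 1
2.9 3 4
4.1 4 2
5.2 5 5
5.8 6 3
;
run;

proc reg data=demo;            /* DATA= names the dataset explicitly */
   simple: model y = x1;       /* the label "simple" identifies this model's output */
   full:   model y = x1 x2;    /* a second model in the same PROC REG */
run;
quit;                          /* PROC REG stays active until QUIT or another step */
```

Naming the dataset with data= is a safe habit, since it does not depend on which dataset happened to be created last.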
Homework Assignment
Please replicate the regression result of 6.4iii of
Wooldridge on page 223 using the HTV data.
The regression equation is:
log(wage) = β0 + β1 educ + β2 pareduc
+ β3 pareduc*educ + β4 expr + u
Note that pareduc is not in the dataset.
Please post your regression outputs from SAS
and the SAS log, without editing, to your Blackboard
account.
I might come up with a quiz for next class based
on today's class and the homework assignment!
A note about your homework
For homework, you can get help, but in the end you need to do it
yourself.
For those who are helping others, don't just give
them the code; explain how you did your work.
There will be a penalty for simply copying someone else's code.
For each group identified as having copied homework from one another,
every member loses one point per person in the group.
For example, if I find 5 people in a group sharing the
same SAS code, then 5 points will be deducted from
every person in the group, regardless of who copied
from whom.