BEGINNING HANDS-ON EXAMPLES

From this point on we will work with interactive SAS with the DMS.
 

Inputting Data 1: Read Data Embedded in the SAS Commands (in the Program Editor)

This exercise shows how to read three variables using free format. Free format means there is no need to indicate the column locations of variables, because values of different variables are separated by one or more spaces. The data reside in the Program Editor, along with the SAS instructions. In this exercise, after reading the data into a temporary SAS data set called inc, the means procedure produces descriptive statistics.


Exercise: Read Data from the Program Editor/Compute Descriptive Stats

Start SAS in interactive mode with the Display Manager System (DMS) --

     sas &
Then type the following lines in the Program Editor. If you make a mistake, backspace over the error and retype.

(Image: income1.sas)

Note: The first field of numbers holds values for the variable gender, the second field holds values for race, and the third field holds values for persinc. Be sure the data for gender are in column 3, the data for race are in column 5, and the data for persinc are right-justified in columns 7 and 8.

These alignments are not important for the current exercise, but are needed for a later exercise.

Before submitting this program, check the Program Editor options -- Tools/Options/Program Editor. Go to the Editing tab. Notice two options: (1) Clear text on submit and (2) Split lines on carriage return. You may want to deselect the first of these and select the second.

Deselecting "Clear text on submit" prevents the Program Editor from clearing when you submit a job. Selecting "Split lines on carriage return" instructs the Program Editor to insert a line after the current line when you press the Enter key.

Also, it's worth repeating that you can set the F3 key to clear the Log Window and Output Window each time you use the F3 key to submit a job --

(Image: F3 Set)

Now run the program and view the results -- press the F3 key or select Run/Submit from the menus. You can activate any window by clicking on it with the mouse or by selecting the View menu and selecting the window you wish to view. Results are displayed in the Log and Output windows.

If you made no typing errors, output from the means procedure is displayed in the Output Window.

If you have no output, check the Log Window for errors. You can move the windows so that the Log Window and Program Editor can be viewed side by side. Notice the color coding of the instructions in the Program Editor. The colors help you spot problems: blue for SAS commands, black for options, and green for numeric values and character strings. If you can't find the mistake, I will come around to help you. (Note: If you did not deselect "Clear text on submit," recall the program by pressing the F4 key.)

The Log Window lists the commands in the Program Editor and gives several pieces of information, each identified by "NOTE:". For example, it tells how many cases were written to the SAS data set and how many cases were read by the means procedure. The Output Window lists output from SAS procedures.

This program reads a few cases from inside the Program Editor into a temporary SAS data set called inc. This SAS data set is read by the means procedure to produce means, standard deviations, etc. Unless instructed otherwise, SAS procedures always read the most recently created SAS data set. The function of each statement is --

options ls=80;                     Restrict the output linesize to 80 characters or less per line
data inc;                          Start the data step and give the name inc to the temporary SAS data set
input gender race persinc;         Input data into three variables: gender, race, and persinc
if persinc > 23 then persinc=.;    Set persinc (personal income) to missing if it is greater than 23
datalines;                         Indicate that data are to follow (data end with a semicolon on a line by itself)
run;                               Indicate the end of the data step, instruct SAS to run it (optional statement, but recommended)
title ...;                         Places a title at the top of each page in the output. This is useful for negotiating through output after you have accumulated a lot of it.
proc means;                        Calculate descriptive statistics for all variables
run;                               Indicate the end of the proc step and instruct SAS to run it (optional statement, but recommended)
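
Assembled in the order listed, these statements give a program along the following lines. This is only a sketch: the gender values are the ones quoted in this exercise, but the race and persinc values, and the title wording after "Example 1", are made up for illustration. Statistics for race and persinc from this sketch will therefore not match the handout's.

```sas
options ls=80;

data inc;
  input gender race persinc;
  /* Set persinc to missing when it is greater than 23 */
  if persinc > 23 then persinc = .;
  /* Layout: gender in column 3, race in column 5,           */
  /* persinc right-justified in columns 7-8.                  */
  /* The race and persinc values below are hypothetical.      */
  datalines;
  1 1 12
  1 2  9
  1 1 17
  2 1 24
  1 2 11
  2 1  8
  2 2 15
  1 1 20
  2 2  6
  2 1 13
;
run;

title 'Example 1: data read from the Program Editor';
proc means;
run;
```

In this sketch the fourth case has persinc = 24, so the if statement sets it to missing, and proc means should report one fewer case for persinc than for gender and race.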

The variable names stand for the values in each field of numbers. The means procedure calculates the number of cases (n), the mean, standard deviation, minimum, and maximum for each variable. So, for example, 1.5 (the mean for gender) is the mean of the first column of numbers:

1 1 1 2 1 2 2 1 2 2

Inputting Data 2: Read Data from an External File

Reading data into SAS from the Program Editor is convenient when you have a small amount of data, but it is inconvenient to include large data files in the Program Editor, and some data files may be too wide to fit in it. The most common way to read raw data into SAS is from an external data file. To illustrate this procedure, place the data that follow the datalines statement in the Program Editor into a new file called income1.data.


Exercise: Reading Data from an External File

Start the pico editor by typing pico at the UNIX command prompt. Then use your mouse to copy and paste the data lines from the Program Editor into pico. The result should look like --

(Image: income1.data)

Save the file by pressing ^X and responding to the prompts with the letter y then the Enter key.

Next, edit the commands in the Program Editor by taking out the data lines and inserting an infile statement. To delete a block of lines, place dd in the prefix area beside the first line to be deleted and another dd beside the last line to be deleted, then press Enter. To insert a line, place the letter i in the prefix area of the line above where you want to insert the new line. (See an image of the prefix commands.) Note: If you selected "Split lines on carriage return," you can add a line by pressing the Enter key at the end of the line above where the new line is to be inserted.

Finally, change "Example 1" in the title statement to "Example 2". The result should look like --

(Image: Read Data)
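
As a rough sketch of what the edited program might look like: the data lines and the datalines statement are gone, and an infile statement names the external file. The title wording after "Example 2" is hypothetical, and the relative file name assumes income1.data was saved in the directory from which SAS was started.

```sas
options ls=80;

data inc;
  /* Assumption: income1.data is in the directory where SAS was started */
  infile 'income1.data';
  input gender race persinc;
  /* Set persinc to missing when it is greater than 23 */
  if persinc > 23 then persinc = .;
run;

title 'Example 2: data read from an external file';
proc means;
run;
```

If the file is somewhere else, give its full path inside the quotation marks on the infile statement.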

To avoid extra typing, save this file; call it income2.sas. Click File/Save As and type the file name in the dialog.

Now run the program and view the results -- press the F3 key or select Run/Submit from the menus. By default, SAS log output is appended to the existing content of the Log Window, and output from procedures is appended to the existing content of the Output Window.

Now the Results Window may be useful. Click the + beside the two "Means:" entries and widen the window so you can read the entire titles, so that your window looks like --

(Image: Results Window)

Right click the output you want to view (here "Summary statistics") and release on Open. The Output Window scrolls to the output you select. Note that the descriptive statistics reported by the means procedure are exactly the same for both sets of output.


Exercise: Reading Data from an External File: Fixed Format

This exercise repeats the previous two, but the input statement now designates the exact column(s) spanned by each variable. Type the following lines into the Program Editor --

(Image: Read data example)

This exercise explains why the 1st Exercise instructs you to be sure the data reside in fixed locations.

Also save this file, and call it income3.sas. Click File/Save As and type the file name in the dialog.

The numbers that follow each of the variable names, gender, race, and persinc, indicate the columns spanned by the values for each variable. Gender starts in column 3 and ends in column 3. Race starts and ends in column 5, but persinc spans two columns; it begins in column 7 and ends in column 8. (Column here used in the first sense.)

The spacing of the column-location numbers in the Program Editor is for readability. Only one space is required between the variable name and the columns spanned, and it is not necessary to start each variable on a new line. But these formatting conventions do make the program easy to read and make errors easier to spot.

Once again, run the program and view the results -- press the F3 key or select Run/Submit from the menus. SAS log output is again appended to the existing content of the Log Window, and output from procedures is again appended to the existing content of the Output Window, unless you redefined the F3 key to clear both windows and submitted the job with the F3 key.

This exercise illustrates an additional SAS procedure, the print procedure. By default, it lists all variables and all observations in the Output Window. Select the output from the print procedure in the Results Window and observe that it lists the numbers that were read from the input file, income1.data. It reformats the listing for readability, however, by adding column headers, for example.
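
A minimal sketch of a fixed-format version of the program is given here for orientation. The column locations are the ones described above; the title wording is hypothetical, and the sketch assumes that this exercise runs proc means followed by proc print.

```sas
options ls=80;

data inc;
  /* Assumption: income1.data is in the directory where SAS was started */
  infile 'income1.data';
  /* Column input: each variable name is followed by the column(s) it spans */
  input gender   3
        race     5
        persinc  7-8;
  if persinc > 23 then persinc = .;
run;

title 'Example 3: fixed-format input';   /* hypothetical title */
proc means;
run;

/* List all variables and all observations */
proc print;
run;
```

The same input statement could be written on one line as input gender 3 race 5 persinc 7-8; the stacked layout only makes the column locations easier to check.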


Exercise: Reading Data from a Large Data File -- the GSS

Most large data sets stored as text require fixed-column input because the fields associated with each variable are not separated by any delimiter. For example, the first 10 lines and 72 columns of the data file associated with the General Social Survey Cumulative File, 1972-2004, with a column marker added at the top, are --
----+----1----+----2----+----3----+----4----+----5----+----6----+----7--
2004   17    2                    1  22140-1      1   56739 60     1   8
2004   27    2                    1  228-1-11     1   27432691     2   9
2004   3140        22     999  9905                                     
2004   4148        21     875284713   2                            2   8
2004   5136        21      14648425                                     
2004   64          22     267316405                                2   7
2004   7135        22     57551 605                                2   4
2004   8299        22     33931 605                                2   8
2004   9225        22     274325915                                2    
2004  10140        22     379348415                                2   5
The spaces are not delimiters; they indicate missing information. Clearly, you need a codebook and fixed-column input to read a file like this. Codebooks include a list of the variables and the columns associated with each one. In this example, the first variable is the year of the survey. It is named YEAR and is located in columns 1-4. The second variable is an "Original ID Number" and is located in columns 5-8. The next variable is called WRKSTAT in the codebook and is located in column 9, etc. (WRKSTAT stands for the work status of the respondent: employed full time, employed part time, unemployed, ....)
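
One way to look at the first few records from inside SAS is a data step that copies them to the Log Window without creating a data set. The following is a sketch only: the file path is the one given later in this exercise, and line is a hypothetical variable name.

```sas
data _null_;
  /* obs=10 stops reading after the tenth record;                 */
  /* truncover avoids reading past a record shorter than 72 cols  */
  infile '/hsm/users/dsacher/icpsr/sn4295/data/04295-0001-Data.txt'
         obs=10 truncover;
  /* line (hypothetical name) holds columns 1-72 of each record */
  input line $char72.;
  /* Write the 72 characters to the log, keeping leading blanks */
  put line $char72.;
run;
```

The $char72. informat and format are used because they preserve leading blanks, so the columns in the log line up with the columns in the file.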

This exercise shows one example of reading data from a large data file, the 1972-2004 General Social Survey (GSS) Cumulative File. The data file contains results for every year the survey was conducted up to and including 2004. We will use data from 2004 only. The survey was administered to a national sample of some 1,500 respondents nearly every year from 1972 through 1993. Beginning in 1994, the survey was conducted every other year, and the sample size was doubled to about 3,000 respondents. The survey is conducted by the National Opinion Research Center (NORC) at the University of Chicago. It contains a variety of questions, including basic demographics, socioeconomic information, and opinions on many issues. A new sample of respondents is drawn for each survey, so the data consist of a sequence of cross-sectional samples -- not a panel design.

Computer files are sectioned into records. Each record is analogous to a line on a sheet of paper. Usually, but not always, each record contains data for one observation, e.g., one respondent to the GSS. Note that many screen pagers, such as less, wrap lines that are too long to fit on the screen. Only the display is wrapped; the file is not changed.

Users of the data learn how to read it by looking at the codebook. For each variable, the GSS codebook gives --


Clear out the Program Editor and type in the following program:

(Image: Read GSS data)

You can copy and paste the long file name from here:

'/hsm/users/dsacher/icpsr/sn4295/data/04295-0001-Data.txt'
but not from the image.

Note:

Usually you do not use variables just as they are extracted from a raw data file. When defining variables for use in analysis you can use names that appeal to you. For example, you may create a dummy variable called female that is 1 for female respondents and 0 for male respondents.
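
A small sketch of how such a dummy variable might be created follows. The data set name people, the three data values, and the coding of gender (1 = male, 2 = female) are all assumptions made for illustration; check the codebook for the actual coding of the variable you use.

```sas
data people;                       /* hypothetical data set name */
  input gender;
  /* Assumption: gender is coded 1 = male, 2 = female */
  if gender = 2 then female = 1;
  else if gender = 1 then female = 0;
  /* Any other value of gender leaves female missing */
  datalines;
1
2
2
;
run;

proc print;
run;
```

Coding the two cases explicitly, rather than treating everything that is not female as male, keeps missing or invalid codes from being counted as male.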

To complete the exercise, clear (if necessary) your Log and Output windows and run this job. It takes a little time to run because it reads 46,510 records. The output from proc means, however, reports just 2,812 cases, because we kept only the surveys done in 2004. Look at the NOTEs in the Log Window to check these numbers.

The output from proc means shows no variation for the variable YEAR, because only year 2004 is kept in the SAS data set. Each of the statistics is reported with three places after the decimal, as instructed by the maxdec option. Check the codebook for the meaning of the numbers for the income variable, rincome98. They are codes standing for income intervals. Consequently, descriptive statistics such as the mean and standard deviation provide little useful information.
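
For orientation, here is a minimal sketch of the kind of program this exercise describes. It reads only the three variables whose columns are given above; rincome98 is left out because its column location is not stated in this text and must be taken from the codebook. The data set name gss04, the variable name id, and the title are hypothetical.

```sas
options ls=80;

data gss04;                        /* hypothetical data set name */
  infile '/hsm/users/dsacher/icpsr/sn4295/data/04295-0001-Data.txt';
  input year     1-4
        id       5-8               /* Original ID Number; the name id is assumed */
        wrkstat  9;
  /* Subsetting if: keep only the surveys done in 2004 */
  if year = 2004;
run;

title 'GSS 2004';                  /* hypothetical title */
/* maxdec=3 reports each statistic with three decimal places */
proc means maxdec=3;
run;
```

The subsetting if statement is what reduces the records read from the file to the 2004 cases written to the SAS data set.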

Before continuing, save this SAS program. Go to the Program Editor and recall the program if necessary. Then select File/Save As and save it under the name gssread.sas.