# Chapter 5: REGRESS

## 5.1 Introduction

The REGRESS module fits linear, logistic, Poisson, polytomous, tobit and proportional hazards regression models. All keywords available in the DESCRIBE module also apply here; this chapter describes the additional commands specific to regression analysis. One main difference is that DESCRIBE uses Taylor series linearization for variance estimation, whereas REGRESS uses the Jackknife Repeated Replication technique to estimate design-based variances (Kish and Frankel 1974).
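To make the idea of replication-based variance estimation concrete, here is a minimal Python sketch of a delete-one-cluster jackknife for a stratified cluster sample, applied to a weighted mean. It is an illustration only: the function names and the data layout are hypothetical, and the replicate construction shown (drop one cluster, inflate the weights of the remaining clusters in that stratum) is a common textbook variant, not necessarily the exact procedure REGRESS implements.

```python
from collections import defaultdict


def weighted_mean(rows):
    """Weighted mean of y over rows of (stratum, cluster, weight, y)."""
    total_weight = sum(w for _, _, w, _ in rows)
    return sum(w * y for _, _, w, y in rows) / total_weight


def jackknife_variance(rows, estimator=weighted_mean):
    """Return (full-sample estimate, jackknife variance).

    One replicate is formed per cluster by dropping that cluster and
    rescaling the weights of the other clusters in the same stratum.
    """
    clusters = defaultdict(set)
    for stratum, cluster, _, _ in rows:
        clusters[stratum].add(cluster)

    full = estimator(rows)
    variance = 0.0
    for stratum, members in clusters.items():
        n_h = len(members)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two clusters")
        rescale = n_h / (n_h - 1)
        for dropped in members:
            replicate = [
                (s, c, w * rescale if s == stratum else w, y)
                for s, c, w, y in rows
                if not (s == stratum and c == dropped)
            ]
            variance += (n_h - 1) / n_h * (estimator(replicate) - full) ** 2
    return full, variance


# Made-up data: one stratum, two clusters, one observation each.
example = [(1, "a", 1.0, 1.0), (1, "b", 1.0, 3.0)]
estimate, variance = jackknife_variance(example)
```

Because the estimator is passed in as a function, the same loop would apply to a regression coefficient: refit the model on each replicate and accumulate the squared deviations from the full-sample coefficient.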

## 5.2 REGRESS Statements

### 5.2.1 Models

DEPENDENT variable name;

This statement specifies the name of the dependent variable in the regression model. Dependent variables are assumed to be continuous unless the CATEGORICAL keyword is included as described below.

PREDICTOR variable list;

This statement specifies the right-hand side of the regression model. Predictor variables are assumed to be continuous unless they are declared CATEGORICAL as described below. Interaction terms are specified with the '*' notation. For example,

PREDICTOR Income Age Income*Age;

LINK defines the type of regression model to be fit. Specify Linear to fit a multiple linear regression model, Logistic to fit a logistic (binary) or generalized logistic (polytomous) regression model, Log to fit a Poisson regression model for a count variable, Tobit to fit a tobit model, or Phreg to fit a proportional hazards (Cox) model.

CENSOR variable name (number);

Here, variable name is the censoring variable and number is the code that marks a censored observation. If number is omitted, 1 is taken as the code for censored observations. The CENSOR statement is required when LINK is specified as Phreg. For example,

DEPENDENT Survivaltime;
CENSOR Died (0);

In this example, the outcome variable is Survivaltime and the censoring variable is Died where Died=0 denotes censored observations.

CATEGORICAL variable list;

Declares that the listed variables are to be treated as categorical. If a variable with k categories is listed on both the CATEGORICAL and PREDICTOR statements, then k-1 indicator (dummy) predictors are included in the regression model, and the category with the highest code value is the reference category. For logistic and multinomial logit models, the dependent variable must also be listed on the CATEGORICAL statement.
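The coding rule described above (k-1 indicators, highest code as reference) can be sketched in a few lines of Python. The function name `dummy_code` is hypothetical and the snippet only illustrates the rule; it does not reproduce how REGRESS builds its design matrix internally.

```python
def dummy_code(values):
    """Return (kept levels, reference level, indicator rows).

    The highest code is the reference category, so a variable with
    k distinct codes yields k-1 indicator columns.
    """
    levels = sorted(set(values))
    reference = levels[-1]
    kept = levels[:-1]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, reference, rows


# Made-up variable with codes 1, 2 and 3: code 3 becomes the reference.
kept, reference, rows = dummy_code([1, 2, 3, 2])
```

An observation in the reference category has zeros in every indicator column, so each fitted coefficient is a contrast against the highest-coded category.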

OFFSETS count-variable(offset-variable);

This statement specifies an offset variable when fitting a Poisson regression model. For example,

OFFSETS Injuries(Years);

will fit a model predicting the number of injuries occurring per year.
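As a sketch of what an offset does, the Python snippet below uses the conventional formulation log(mean count) = log(exposure) + linear predictor, so that the coefficients describe a rate per unit of exposure. This is an assumption about the usual Poisson offset model; the chapter does not spell out the exact form REGRESS uses, and the function name and coefficient values are hypothetical.

```python
import math


def expected_count(exposure, intercept, slopes, x):
    """Mean count under log(mu) = log(exposure) + intercept + sum(slope * x)."""
    linear_predictor = intercept + sum(b * v for b, v in zip(slopes, x))
    return exposure * math.exp(linear_predictor)


# Made-up coefficients: same covariates, observation periods of 2 and 4 years.
two_years = expected_count(2.0, -1.0, [0.3], [1.0])
four_years = expected_count(4.0, -1.0, [0.3], [1.0])
```

Doubling the exposure doubles the expected count while leaving the implied yearly rate unchanged, which is the sense in which the model describes injuries per year.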

ID variable name;

Specifies the variable to be used as the unique subject identifier. This allows for linking the PREDOUT file (see below) created by the REGRESS module to other files.

NOINTER;

This keyword will fit regression models without the intercept term.

ESTIMATES label: specification;

This is useful for estimating values of the dependent variable for a specific set of covariates or testing hypotheses involving the estimated regression coefficients. For example, suppose that the following regression model is fit:

Y = b0 + b1x1 + b2x2 + b3x3

and we are interested in predicting Y for x1 = 1, x2 = 2 and x3 = 0. We can obtain the predicted value and the 95% confidence interval by using the following statement:

ESTIMATES Mylabel : Intercept (1) x1(1) x2(2);

Several estimates can be requested in one statement by separating them with a slash ('/').
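The arithmetic behind such an estimate is a linear combination of the fitted coefficients. The Python sketch below assumes that terms not named in the specification receive a multiplier of 0 (which would explain why x3 is absent from the example above) and uses a normal quantile of 1.96 for the interval; both are assumptions, since REGRESS may use a different reference distribution. The function name, coefficients and covariance matrix are made up for illustration.

```python
import math


def linear_estimate(multipliers, coefs, cov, z=1.96):
    """Estimate, standard error and interval for sum(multiplier * coefficient).

    coefs maps term names to estimates; cov is the covariance matrix
    in the same term order; unnamed terms get a multiplier of 0.
    """
    names = list(coefs)
    c = [multipliers.get(name, 0.0) for name in names]
    estimate = sum(ck * coefs[name] for ck, name in zip(c, names))
    size = len(c)
    variance = sum(c[i] * cov[i][j] * c[j] for i in range(size) for j in range(size))
    se = math.sqrt(variance)
    return estimate, se, (estimate - z * se, estimate + z * se)


# Made-up fit: b0 = 1.0, b1 = 0.5, b2 = 0.25, b3 = 2.0, uncorrelated estimates.
coefs = {"Intercept": 1.0, "x1": 0.5, "x2": 0.25, "x3": 2.0}
cov = [[0.01 if i == j else 0.0 for j in range(4)] for i in range(4)]
estimate, se, interval = linear_estimate({"Intercept": 1, "x1": 1, "x2": 2}, coefs, cov)
```

The same function covers hypothesis tests on contrasts: multipliers of 1 on x1 and -1 on x2, for instance, give the estimated difference b1 - b2 and its standard error.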

### 5.2.2 Output files

The REGRESS module can be used to produce several plots and outputs for later processing. The following are the descriptions of these features.

PLOT filename;

This keyword creates a series of diagnostic plots, including residual, leverage, influence and normal probability plots. The plots are stored in the file named after the PLOT keyword. The user can rely on the built-in graphics or use gnuplot by downloading that package and including its path in the XML settings file. See Chapter 9 for examples.

PREDOUT filename;

Outputs a file containing the predicted values, their standard errors and 95% confidence intervals. If an ID statement is included in the setup, the ID variable is also written to this file.

ESTOUT filename;

Outputs a file containing the estimates and their variances and covariances.

REPOUT filename;

Outputs a file containing estimates for each replicate. Estimated regression coefficients are provided for each combination of STRATUM, CLUSTER and BY variable.

### 5.2.3 Design Variables

The design features can be specified using the commands STRATUM, CLUSTER, and WEIGHT as illustrated in the DESCRIBE chapter.

1. If the STRATUM, CLUSTER and WEIGHT variables are not specified, then a simple random sample analysis will be performed.
2. If a design-based analysis involves only a WEIGHT variable and no STRATUM or CLUSTER variable, then a pseudo-stratification variable and a pseudo-cluster variable should be used. When using pseudo variables, all observations in the data set should have the same value for the pseudo STRATUM variable (e.g., 1), while each observation should have a unique value on the pseudo CLUSTER variable (e.g., the observation ID number or the SAS system variable `_N_`). The pseudo variables should be created in the data prior to performing the analysis. Example SAS data step code for creating a pseudo STRATUM variable and a pseudo CLUSTER variable:

   ```
   LIBNAME MYLIB 'C:\MYINDIR';  /* folder that holds the data set */
   DATA MYLIB.MYDATA;
       SET MYLIB.MYDATA;
       PSEUD_STRAT = 1;         /* same stratum for every observation */
       PSEUD_CLUST = _N_;       /* unique cluster per observation */
   RUN;
   ```

   Note that the inclusion of pseudo variables will increase the time REGRESS needs for analysis.
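For data prepared outside SAS, the same pseudo variables can be created with a few lines of Python before the file is handed to REGRESS. This is a sketch under the assumption that the data are held as a list of records (one dictionary per observation); the function name is hypothetical, and the variable names simply mirror those in the SAS example.

```python
def add_pseudo_design(records):
    """Return copies of the records with pseudo design variables added.

    PSEUD_STRAT is 1 for every observation; PSEUD_CLUST is the
    1-based row number, so each observation is its own cluster.
    """
    result = []
    for row_number, record in enumerate(records, start=1):
        row = dict(record)
        row["PSEUD_STRAT"] = 1
        row["PSEUD_CLUST"] = row_number
        result.append(row)
    return result


# Made-up records with only an analysis weight.
data = add_pseudo_design([{"WT": 1.5}, {"WT": 2.0}, {"WT": 0.8}])
```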

TITLE text \n text;

Indicates the title(s) to be printed at the top of each page of the printout. A \n indicates that the text that follows should be printed on the next line. For example,

TITLE This is the title on the first line \n This is the title on the second line;