# Chapter 7: SYNTHESIZE

## 7.1 Introduction

SYNTHESIZE uses the multivariate sequential regression approach to create full or partial synthetic data sets that limit statistical disclosure (see Raghunathan, Reiter and Rubin (2003), Reiter (2002) and Little, Liu and Raghunathan (2004) for more details). All item-missing values are also imputed when the synthetic data sets are created. However, the DESCRIBE, REGRESS and SASMOD modules cannot be used to analyze synthetic data sets because they DO NOT implement the appropriate combining rules. See the examples in later chapters for a demonstration of the correct combining rules.

Except for the IMPLICATES command, which specifies the number of synthesized data sets to be generated, SYNTHESIZE commands are the same as those for IMPUTE.

## 7.2 SYNTHESIZE Statements

### 7.2.1 Variable Types

SYNTHESIZE requires that all variables be defined by type. Six types of variables are recognized by the IMPUTE module: continuous, categorical, count, mixed, transfer and drop. If no variable types are specified, all variables will be assumed to be continuous. Variable types should be declared before any BOUNDS, INTERACT, or RESTRICT statements.

CONTINUOUS variable list;

Variables declared as CONTINUOUS may take on any value on a continuum. For example, income is a continuous variable. A normal linear regression model is used to synthesize the missing values in these variables. You may want to transform the variable to achieve normality and then use SYNTHESIZE on the transformed scale. After imputation you may re-transform the variable back to its original form.
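The transform and back-transform workflow can be sketched outside IVEware. The following Python fragment is an illustration only, not IVEware syntax. It assumes a log transform is suitable for a strictly positive, right-skewed variable such as income; the function names and the sample values are hypothetical.

```python
import math

def to_log_scale(values):
    """Log-transform strictly positive values; None marks a missing value."""
    return [None if v is None else math.log(v) for v in values]

def from_log_scale(values):
    """Back-transform log-scale values to the original scale."""
    return [None if v is None else math.exp(v) for v in values]

income = [25000.0, None, 48000.0, 120000.0]  # hypothetical data
log_income = to_log_scale(income)            # scale on which SYNTHESIZE would run
restored = from_log_scale(log_income)        # back-transform after synthesis
```

Missing values pass through both steps unchanged, so the module still sees them as missing on the transformed scale.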

CATEGORICAL variable list;

CATEGORICAL variables have values that represent discrete categories. For example, gender is a categorical variable. A logistic or generalized logistic model is used to impute missing categorical values.

MIXED variable list;

Variables declared as MIXED are both categorical and continuous. In a mixed variable, a value of zero is treated as a discrete category, while values greater than zero are considered continuous. Alcohol consumption is an example of a mixed variable. A two-stage model is used to impute the missing values. First, a logistic regression model is used to impute zero versus non-zero status. Then, conditional on imputing a non-zero status, a normal linear regression model is used to impute non-zero values.
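The two-stage logic can be sketched in Python as follows. This is an illustration only and not the IVEware implementation: the probability of a non-zero status and the mean and standard deviation of the non-zero part are passed in as fixed, hypothetical numbers, whereas the module would obtain them from the fitted logistic and linear regression models.

```python
import random

def impute_mixed(p_nonzero, mean, sd, rng):
    """Two-stage draw for a MIXED variable (illustrative only).

    Stage 1: draw zero versus non-zero status with probability p_nonzero.
    Stage 2: if the status is non-zero, draw the value from a normal
    distribution with the given mean and standard deviation.
    """
    if rng.random() >= p_nonzero:
        return 0.0
    return rng.gauss(mean, sd)

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
draws = [impute_mixed(0.6, 3.0, 0.5, rng) for _ in range(1000)]
share_zero = sum(d == 0.0 for d in draws) / len(draws)
```

With these assumed inputs, roughly 40 percent of the draws fall in the zero category and the rest are scattered around 3.0.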

COUNT variable list;

COUNT variables have non-negative integer values. A Poisson regression model is used to impute the missing values. The number of annual doctor visits is an example of a COUNT variable.

DROP variable list;

Variables listed after the DROP keyword will be excluded from the imputation procedure and will not appear in the imputed data set.

TRANSFER variable list;

Variables listed after the TRANSFER keyword are carried over to the imputed data set, but are not imputed nor used as predictors in the imputation model. Transfer variables, however, can be used in the RESTRICT and BOUNDS statements (see below). ID is an example of a variable that you may want to treat as a transfer variable.

DEFAULT variable type;

The variable type can be CONTINUOUS, CATEGORICAL, COUNT, MIXED, TRANSFER or DROP. This keyword declares that, by default, all the variables in the data set should be treated as the selected variable type. The most efficient use of the DEFAULT statement is to declare the most numerous variable type in your data set as the default type, which eliminates the need to type a long list of variables.

Optional Statements

RESTRICT variable(logical expression);

This command restricts the imputation of a variable to those observations that satisfy the logical expression. For instance, suppose that the variable Yrssmoke indicates the number of years an individual smoked, and the variable Smoke takes the value 1 for a current smoker, 2 for a former smoker or 0 for someone who never smoked. Then the declaration:

RESTRICT Yrssmoke(Smoke=1,2);

will impute Yrssmoke values only for current and former smokers. It will automatically set Yrssmoke equal to 0 for those who never smoked. Restrictions on more than one variable may be combined as follows:

RESTRICT Yrssmoke(Smoke=1,2) Births(Gender=2) Income(Employed=1);

When the restriction is not met, the value of the restricted variable will be set to zero for continuous variables and one higher than the highest observed code for categorical variables.
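The fill-in rule for observations that fail a restriction can be sketched in Python. This is an illustration of the rule as stated above, not IVEware code; the function names and the dictionary record layout are hypothetical, and the case of a missing restricting variable is deliberately not handled.

```python
def restricted_value(var_type, observed_codes=None):
    """Value assigned to a restricted variable when the restriction is not met."""
    if var_type == "continuous":
        return 0
    if var_type == "categorical":
        # One higher than the highest observed code.
        return max(observed_codes) + 1
    raise ValueError("rule is stated only for continuous and categorical variables")

def apply_restrict(record):
    """Mimic RESTRICT Yrssmoke(Smoke=1,2) for one record stored as a dict."""
    if record["Smoke"] not in (1, 2):
        return dict(record, Yrssmoke=restricted_value("continuous"))
    return record  # restriction met: Yrssmoke is left for imputation

never_smoked = apply_restrict({"Smoke": 0, "Yrssmoke": None})
former_smoker = apply_restrict({"Smoke": 2, "Yrssmoke": None})
```

Here `never_smoked["Yrssmoke"]` becomes 0, while the former smoker's value stays missing so that it can be imputed.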

BOUNDS variable (logical expression);

This keyword is useful for restricting the range of values to be imputed for a variable. For example,

BOUNDS Yrssmoke (> 0, <= Age-12);

will ensure that the imputed values for Yrssmoke are greater than 0 and no larger than the individual's Age minus 12. Smoking is assumed not to begin before the age of 12. As in the RESTRICT statement, more than one variable can be included in the BOUNDS statement. For example,

BOUNDS Yrssmoke (>0, <= Age-12) Numcig(>0);
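After a run, it can be useful to confirm that the output respects the declared bounds. The Python sketch below is an illustration only: it checks the Yrssmoke bounds from the example against a few hypothetical records and does not show how IVEware enforces bounds internally.

```python
def within_bounds(yrssmoke, age):
    """True when 0 < Yrssmoke <= Age - 12, as in the BOUNDS example."""
    return 0 < yrssmoke <= age - 12

records = [  # hypothetical imputed records
    {"Age": 40, "Yrssmoke": 10.0},
    {"Age": 20, "Yrssmoke": 9.0},   # exceeds Age - 12 = 8
    {"Age": 35, "Yrssmoke": 0.0},   # violates the > 0 bound
]
violations = [r for r in records if not within_bounds(r["Yrssmoke"], r["Age"])]
```

An empty `violations` list is what a correctly bounded run should produce; the two deliberately bad records here are flagged.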

Model-Building Statements

The following commands are useful in the specification of the imputation model.

INTERACT variable*variable;

This keyword enables users to specify interaction terms to be included in the imputation regression model. For example,

INTERACT Income*Income, Age*Race;

In this example, the imputation model for every variable will include a squared term for Income and an interaction term between Age and Race.
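The two terms in this example are products of existing columns. The Python sketch below shows the arithmetic for a single record. It is an illustration only: it assumes Race is coded as a 0/1 indicator, which is a simplification, and the column labels are hypothetical rather than names produced by IVEware.

```python
def add_interactions(row):
    """Return a copy of row with the two product terms added."""
    out = dict(row)
    out["Income*Income"] = row["Income"] * row["Income"]  # squared term
    out["Age*Race"] = row["Age"] * row["Race"]            # interaction term
    return out

expanded = add_interactions({"Income": 3.0, "Age": 40, "Race": 1})
```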

Options for Stepwise Regression

MAXPRED number;

MAXPRED varlist2 (number);

Specifies the maximum number of predictor variables to be included in the regression model. A stepwise regression procedure is used to select the best predictors subject to this maximum. Setting MAXPRED to a small number will greatly reduce the computational time, especially for very large data sets, but the imputations will not be fully conditional.

For example,

MAXPRED 5;

will include the five best predictor variables, that is, the five making the largest contribution to the R-squared statistic. You can also restrict the number of predictors for selected variables.

For example,

MAXPRED Income (7) Educ (3);

will limit the predictors of Income to the seven largest contributors to the R-squared statistic, while the predictors of the variable Educ are limited to the three largest contributors. For other variables, all variables will be used as predictors.

MINRSQD decimal;

Specifies the minimum marginal R-square for a stepwise regression, that is, the minimum initial marginal R-square for a logistic regression, and the minimum initial R-square for any model being predicted by a polytomous regression. This option can reduce computation time. A small value such as 0.005 will build very large regression models, whereas 0.25 will include a smaller number of predictors in the regression models. If neither MAXPRED nor MINRSQD is set, then no stepwise regression will be performed.

MINRSQD 0.01;

In the above example, only variables that contribute an additional R-square of 0.01 or more will be included as predictors.
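The way MAXPRED and MINRSQD cap a model can be sketched in Python. This is a simplified illustration, not the IVEware algorithm: it assumes the additional R-square of each candidate has already been computed and stays fixed, whereas a real stepwise procedure recomputes the contributions after every step. The function name and the numbers are hypothetical.

```python
def select_predictors(gains, maxpred=None, minrsqd=None):
    """Pick predictors from (name, additional R-square) pairs.

    Candidates are taken in order of decreasing contribution until the
    MAXPRED count is reached or the next contribution is below MINRSQD.
    With neither limit set, every candidate is kept.
    """
    ranked = sorted(gains, key=lambda g: g[1], reverse=True)
    chosen = []
    for name, gain in ranked:
        if maxpred is not None and len(chosen) >= maxpred:
            break
        if minrsqd is not None and gain < minrsqd:
            break
        chosen.append(name)
    return chosen

gains = [("Age", 0.20), ("Educ", 0.08), ("Race", 0.004), ("Region", 0.03)]
top_two = select_predictors(gains, maxpred=2)
above_threshold = select_predictors(gains, minrsqd=0.01)
```

With these assumed contributions, a maximum of two predictors keeps Age and Educ, while a threshold of 0.01 keeps Age, Educ and Region and drops Race.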

MAXLOGI number;

Specifies the maximum number of iterations to be performed when fitting a logistic or multi-logit regression model. The default is 50. Raising this limit is useful if the Newton-Raphson algorithm used to produce maximum likelihood estimates does not converge within 50 iterations. The limit applies to the logistic, polytomous and Poisson regression models. You can check whether you have such a non-convergence problem by inspecting the log file (e.g., mysetup.log).

MINCODI decimal;

Specifies the minimum proportional change in any regression coefficient to continue the logistic regression iteration process. This applies to the convergence criterion for the logistic, polytomous and Poisson regression models.
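The interplay between an iteration cap and a proportional-change criterion can be sketched with a small Newton-Raphson loop in Python. This is an illustration only. It fits an intercept-only logistic model, treats the stopping rule as "stop when the proportional change in the coefficient falls below the threshold", and uses the hypothetical parameter names `maxlogi` and `mincodi` to mirror the keywords. The exact criterion used by IVEware may differ.

```python
import math

def fit_intercept_logit(y, maxlogi=50, mincodi=1e-6):
    """Newton-Raphson for an intercept-only logistic model.

    Returns (beta, iterations_used, converged). Iteration stops when the
    proportional change in beta is below mincodi or after maxlogi steps.
    """
    n, successes = len(y), sum(y)
    if successes in (0, n):
        raise ValueError("y must contain both 0 and 1 for a finite estimate")
    beta = 0.0
    for iteration in range(1, maxlogi + 1):
        p = 1.0 / (1.0 + math.exp(-beta))
        # Newton step: score divided by information.
        step = (successes - n * p) / (n * p * (1.0 - p))
        change = abs(step) / max(abs(beta), 1e-12)
        beta += step
        if change < mincodi:
            return beta, iteration, True
    return beta, maxlogi, False

y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
beta, iterations_used, converged = fit_intercept_logit(y)
```

For this model the estimate should agree with the log odds of the observed proportion, `math.log(0.3 / 0.7)`. Setting `maxlogi=1` shows the non-convergence case, in which the function returns with `converged` equal to `False`.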

ITERATIONS number;

Specifies the number of cycles you would like the imputation program to carry out for each variable and implicate/multiple. You can specify any number greater than or equal to 2. Current investigations show that about 10 cycles are sufficient for most imputations. You may want to experiment with several values and check the differences in the resulting analysis.

IMPLICATES number;

Indicates the number of synthesized data sets to be created. By default, only a single synthesized data set is generated.

MULTIPLES number;

You can perform imputation within the SYNTHESIZE procedure. The value of the MULTIPLES option indicates the number of imputations to be performed. By default, only a single imputation is generated. Note that IMPUTE is processed prior to SYNTHESIZE. For each multiple, a set of synthesized data sets is created based on the number of implicates specified. For example, if 2 multiples and 5 implicates are specified, then 10 synthesized data sets will be created: five for multiple 1 and five for multiple 2.
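The count of output data sets is the product of the two settings. The short Python sketch below is an illustration only; the (multiple, implicate) labels are a hypothetical naming scheme, not IVEware output names.

```python
from itertools import product

def synthetic_data_sets(multiples, implicates):
    """List (multiple, implicate) labels, one per synthesized data set."""
    return list(product(range(1, multiples + 1), range(1, implicates + 1)))

labels = synthetic_data_sets(multiples=2, implicates=5)
```

For 2 multiples and 5 implicates, `labels` holds 10 pairs, running from (1, 1) through (2, 5).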

PERTURB instruction;

The keyword PERTURB followed by an instruction of COEF or SIR allows the user to control perturbations of imputed values. By default, the IMPUTE module will perturb the model coefficients using a multivariate normal approximation of the posterior distribution, and the predicted values using the appropriate regression model conditional on the perturbed coefficients. This is equivalent to using the COEF instruction. SIR uses the Sampling-Importance-Resampling algorithm to generate coefficients from the actual posterior distribution of the parameters in the logistic, polytomous or Poisson regression models (see Rubin 1987a, Raghunathan and Rubin 1988, Raghunathan 1994, Gelman et al. 1995). SIR is appropriate in situations where the normal approximation to the posterior distribution is poor.

SEED number;

Specifies a seed for the random draws from the posterior predictive distribution. This number should be greater than zero. A zero seed will result in no perturbations of the predicted values or the regression coefficients. If the SEED keyword is missing from the setup file, then the seed will be determined by your computer's internal clock.

NOBS number;

The NOBS option indicates the number of observations to be used in the analysis. By default, all observations in the data set will be used. You might use NOBS to subset a large data set while testing your setup file.

OFFSETS count variables (offset variable);

This statement is used to specify an offset variable when fitting a Poisson regression model. For example,

OFFSETS Injuries(Years);

will fit a model predicting the number of injuries occurring per year.
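The role of an offset can be sketched in Python. This is an illustration only, assuming the usual log-link formulation in which the log of the exposure (here Years) enters the linear predictor with a fixed coefficient of 1, so that the regression describes a rate per year. The function name and the numbers are hypothetical.

```python
import math

def expected_count(linear_predictor, exposure):
    """Expected count under a log-link Poisson model with offset log(exposure)."""
    return math.exp(math.log(exposure) + linear_predictor)

rate_per_year = math.exp(-1.0)            # rate implied by a linear predictor of -1
injuries_4_years = expected_count(-1.0, 4.0)
```

Under this assumption the expected count is the exposure times the rate, so four years of exposure gives four times the yearly rate.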

PRINT instruction;

Indicates the printout desired. The options are STANDARD, DETAILS, COEF, and ALL. The STANDARD and DETAILS keywords instruct IVEware to print the number and distribution of observed values, imputed/synthesized values, and combined observed and imputed/synthesized values for each variable. The keyword COEF additionally prints the unperturbed and perturbed coefficients for each iteration of each imputation/synthesis. When the ALL keyword is used, the coefficient covariance matrix for each iteration of each multiple imputation is printed in addition to the above.

A list of the variables used in the imputation/synthesis model is also printed, with columns indicating the number of observed cases and the number of imputed cases for each of the variables. The third column of the variable list, labeled double counted, is to be used for diagnostic purposes. This entry should be zero. A non-zero entry indicates that the imputed value of a restricting variable has caused the observed value of a restricted variable to be set to the restricted value (zero for continuous variables, one higher than the highest observed code for categorical variables; see RESTRICT above). This usually indicates problems with the restriction or an inconsistency in the observed data. In either case, you should run a data step before the imputation to check the appropriateness of the restriction or correct the data inconsistency.

For example, if the variable SMOKE, indicating whether or not a respondent smokes, is missing and the variable YRSMK, indicating the number of years the respondent has smoked, is observed, then logically the respondent should be classified as a smoker. If SMOKE is not given a value indicating the respondent is a smoker in a SAS data step prior to imputation, the missing value could possibly be imputed to a nonsmoker value, causing the IMPUTE/SYNTHESIZE command to change the observed value for YRSMK to zero.
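A check of this kind can be run before imputation. The Python sketch below is an illustration only and stands in for the data step mentioned above: it flags records in which SMOKE is missing while YRSMK is observed and positive. The record layout, with `None` marking a missing value, and the function name are hypothetical.

```python
def find_inconsistent(records):
    """Return records where SMOKE is missing but YRSMK is observed and positive."""
    return [
        r for r in records
        if r["SMOKE"] is None and r["YRSMK"] is not None and r["YRSMK"] > 0
    ]

records = [  # hypothetical input data
    {"ID": 1, "SMOKE": 1, "YRSMK": 12},
    {"ID": 2, "SMOKE": None, "YRSMK": 7},     # should be recoded as a smoker
    {"ID": 3, "SMOKE": None, "YRSMK": None},  # both missing: nothing to fix
]
flagged_ids = [r["ID"] for r in find_inconsistent(records)]
```

Only record 2 is flagged here. Flagged records can then be assigned a smoker value for SMOKE before the setup file is run, which should keep the double counted column at zero for this restriction.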