History and Future of Binary Segmentation Programs
By: James N. Morgan, Peter W. Solenberger, and Pauline R. Nagara
1. The binary segmentation method
Three common goals of research are to find whether there is an effect of X on Y with a low null probability, to estimate the size of that effect, and to estimate how much knowing X reduces the error in predicting Y. The first of these becomes trivial with samples in the thousands — anything worth looking at will be statistically significant. The second must deal with the possibility that there are different effects of X on Y for different subgroups in the population. If an increase in income goes mostly to the affluent who save a lot, the effect on consumption will be smaller than the population estimate. For very important covariates like income or education or age, one can search for groups with different levels of Y and regression slopes of X on it. Or one can use X as one predictor and check each subgroup for the effect of X.
Much statistical literature focuses on testing models with small samples. But when we have samples of 1000 or more, and a rich set of explanatory variables (and hence of competing models), different approaches are called for. We are selecting among a vast array of potentially mis-specified models.
Ever since Yule and Kendall wrote on grouping, it has been known that the loss of information from converting a variable into a few categories or groups, is very small (Yule and Kendall, 1937). Graham Kalton in an unpublished memo showed that for a variety of distributions, four or five groups accounted for well over 90% of the variance (Kalton, 1967).
That means that one can often explain more and better if each predictor is a set of categories represented by 1/0 (dummy) variables, because non-linear effects are allowed. The Multiple Classification Analysis program in the old Osiris software produced, for each category of each predictor, the unadjusted mean of the dependent (criterion) variable and the adjusted mean (or both as deviations from the grand mean), allowing one to see what the multiple regression was doing.
But one could also allow for interaction effects (non-additivity among explanatory variables) by moving to a sequential binary segmentation, using reduction in unexplained (error) variance to select the best split using the best predictor. Since one starts with substantial degrees of freedom, anything of any importance would be highly significant. But a stopping rule needed to take account of significance, by specifying a minimum group size or a null probability.
2. Binary segmentation programs
We first described such an automatic interaction detection program in an article in the Journal of the American Statistical Association (Morgan and Sonquist, 1963). Earlier, William Belson in England had proposed a segmentation approach in marketing (Belson, 1959). An actual program, called the Automatic Interaction Detector, was developed with funds from the National Science Foundation and described in a monograph (Sonquist and Morgan, 1964). It was later improved and described in another monograph (Sonquist, Baker and Morgan, 1974) and then improved again and renamed Search. It was heavily used in a research project published as Productive Americans (Morgan, Sirageldin, and Baerwaldt, 1966). It was also described in a chapter of a book on data analysis (Fielding, 1977). An overview of AID and other such appeared in 1992 (McKenzie and Low, 1992). A comparison of many programs with a categorical criterion variable revealed only small differences (Lim, Loh, and Shih 2000). Our comparisons with analyses of means also showed great similarity in the results.
Since then, a PC version of Osiris, called MicrOsiris, with rich recode, Multiple Classification Analysis, and Search has been made available by Van Eck Computer Consultants for a small charge. SPSS has a Tree program, Salford Systems has CART, and there are many “data mining” programs of varying complexity and cost being marketed (see www.kdnuggets.com).
Such programs are not mindless ransacking, since one must specify the variables and procedure in advance, but unless the user withholds a representative sample for testing the results, it is easily possible that the last few splits are idiosyncratic. However, the first ones are extremely reliable.
3. Enhanced Search
We are making available an enhanced version of Search, with documentation, at no cost, except for consultation, on the web at http://www.src.isr.umich.edu/software/search-software-for-exploring-data-structure/. It allows four types of criterion variable: a mean, a simple covariance, a set of categories, and a ranking of categories. The splitting criterion is reduction in error variance for the first two, a likelihood ratio chi square for the third, and a rank correlation for the fourth. The stopping rules include the number of splits, minimum final group size, minimum error reduction, or maximum null probability, even though significance testing after so much searching is hardly legitimate. Since graphic diagrams of the resulting “tree” become cumbersome with complex criterion variables, we have added a hierarchical summary table which, with a little editing, neatly displays the main results in publishable form (see the example below). Finally, we have added a warning when data weights are used, providing an estimate of the design effect that makes significance tests non-conservative.
The University of Michigan version of Search is a set of C and FORTRAN routines that can be launched from R, SAS, SPSS or Stata or run independently using data from many sources. It is currently available for personal computers using the Linux, Mac OS X and Microsoft Windows operating systems.
All the optional niceties of the old Search program remain: premiums for symmetry, a table showing the power of rejected alternative splits, code for residuals, priority ranking of predictors, predefined splits, and a variety of available output, including a branching diagram for means. So one can use an analysis of data excluding missing values to assign missing values to those cases (reducing the variance, but with large samples that is usually trivial). Or one can use the recode of the probability of being included to weight data for selection bias.
One might think that with correlated predictors, splitting on one would eliminate the other from being used, but frequently the second still has some effect. If a predictor has no effect overall, or on any of the subgroups, one can safely conclude that it is not effective. We introduced a look-ahead option in an earlier version to seek out offsetting interaction effects, but we never found any, so we dropped it. We tried a “slopes-only” option with the covariance analysis but found the results unreliable, most of the variance being from the level anyway. The covariance analysis is useful in examining the effects of income and family size changes on changes in housing, food consumption, etc. (Morgan, 1985, 1990). The recode of expected values or residuals allows treating these as a new variable in analyzing other data, or in introducing explanatory variables presumed to come later in a causal sequence. It allows analysis of sequences of causation, where some predictors like age, gender, race clearly precede education and mobility, and they precede current occupation and earnings and marital status.
The “chi” analysis with a categorical criterion variable, sometimes called “decision tree” is useful in classification or diagnosis, focusing not on modal values (best guess) but on whole distributions. If the patient is in a class where most have one disease, a good doctor would like to know what other possibilities are still important, i.e., the whole distribution. The “rank” with a ranked criterion is useful where a numerical variable like saving or change in income is skewed, and where the extreme cases are often the result of conceptual or measurement error. Converting to 5 groups or so loses little information and eliminates distortion from extreme cases. The “regression” or covariance option is useful when there is one dominant or particularly important variable like income, education, gender, or race. One wants to know not whether there is an effect, but just where it is most important.
Since basic knowledge is what economists refer to as a “social good” whose value increases with use and which one can give without losing, and since the National Science Foundation funded the original development, we believe that only the costs of distribution and of giving advice and consultation should be charged.
4. Example of Customizing Output for Presentation
The following hierarchical summary table is part of the standard output for a Search run segmenting a sample on the basis of hourly earnings.
HOURLY EARNINGS Group Tree Structure Group 1: All Cases N=5235, Sum(WT)=98087.1, Mean(Y)=13.0220 Group 2 V17545: V17545: EDUCATION 1989 H, Codes 1-6 N=4186, Sum(WT)=73028.4, Mean(Y)=10.8072 Group 10 V16973: V16973: D1 CHKPT, Codes 1 N=2535, Sum(WT)=40813.7, Mean(Y)=12.5852 Group 12 V17545: V17545: EDUCATION 1989 H, Codes 1-4 N=1420, Sum(WT)=22130.2, Mean(Y)=11.0112 Group 13 V17545: V17545: EDUCATION 1989 H, Codes 5,6 N=1115, Sum(WT)=18683.5, Mean(Y)=14.4495 Group 14 V16631: V16631: AGE OF 1989 HEAD, Codes 1-3 N=703, Sum(WT)=9154.10, Mean(Y)=12.5821 Group 15 V16631: V16631: AGE OF 1989 HEAD, Codes 4-8 N=412, Sum(WT)=9529.40, Mean(Y)=16.2434 Group 18 V16306: V16306: SIZE LGST CITY/C, Codes 1-4 N=287, Sum(WT)=6513.40, Mean(Y)=17.8262 Group 19 V16306: V16306: SIZE LGST CITY/C, Codes 5,6 N=125, Sum(WT)=3016.00, Mean(Y)=12.8249 Group 11 V16973: V16973: D1 CHKPT, Codes 2,3 N=1651, Sum(WT)=32214.7, Mean(Y)=8.55455 Group 3 V17545: V17545: EDUCATION 1989 H, Codes 7,8 N=1049, Sum(WT)=25058.7, Mean(Y)=19.4765 Group 4 V16631: V16631: AGE OF 1989 HEAD, Codes 1,2 N=374, Sum(WT)=8068.60, Mean(Y)=14.1718 Group 5 V16631: V16631: AGE OF 1989 HEAD, Codes 3-8 N=675, Sum(WT)=16990.1, Mean(Y)=21.9958 Group 6 V16973: V16973: D1 CHKPT, Codes 1 N=530, Sum(WT)=11978.1, Mean(Y)=24.3445 Group 8 V16631: V16631: AGE OF 1989 HEAD, Codes 3,4 N=329, Sum(WT)=6092.80, Mean(Y)=20.7715 Group 16 V17543: V17543: HEAD GEOGRAPHIC, Codes 1 N=174, Sum(WT)=3267.40, Mean(Y)=16.8568 Group 17 V17543: V17543: HEAD GEOGRAPHIC, Codes 2,3 N=155, Sum(WT)=2825.40, Mean(Y)=25.2987 Group 9 V16631: V16631: AGE OF 1989 HEAD, Codes 5-8 N=201, Sum(WT)=5885.30, Mean(Y)=28.0435 Group 7 V16973: V16973: D1 CHKPT, Codes 2,3 N=145, Sum(WT)=5012.00, Mean(Y)=16.3825
And here is an easily revised final hierarchical table suitable for presentations. (You could also add vertical lines linking pairs, 2-3,6-7 etc.)
HOURLY EARNINGS OF FAMILY HEADS IN 1984 (Including only those with earnings between $.50 and 99.99) * =FINAL GROUP Group 1: All Cases N=5235, Mean(Y)=$13.02 Group 2 EDUCATION 1989 H, NO COLLEGE DEGREE N=4186, Mean(Y)=$10.81 Group 10 MARRIED MEN N=2535, Mean(Y)=$12.59 Group 12 EDUCATION 12 GRADES OR LESS N=1420, Mean(Y)=$11.01* Group 13 EDUCATION 13+ NO COLL DEGREE N=1115, Mean(Y)=$14.45 Group 14 AGE 18-34, N=703, Mean(Y)=$12.58* Group 15 AGE 35+ N=412 Mean(Y)=$16.24 Group 18 CITY OF 25,000+NEARBY, N=287, Mean(Y)=$17.83* Group 19 NO CITY OF 25,000+ NEARBY N=125, Mean(Y)=$12.82* Group 11 SINGLE MAN OR WOMAN N=1651, Mean(Y)=$8.55* Group 3 COLLEGE GRAD OR MORE N=1049, Mean(Y)=$19.48 Group 4 AGE 18-29 N=374, Mean(Y)=$14.17* Group 5 AGE 30+ N=675, Mean(Y)=$22.00 Group 6 MARRIED MEN N=530, Mean(Y)=$24.34 Group 8 AGE 30-39 N=329, Mean(Y)=$20.77 Group 16 LIVING IN SAME STATE GREW UP N=174, Mean(Y)=$16.86* Group 17 LIVING IN DIFFERENT STATE N=155, Mean(Y)=$25.30* Group 9 AGE 40+ N=201, Mean(Y)=$28.04* Group 7 SINGLE MEN, WOMEN N=145, Mean(Y)=$16.3825*
IMPLICATIONS: College grads need to move to a different state, while the rest need to be in or near a large city. Age matters more to college grads.
Belsen, William A. (1959), “Matching and Prediction on the Principle of Biological Classification”, Applied Statistics, VIII:65-75.
Fielding, A. (1977), “Binary Segmentation: The Automatic Interaction Detector and Related Techniques for Exploring Data Structures”, in Analysis of Survey Data, Vol I, Exploring Data Structures, Eds. C. A. O’Muircheartaigh and C. Payne, New York: Wiley, 1977.
Kalton, Graham (1967), “Techniques for Choosing the Number of Alternative Response Categories to Provide in Order to Coerce an Individual’s Position on a Continuum”, Sampling Section, Institute for Social Research, University of Michigan, Ann Arbor.
Lim, Tjen-Sien, Wei-Yin Loh, and Yu-Shan Shih (2000), “A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms”, Machine Learning, 10:203-228. (Focused on categorical criterion variables, extensive bibliography.)
Mckenzie, D. P., and L. H. Low (1992), “The Construction of Computerized Classification Systems Using Machine Learning Algorithms: An Overview”, Computers in Human Behavior, 8:155-167.
Morgan, James N. (1985), “Comparing Static and Dynamic Estimates of Behavioral Responses to Changes in Family Composition or Income”, Journal of Consumer Research,12:83-89.
Morgan, James N. (1990), “A Conditional Analysis of Movers Housing Responses”, Journal of Economic Behavior and Organization, 14:79-95.
Morgan, James N. and John A. Sonquist (1963), “Problems in the Analysis of Survey Data, and a Proposal”, Journal of the American Statistical Association, 58:415-435.
Morgan, James N., Ismail Sirageldin, and Nancy Baerwaldt (1966), Productive Americans, Ann Arbor, Mich.: Institute for Social Research.
Sonquist, John A., Elizabeth Laud Baker, and James N. Morgan (1973), Searching for Structure, (Rev. Ed.), Ann Arbor, Mich.: Institute for Social Research.
Yule, G. Udny, and M. G. Kendall (1937), Introduction to the Theory of Statistics, London: Griffin.