Zero-inflated ordered logit model


Stata’s new ziologit command fits zero-inflated ordered logit models.

Ordered logit regression is used to model ordered categorical responses, such as symptom severity recorded as none, mild, moderate, or severe. Larger values of such ordered outcomes represent higher levels, but the numeric value is irrelevant.

In some situations, the data contain more zeros (or more values in the lowest category) than a traditional ordered logit model would predict. A zero might represent the absence of a trait while the remaining values represent increasing levels of the trait. Many zeros may be observed, some because the individual does not have the trait and some because the individual has the trait but exhibits the lowest level. For example:

In a study of alcohol consumption, some individuals report no consumption because they never drink alcohol while others may report no alcohol consumption because they did not drink in the survey period.

In a clinical trial of a treatment intended to shrink tumors, outcomes represent no improvement, partial response, or complete response. An individual may show no improvement because the tumor is resistant to treatment or because the tumor was treatable but did not shrink at the time of measurement. The distinction is important because treatable tumors are good candidates for a higher dose.

In contexts such as these, you can use a zero-inflated ordered logit (ZIOL) model. ZIOL models assume that the lowest-valued outcomes come from both a logit model and an ordered logit model, allowing different sets of predictors for each model.
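The two-part structure described above can be written out as a short sketch. The notation here is an assumption made for illustration and does not come from the text: outcomes are coded 0 through H, Λ(v) = 1/(1 + exp(−v)) is the logistic distribution function, zγ is the linear predictor of the logit model, xβ is the linear predictor of the ordered logit model, and κ_1 < … < κ_H are the cutpoints.

```latex
\begin{aligned}
\Pr(y = 0 \mid x, z) &= \bigl[1 - \Lambda(z\gamma)\bigr]
                        + \Lambda(z\gamma)\,\Lambda(\kappa_1 - x\beta) \\
\Pr(y = h \mid x, z) &= \Lambda(z\gamma)\,
                        \bigl[\Lambda(\kappa_{h+1} - x\beta) - \Lambda(\kappa_h - x\beta)\bigr],
                        \qquad 0 < h < H \\
\Pr(y = H \mid x, z) &= \Lambda(z\gamma)\,\bigl[1 - \Lambda(\kappa_H - x\beta)\bigr]
\end{aligned}
```

The first line shows the two sources of zeros: individuals without the trait, with probability 1 − Λ(zγ), and individuals with the trait who fall in the lowest ordered category.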

Highlights

Model ordinal data with an overabundance of responses in the lowest category
Use a logit model to identify zero inflation and an ordered logit model for the ordinal response
Use a potentially different set of predictors for the logit and ordered logit model
Easily interpret findings using odds ratios and marginal probabilities
Support for Bayesian estimation
Robust, cluster–robust, and bootstrap standard errors
Support for complex survey designs

Let’s see it work

For this example, we will use fictional data on cigarette consumption.

. use https://www.stata-press.com/data/r17/tobacco

The outcome of interest, tobacco, represents daily cigarette consumption as an ordinal response with four levels:

. codebook tobacco


    
    tobacco                                                                   Tobacco usage
    
                                                                                           
                      Type: Numeric (byte)                                                 
                     Label: tobaclbl                                                       
                                                                                           
                     Range: [0,3]                         Units: 1                         
             Unique values: 4                         Missing .: 0/15,000                  
                                                                                           
                Tabulation: Freq.   Numeric  Label                                         
                            9,469         0  0 cigarettes                                  
                            3,806         1  1–7 cigarettes/day                            
                            1,050         2  8–12 cigarettes/day                           
                              675         3  >12 cigarettes/day

More than half of the respondents (9,469 of 15,000, or about 63%) reported no cigarette consumption. We suspect that these respondents belong to one of two groups: nonsmokers and would-be smokers with no current smoking activity. A traditional ordered logit regression can model the level of cigarette consumption among smokers, but it cannot distinguish between the two groups of respondents who reported no cigarette consumption. The ZIOL model introduces the concept of susceptibility to smoking, wherein smokers (both active and would-be) are susceptible to smoking, while genuine nonsmokers are not susceptible to smoking. To allow for the possibility of genuine nonsmokers, we choose the ZIOL model over the traditional ordered logit model.

We will use ziologit to simultaneously model the level of cigarette consumption and the probability of being a smoker. To model the level of cigarette consumption, we include predictors in the ziologit command directly after the dependent variable tobacco. To model the probability of being a smoker, we include predictors in the inflate() option, so named because it is used to model zero inflation. The inflate() option is required because excluding it would be tantamount to fitting a traditional ordered logit model.

Suppose that we want to regress the level of cigarette consumption on years of education (education), income in $10,000s (income), and gender (female), while we want to model the probability of being a smoker with independent variables education and income, as well as a variable indicating whether either of the respondent’s parents smoked (parent).

We could fit this model using the following command:

. ziologit tobacco education income i.female, inflate(income education i.parent)

Iteration 0:   log likelihood = -15977.364  (not concave)
Iteration 1:   log likelihood =  -13149.83  (not concave)
Iteration 2:   log likelihood = -12467.245
Iteration 3:   log likelihood = -11039.218
Iteration 4:   log likelihood = -9929.2298
Iteration 5:   log likelihood = -9715.1143
Iteration 6:   log likelihood = -9703.2464
Iteration 7:   log likelihood = -9703.2168
Iteration 8:   log likelihood = -9703.2168

Zero-inflated ordered logit regression                 Number of obs =  15,000
                                                       Wald chi2(3)  = 3147.70
Log likelihood = -9703.2168                            Prob > chi2   =  0.0000



     tobacco   Coefficient  Std. err.      z    P>|z|     [95% conf. interval]

tobacco                                                                       
   education     .5090816   .0094838    53.68   0.000     .4904938    .5276695
      income      .583636   .0114401    51.02   0.000     .5612139    .6060581
                                                                              
      female                                                                  
     Female     -.5307721   .0580736    -9.14   0.000    -.6445943   -.4169499

inflate                                                                       
      income    -.1279677     .00705   -18.15   0.000    -.1417856   -.1141499
   education    -.1412459   .0049693   -28.42   0.000    -.1509855   -.1315062
                                                                              
      parent                                                                  
    Smoking      1.187864   .0529432    22.44   0.000     1.084097     1.29163
       _cons     2.617219   .1156891    22.62   0.000     2.390473    2.843966

       /cut1      5.85957    .104449                      5.654853    6.064286
       /cut2     11.14187   .1945483                      10.76056    11.52318
       /cut3      14.3632   .2495117                      13.87417    14.85224

There are three sections to the results table. The first section, labeled tobacco, contains coefficients from the ordered logit model for the level of cigarette consumption. The second section, labeled inflate, contains coefficients from the logit model for the probability of being a smoker. The third section contains the cutpoints from the ordered logit model.
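To see how the three sections combine, here is a minimal Python sketch (not Stata code) that plugs the point estimates from the table into the usual ZIOL probability formulas for one hypothetical respondent. The function name ziol_probabilities and the covariate values (12 years of education, income of 5, male, a parent who smoked) are assumptions chosen for illustration; only the coefficients and cutpoints are taken from the output.

```python
import math


def logistic(v):
    """Logistic distribution function, 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


# Point estimates copied from the ziologit output.
BETA = {"education": 0.5090816, "income": 0.583636, "female": -0.5307721}
GAMMA = {"income": -0.1279677, "education": -0.1412459,
         "parent": 1.187864, "_cons": 2.617219}
CUTS = [5.85957, 11.14187, 14.3632]


def ziol_probabilities(education, income, female, parent):
    """Return (ps, conditional probs, marginal probs) for one covariate pattern.

    Hypothetical helper for illustration; it is not part of Stata.
    """
    # Linear predictor of the ordered logit (tobacco) equation.
    xb = (BETA["education"] * education + BETA["income"] * income
          + BETA["female"] * female)
    # Linear predictor of the logit (inflate) equation.
    zg = (GAMMA["_cons"] + GAMMA["income"] * income
          + GAMMA["education"] * education + GAMMA["parent"] * parent)

    ps = logistic(zg)  # probability of being susceptible to smoking

    # Ordered logit probabilities, conditional on susceptibility.
    cdf = [logistic(k - xb) for k in CUTS] + [1.0]
    cond = [cdf[0]] + [cdf[h] - cdf[h - 1] for h in range(1, len(cdf))]

    # Marginal probabilities: nonsusceptible respondents always report 0.
    marg = [ps * c for c in cond]
    marg[0] += 1.0 - ps
    return ps, cond, marg


ps, cond, marg = ziol_probabilities(education=12, income=5, female=0, parent=1)
print("Pr(susceptible)        =", round(ps, 4))
print("Pr(level | susceptible) =", [round(p, 4) for p in cond])
print("Pr(level)               =", [round(p, 4) for p in marg])
```

The marginal probability of level 0 is always at least as large as the product of ps and the conditional probability of level 0, because genuine nonsmokers add to the zeros.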

To interpret the first two sections of the results table, the coefficients can be exponentiated and reported as odds ratios with the or option.

. ziologit, or

Zero-inflated ordered logit regression                 Number of obs =  15,000
                                                       Wald chi2(3)  = 3147.70
Log likelihood = -9703.2168                            Prob > chi2   =  0.0000



     tobacco   Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]

tobacco                                                                       
   education     1.663763   .0157788    53.68   0.000     1.633122    1.694978
      income     1.792544   .0205068    51.02   0.000     1.752799    1.833191
                                                                              
      female                                                                  
     Female      .5881507    .034156    -9.14   0.000     .5248755     .659054

inflate                                                                       
      income     .8798818   .0062032   -18.15   0.000     .8678073    .8921242
   education     .8682758   .0043147   -28.42   0.000     .8598602    .8767738
                                                                              
      parent                                                                  
    Smoking      3.280066   .1736572    22.44   0.000     2.956768    3.638714
       _cons     13.69758   1.584661    22.62   0.000     10.91866    17.18378

       /cut1      5.85957    .104449                      5.654853    6.064286
       /cut2     11.14187   .1945483                      10.76056    11.52318
       /cut3      14.3632   .2495117                      13.87417    14.85224

Note: Estimates are transformed only in the first 2 equations.
Note: _cons estimates baseline odds.

Here we see that a $10,000 increase in annual income multiplies the odds of being a smoker by 0.88 (a 12% decrease in odds) but, among smokers, multiplies the odds of higher cigarette consumption by 1.79 (a 79% increase in odds). This suggests that wealthier individuals are less likely to smoke, but if they do decide to smoke, they tend to smoke more cigarettes.
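The odds ratios for income can be reproduced by hand from the coefficient table. The following Python sketch (not Stata code) exponentiates each coefficient, converts it to a percentage change in odds, and approximates the standard error of the odds ratio as the odds ratio times the coefficient's standard error. The helper name odds_ratio is an assumption for illustration, and small differences from the reported values are expected because the inputs are rounded.

```python
import math


def odds_ratio(coef, se):
    """Return (odds ratio, approximate std. err., percent change in odds).

    Hypothetical helper: the std. err. uses the delta-method approximation
    exp(b) * se(b).
    """
    ratio = math.exp(coef)
    return ratio, ratio * se, 100.0 * (ratio - 1.0)


# Income coefficients and standard errors copied from the ziologit output.
estimates = {
    "inflate: income": (-0.1279677, 0.00705),
    "tobacco: income": (0.583636, 0.0114401),
}

for label, (coef, se) in estimates.items():
    ratio, ratio_se, pct = odds_ratio(coef, se)
    print(f"{label}: OR = {ratio:.4f}, std. err. = {ratio_se:.4f}, "
          f"change in odds = {pct:+.0f}%")
```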

But what do these results really mean in terms of the probability of exhibiting different smoking behavior? We can use margins to answer different questions using the parameters of our model. Say we are interested in the relationship between cigarette consumption and income level. Because income is measured in $10,000s, the at(income=(0(5)20)) option below estimates the probabilities for each level of cigarette consumption at annual incomes of $0, $50,000, $100,000, $150,000, and $200,000.

. margins, at(income=(0(5)20))

Predictive margins                                       Number of obs = 15,000
Model VCE: OIM

1._predict : Pr(tobacco=0), predict(pmargin outcome(0))
2._predict : Pr(tobacco=1), predict(pmargin outcome(1))
3._predict : Pr(tobacco=2), predict(pmargin outcome(2))
4._predict : Pr(tobacco=3), predict(pmargin outcome(3))

1._at: income =  0
2._at: income =  5
3._at: income = 10
4._at: income = 15
5._at: income = 20



                          Delta-method                                        
                   Margin   std. err.      z    P>|z|     [95% conf. interval]

_predict#_at                                                                  
        1 1      .7428698   .0044443   167.15   0.000     .7341591    .7515805
        1 2      .6190759   .0038733   159.83   0.000     .6114843    .6266675
        1 3      .5168462   .0052057    99.29   0.000     .5066433    .5270492
        1 4       .526699   .0092168    57.15   0.000     .5086344    .5447636
        1 5      .6340465   .0138387    45.82   0.000     .6069232    .6611697
        2 1      .2121431   .0034296    61.86   0.000     .2054211    .2188651
        2 2      .2792459   .0033861    82.47   0.000     .2726092    .2858826
        2 3      .3042245   .0040212    75.65   0.000     .2963431     .312106
        2 4      .2226386   .0050478    44.11   0.000     .2127452     .232532
        2 5      .0633686   .0047963    13.21   0.000     .0539681    .0727692
        3 1      .0372614   .0014098    26.43   0.000     .0344983    .0400245
        3 2      .0737865   .0019981    36.93   0.000     .0698702    .0777027
        3 3      .1146585   .0029075    39.44   0.000     .1089599    .1203572
        3 4      .1351544   .0041403    32.64   0.000     .1270395    .1432693
        3 5       .138638   .0052133    26.59   0.000     .1284201    .1488559
        4 1      .0077257   .0005647    13.68   0.000     .0066189    .0088324
        4 2      .0278917   .0011614    24.01   0.000     .0256153     .030168
        4 3      .0642707    .002228    28.85   0.000     .0599038    .0686376
        4 4       .115508   .0045623    25.32   0.000     .1065661      .12445
        4 5      .1639469   .0085572    19.16   0.000      .147175    .1807188

In the table, _predict indexes the four outcome levels (tobacco = 0 through 3) and _at indexes the five income levels, so each row gives the expected probability of one consumption level at one annual income between $0 and $200,000.
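As a quick sanity check on the margins output, the Python sketch below (not Stata code) expands the numlist 0(5)20 used in the at() option, converts it to dollars, and confirms that the four predicted probabilities at each income level sum to 1 up to rounding. The helper name stata_numlist is an assumption for illustration; the margins are copied from the table.

```python
def stata_numlist(start, step, stop):
    """Expand a Stata-style numlist start(step)stop into a list of integers.

    Hypothetical helper covering only the simple integer case used here.
    """
    count = (stop - start) // step
    return [start + i * step for i in range(count + 1)]


income_levels = stata_numlist(0, 5, 20)          # income in $10,000s
dollars = [10_000 * v for v in income_levels]    # annual income in dollars

# Predictive margins copied from the output, ordered as
# Pr(tobacco=0), Pr(tobacco=1), Pr(tobacco=2), Pr(tobacco=3).
margins_by_income = {
    0:  [0.7428698, 0.2121431, 0.0372614, 0.0077257],
    5:  [0.6190759, 0.2792459, 0.0737865, 0.0278917],
    10: [0.5168462, 0.3042245, 0.1146585, 0.0642707],
    15: [0.5266990, 0.2226386, 0.1351544, 0.1155080],
    20: [0.6340465, 0.0633686, 0.1386380, 0.1639469],
}

for income, amount in zip(income_levels, dollars):
    total = sum(margins_by_income[income])
    print(f"income = {income:>2} (${amount:,}): probabilities sum to {total:.6f}")
```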

In the output table, there are many combinations of income and cigarette consumption levels. In such cases, it is more effective to present the results graphically. We can visualize the expected probabilities over all income levels by using marginsplot.

. marginsplot

The probability of smoking 0 cigarettes decreases as annual income increases until $100,000, then the probability gradually increases again. The probability of smoking 1–7 cigarettes/day is highest when earnings are $100,000 per year, and lowest when earnings are $200,000 per year.

After reviewing the overall probability of each outcome, we want to examine the relationship between income and the susceptibility to smoking. We use margins to calculate ps, the probability of susceptibility, at the same five levels of income.

. quietly margins, predict(ps) at(income=(0(5)20))

. marginsplot

When income is zero, four-fifths of respondents are either smokers or would-be smokers. The probability of being susceptible to smoking decreases with increasing income, with just over a third of respondents susceptible when earnings are $200,000 per year. This is consistent with the interpretation that income may act as a proxy for health consciousness.

Next we use margins to focus on subjects who are susceptible to smoking. By specifying statistic pcond1 along with each outcome level, we calculate the probability of each level of tobacco, conditional on susceptibility. As before, calculations are performed at five levels of income and graphed with marginsplot.

. quietly margins, predict(pcond1 outcome(0)) predict(pcond1 outcome(1)) predict(pcond1 outcome(2)) predict(pcond1 outcome(3)) at(income=(0(5)20))

. marginsplot

Well over half of the would-be smokers, when annual income is zero, report 0 cigarette consumption, and those that do consume cigarettes are most likely to smoke just a few cigarettes per day. As income increases, the probability of 0 consumption falls, with virtually all smokers expected to have positive cigarette consumption when earnings are $200,000 per year. Higher annual income is associated with a higher probability of being a heavy smoker: the probability of consuming 1–7 cigarettes per day begins to fall as annual income exceeds $100,000, while the probability of consuming >12 cigarettes per day increases with income to become the most common smoking outcome when income is highest. This suggests that, among smokers, cigarettes are treated as what economists call a normal good; that is, something for which demand increases when income increases.
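The three kinds of predictions used in this example (the marginal probability of reporting 0, the probability of susceptibility, and the probability of 0 conditional on susceptibility) are linked by a simple decomposition. The Python sketch below (not Stata code) illustrates it at zero income. The value 0.7428698 is the reported margin for Pr(tobacco=0) at income 0; the value 0.80 for susceptibility is only a rough reading of "four-fifths" from the text, not a reported estimate. The helper name split_zeros is hypothetical, and because margins average over respondents, applying the identity to averaged quantities is an approximation.

```python
def split_zeros(p_zero, p_susceptible):
    """Split the probability of reporting 0 into its two sources.

    Hypothetical helper. Uses Pr(0) = (1 - ps) + ps * Pr(0 | susceptible).
    """
    genuine = 1.0 - p_susceptible            # not susceptible: always 0
    susceptible_zero = p_zero - genuine      # susceptible but reporting 0
    if susceptible_zero < 0.0:
        raise ValueError("inputs are inconsistent: Pr(0) < 1 - ps")
    return {
        "genuine_nonsmoker": genuine,
        "susceptible_but_zero": susceptible_zero,
        "share_of_zeros_genuine": genuine / p_zero,
        "implied_pcond1_zero": susceptible_zero / p_susceptible,
    }


# 0.7428698 is taken from the margins table; 0.80 is an assumed approximation.
parts = split_zeros(p_zero=0.7428698, p_susceptible=0.80)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed inputs, the implied conditional probability of 0 is above one-half, which agrees with the description of the conditional probabilities at zero income.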

We can see from this example that the effect of income on cigarette consumption is multifaceted. The ziologit command makes it possible to model smoking susceptibility as well as smoking intensity, leading to a better understanding of the factors influencing smoking behavior.
