STATISTICS IN MEDICINE Statist. Med. 2007; 26:903–918 Published online 17 February 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/sim.2518

Confidence intervals for multinomial logistic regression in sparse data

Shelley B. Bull1,2,∗,†, Juan Pablo Lewinger1,3,‡,¶ and Sophia S. F. Lee1,2,§

1 Samuel Lunenfeld Research Institute, Prosserman Centre for Health Research, Mount Sinai Hospital, Toronto, Ont., Canada M5G 1X5
2 Department of Public Health Sciences, University of Toronto, Toronto, Ont., Canada
3 Department of Statistics, University of Toronto, Toronto, Ont., Canada

SUMMARY

Logistic regression is one of the most widely used regression models in practice, but alternatives to conventional maximum likelihood estimation methods may be more appropriate for small or sparse samples. Modification of the logistic regression score function to remove first-order bias is equivalent to penalizing the likelihood by the Jeffreys prior, and yields penalized maximum likelihood estimates (PLEs) that always exist, even in samples in which maximum likelihood estimates (MLEs) are infinite. PLEs are an attractive alternative in small-to-moderate-sized samples, and are preferred to exact conditional MLEs when there are continuous covariates. We present methods to construct confidence intervals (CI) in the penalized multinomial logistic regression model, and compare CI coverage and length for the PLE-based methods to that of conventional MLE-based methods in trinomial logistic regressions with both binary and continuous covariates. Based on simulation studies in sparse data sets, we recommend profile CIs over asymptotic Wald-type intervals for the PLEs in all cases. Furthermore, when finite sample bias and data separation are likely to occur, we prefer PLE profile CIs over MLE methods. Copyright © 2006 John Wiley & Sons, Ltd.

KEY WORDS: asymptotic bias; Bayesian estimates; bias reduction; continuous covariate; data separation; infinite estimates; Jeffreys prior; odds ratio; polychotomous logistic regression; polytomous logistic regression; small samples

∗ Correspondence to: S. B. Bull, Samuel Lunenfeld Research Institute, Prosserman Centre for Health Research, Lebovic Building 5th floor, Mount Sinai Hospital, 600 University Avenue, Toronto, Ont., Canada M5G 1X5.
† E-mail: [email protected]
‡ E-mail: [email protected]
§ E-mail: [email protected]
¶ Current address: Department of Preventive Medicine, University of Southern California, CA, U.S.A.

Contract/grant sponsor: Natural Sciences and Engineering Research Council of Canada
Contract/grant sponsor: Network for Centres of Excellence in Mathematics (Canada)
Contract/grant sponsor: Canadian Institutes of Health Research


Received 4 February 2005 Accepted 3 January 2006


1. INTRODUCTION

In finite samples, the usual maximum likelihood estimates (MLEs) of the log-odds-ratio parameters in logistic regression are biased, and there is a non-zero probability that an MLE is infinite, i.e. does not exist. Existence problems can occur when the data are sparse or when there are large covariate effects. This corresponds to the problem of separation, described by Albert and Anderson [1] and Lesaffre and Albert [2]. It is not unusual for separation to occur in small-to-moderate-sized data sets, especially in multinomial logistic regression when the number of parameters is large.

In exponential family models with canonical parameterization, which includes the binomial logistic model, Firth [3] showed that introducing bias into the score function to remove the order $n^{-1}$ bias of the MLEs is equivalent to penalizing the likelihood by the Jeffreys prior. Ibrahim and Laud [4] provided theoretical support for the use of the Jeffreys prior in Bayesian analysis of generalized linear models, including the existence of posterior moments, with an application to the binomial logistic model. The penalized likelihood has the advantage that estimates can be obtained in samples in which the MLEs are infinite, and is attractive for routine applications in logistic regressions of mixed binary and continuous covariates, in which exact methods may be difficult to apply due to overconditioning. For example, when one or more of the covariates is continuous and the number of unique covariate combinations is large, support for the exact conditional distribution of the sufficient statistic for the parameter of interest can become extremely discrete or even degenerate, resulting in little or no information left on which to base inferences. Bull et al. [5] extended Firth's modification to the multinomial logistic regression model, providing a general form for the penalized likelihood, and specified an algorithm for estimation of the regression parameters.
In general, the multinomial likelihood and corresponding regression parameter point and interval estimation cannot be reduced to a series of binomial models without incurring a loss of efficiency [6]. In particular, the information matrix includes entries corresponding to the multinomial probability covariances and the multinomial penalty function will not be a simple product of binomial penalties. As a result, the problem of fitting a multinomial logistic regression model with penalized likelihood methods cannot be circumvented by fitting several binomial models.

In small sample studies of binomial and trinomial logistic regression in a cohort study design with binary and normally distributed covariates, Bull et al. [5] found that the penalized maximum likelihood estimates (PLEs) were effectively unbiased, had smaller mean squared error (MSE) than the MLEs (with relative MSE as low as 30 per cent), and were more effective in reducing finite sample bias than alternative methods [7]. Heinze and Schemper [8] reported similar results from a simulation study of multiple binary covariates in binomial logistic regression analysis under a case–control design. They also found PLEs to be less biased than alternatives, including exact logistic regression estimates.

Asymptotically, the MLEs are normally distributed around the true parameter value with variance given by the inverse of the Fisher information matrix, but in finite samples the quadratic approximation to the log-likelihood may not apply [9]. Wald test statistics and confidence intervals (CIs) based on large sample standard errors can have poor properties when the parameter is far from zero [10]; CIs based on the profile likelihood are generally recommended in small samples [11]. The PLEs are likewise asymptotically distributed, and the first-order asymptotic covariance matrix of the PLEs is the same as that of the MLEs [3]. In finite samples, the PLEs always exist, so both Wald and profile-likelihood CIs can always be constructed. However, use of symmetric Wald-type CIs based on the PLEs may be ill-advised because the small samples in which PLEs are most useful will also be those in which the log-likelihood is not quadratic. Profile-likelihood CIs for infinite MLEs can be constructed with one finite and one infinite endpoint. In contrast, those for the PLEs, while also non-symmetric, have two finite endpoints.

Based on simulation studies of the penalized binomial logistic regression model with multiple binary covariates reported by Heinze [12], Heinze and Schemper [8] recommended profile CIs based on the penalized likelihood in situations with a high probability of separation, and found the Wald CIs to be satisfactory only for balanced covariate distributions and modest log odds ratios less than 1.4. In simulations comparing PLE and MLE profile 95 per cent CIs, also limited to the binomial model, Bull and Lewinger [13] reported close to nominal coverage for the PLE profile CIs for log odds ratios as large as 2, even in very small sample sizes of 25 with one binary and one continuous covariate, but in some cases found less than nominal coverage for the MLE profile CIs.

In this report, we extend methods for CI construction to the general multinomial logistic regression model, and compare the performance of the PLE-based methods to that of conventional MLE-based methods with respect to CI coverage and length. In Section 2, we begin with a review of notation and methods for point and interval estimation in the usual and penalized multinomial logistic likelihoods. In Section 3, we consider a problem in differential diagnosis with continuous covariates and revisit a sparse data set from a large disease prevention trial in which comparisons of alternative CIs reveal interesting differences. These differences are further investigated in Section 4, using finite sample simulations of trinomial logistic models with both binary and continuous covariates, focussing on scenarios in which the ratio of the sample size to the number of parameters is low. A closing discussion notes the relationship of the proposed general methods to classical methods for the case of a 2 × 2 table, and recommends the wider use of profile CIs for PLEs in sparse data.

2. INFERENCE FOR THE PENALIZED MULTINOMIAL LOGISTIC LIKELIHOOD

2.1. Estimation of regression parameters

We consider a multicategory outcome $y$ that is a multinomial variable with $J + 1$ categories. For each category $j$ ($j = 1, \ldots, J$) there is a regression function in which the log odds of response in category $j$, relative to category 0, is a linear function of regression parameters and a vector $\mathbf{x}$ of $P$ covariates (including a constant): $\log\{\mathrm{prob}(y = j \mid \mathbf{x})/\mathrm{prob}(y = 0 \mid \mathbf{x})\} = \boldsymbol{\beta}_j^T \mathbf{x}$. Let $\mathbf{y}_i$ be a $J \times 1$ vector of indicators for the observed response category for observation $i$, with the corresponding $J \times 1$ vector of probabilities $\boldsymbol{\pi}_i = (\pi_{i1}, \ldots, \pi_{iJ})^T$. The vector of MLEs, $\hat{\mathbf{B}} = \mathrm{vec}[(\hat{\boldsymbol{\beta}}_1, \ldots, \hat{\boldsymbol{\beta}}_J)^T]$, is estimated from observations $(\mathbf{y}_i, \mathbf{x}_i)$, $i = 1, \ldots, n$, by solving the score equations of the log-likelihood $l(\mathbf{B})$. We denote the score function by $U(\mathbf{B})$, and the $PJ \times PJ$ Fisher information matrix by $A = A(\mathbf{B})$.

The order $n^{-1}$ bias of estimates based on the usual likelihood $L(\mathbf{B})$ is removed by applying the penalty $|A|^{1/2}$, and basing estimation on the penalized likelihood $L^*(\mathbf{B}) = L(\mathbf{B})\,|A|^{1/2}$. The vector $\hat{\mathbf{B}}^*$ of penalized estimates (PLEs) is the solution to the score equations of the penalized log-likelihood $l^*(\mathbf{B}) = l(\mathbf{B}) + \tfrac{1}{2}\log|A|$. The introduction of bias into the score function through the penalty removes the leading term in the asymptotic bias of the MLEs. The modified score function proposed by Firth for the binomial logistic model extends directly to the multinomial model as $U^*(\mathbf{B}) = U(\mathbf{B}) - A\,b(\mathbf{B})$. The bias term $b(\mathbf{B})$ is the leading term in the asymptotic bias of the multinomial MLEs, obtained from the Taylor series expansion of the log-likelihood [14], and is a function of the matrix of third derivatives of $l(\mathbf{B})$ with respect to $\mathbf{B}$ [5]. The solution of $U^* = 0$ locates a stationary point of $l^*(\mathbf{B})$ which is equivalent to the penalized likelihood with the Jeffreys invariant prior as the penalty function [3]. Use of the Jeffreys prior shrinks estimates toward the point $\pi_{ij} = 1/(J + 1)$ which maximizes the determinant and corresponds to $\boldsymbol{\beta}_j = 0$.

Arguments for the existence and uniqueness of estimates in the binomial model extend to the multinomial model in a straightforward manner, and in general, the PLEs can be obtained by a modified scoring algorithm, as described previously [5]. With increasing sample size, the effect of the penalty diminishes and the PLEs approach the MLEs. Other penalty functions of the form $|A|^c$ could also be applied to the likelihood; Greenland [15] demonstrates these for the conditional logistic model. Values of $c > \tfrac{1}{2}$ correspond to priors stronger than Jeffreys, further reducing MSE at the cost of introducing negative bias on the log-odds-ratio scale; we do not consider these further here.
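As an illustration of the penalized log-likelihood of Section 2.1, the following Python sketch evaluates $l^*(\mathbf{B}) = l(\mathbf{B}) + \tfrac{1}{2}\log|A|$ for a multinomial logistic model. It is not the GAUSS implementation used in this report. The function names (`log_probabilities`, `fisher_information`, `penalized_loglik`) and the array layout are assumptions made here for illustration: row $j$ of `B` holds $\boldsymbol{\beta}_j$, and `Y` holds the indicators for categories 1 to $J$.

```python
import numpy as np

def log_probabilities(B, X):
    """Log category probabilities, shape (n, J + 1); column 0 is the reference category."""
    # Linear predictors, with 0 for the reference category (hypothetical layout).
    eta = np.column_stack([np.zeros(X.shape[0]), X @ B.T])
    m = eta.max(axis=1, keepdims=True)  # stabilise the log-sum-exp
    return eta - (m + np.log(np.exp(eta - m).sum(axis=1, keepdims=True)))

def fisher_information(B, X):
    """PJ x PJ information matrix: sum over i of (diag(pi_i) - pi_i pi_i^T) kron (x_i x_i^T)."""
    n, P = X.shape
    J = B.shape[0]
    pi = np.exp(log_probabilities(B, X))[:, 1:]
    A = np.zeros((J * P, J * P))
    for i in range(n):
        W = np.diag(pi[i]) - np.outer(pi[i], pi[i])  # multinomial covariance for observation i
        A += np.kron(W, np.outer(X[i], X[i]))
    return A

def penalized_loglik(B, X, Y):
    """l*(B) = l(B) + 0.5 * log|A(B)|, with Y an (n, J) indicator matrix for categories 1..J."""
    logp = log_probabilities(B, X)
    y_full = np.column_stack([1.0 - Y.sum(axis=1), Y])  # prepend the category 0 indicator
    loglik = np.sum(y_full * logp)
    _, logdet = np.linalg.slogdet(fisher_information(B, X))
    return loglik + 0.5 * logdet
```

A maximizer of `penalized_loglik` over `B` would give the PLEs; any general-purpose optimizer could stand in for the modified scoring algorithm of Reference [5], which this sketch does not reproduce.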

2.2. Confidence interval construction for regression parameters

Under asymptotic normality of the MLEs, the estimated large sample variance–covariance matrix, $\mathrm{Var}(\hat{\mathbf{B}})$, is obtained by evaluating the inverse of the Fisher information, $A^{-1}$, at the MLEs. An asymptotic two-sided $100(1 - \alpha)$ per cent CI for a single parameter $\beta_{jp}$ corresponds to inversion of a 1 degree of freedom (df) family of Wald tests of $H_0: \beta_{jp} = s$, which yields the symmetric interval $\hat{\beta}_{jp} \pm z_{\alpha/2}\,\mathrm{Var}^{1/2}(\hat{\beta}_{jp})$. In practice, this CI is frequently applied in finite samples. When there is separation in a data set, and the MLE lies on the boundary of the parameter space, the Wald CI for the corresponding infinite component of the MLE is uninformative, and can be defined as the entire real line $(-\infty, +\infty)$ [16].

The likelihood ratio test has better properties than the Wald test when the normality of the MLE is in doubt, and CIs based on inversion of the likelihood ratio statistic are not necessarily symmetric, reflecting any departures from a quadratic log-likelihood [11]. When there is separation in a data set, the profile-likelihood CI for the corresponding infinite component of the MLE has the form $(-\infty, u)$ or $(l, +\infty)$.

Asymmetric CIs for the PLEs can be constructed from the profile log-likelihood for the parameter $\beta_{jp}$, which is the function $l_0^*(\mathbf{B}(s))$, where $\mathbf{B}(s)$ is the argument that maximizes $l^*$ under the single-parameter constraint $H_0: \beta_{jp} = s$. The $100(1 - \alpha)$ per cent CI for $\beta_{jp}$ is given by all parameter estimate values that are compatible with the data, i.e. all $s$ such that the likelihood ratio statistic $LR_P(s) \le q$, where $q$ is the $(1 - \alpha)$ percentile of the $\chi^2$ distribution. This is equivalent to $l_0^*(\mathbf{B}(s)) \ge \{l^*(\hat{\mathbf{B}}^*) - \tfrac{1}{2}q\}$. The endpoints of the interval are then found by numerically solving the equality for values of $s$.

Based on the algorithm employed in SAS PROC LOGISTIC for MLEs [17], our preferred method for finding these roots does not require computing $l_0^*(\mathbf{B}(s))$, which in itself would involve maximizing $l^*(\mathbf{B})$, but proceeds directly by solving the constrained optimization problem: maximize $l^*(\mathbf{B})$ such that $l^*(\mathbf{B}) = \{l^*(\hat{\mathbf{B}}^*) - \tfrac{1}{2}q\}$ and $\beta_{jp} = s$. We, however, modify the starting values in the iterative scheme used by SAS, to obtain a new algorithm that is slower, but simple and more robust (see the Appendix for details). We implemented the parameter estimation and CI methods using the matrix programming language GAUSS [18].
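The profile interval defined above can be sketched in Python for the simpler binomial case. This is a deliberately naive version, not the modified SAS-type algorithm preferred in this report: it computes the profile $l_0^*(\mathbf{B}(s))$ by re-maximizing over the remaining parameters at each $s$ and finds the endpoints by bracketing and root-finding. The function names (`penalized_loglik`, `profile_ci`), the choice of BFGS and Brent's method, and the requirement of at least two parameters are assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import brentq, minimize
from scipy.special import expit
from scipy.stats import chi2

def penalized_loglik(beta, X, y):
    """Binomial l*(beta) = l(beta) + 0.5 * log|A|, with A = X^T diag(p(1 - p)) X."""
    eta = X @ beta
    p = expit(eta)
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    A = (X * (p * (1.0 - p))[:, None]).T @ X
    return loglik + 0.5 * np.linalg.slogdet(A)[1]

def profile_ci(X, y, k, level=0.95):
    """PLE of beta_k and its profile penalized-likelihood CI (needs at least 2 parameters)."""
    P = X.shape[1]
    fit = minimize(lambda b: -penalized_loglik(b, X, y), np.zeros(P), method="BFGS")
    b_hat, l_max = fit.x, -fit.fun
    cutoff = l_max - 0.5 * chi2.ppf(level, df=1)  # l*(B-hat*) - q/2
    free = [j for j in range(P) if j != k]

    def gap(s):
        """Profile penalized log-likelihood at beta_k = s, minus the cutoff."""
        def neg(b_free):
            b = np.empty(P)
            b[k], b[free] = s, b_free
            return -penalized_loglik(b, X, y)
        return -minimize(neg, b_hat[free], method="BFGS").fun - cutoff

    def endpoint(direction):
        # Double the step until the profile drops below the cutoff, then solve gap(s) = 0.
        step = 1.0
        while gap(b_hat[k] + direction * step) > 0.0:
            step *= 2.0
        lo, hi = sorted([b_hat[k], b_hat[k] + direction * step])
        return brentq(gap, lo, hi)

    return b_hat[k], endpoint(-1.0), endpoint(+1.0)
```

For a 2 × 2 table with an empty cell, for example 0 of 10 events when the binary covariate is 0 and 5 of 10 when it is 1, the MLE of the slope is infinite, whereas this sketch returns a finite estimate with two finite endpoints. Re-maximizing at every trial value of $s$ is exactly the cost that the constrained formulation described above avoids.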


The information matrix $A^*$ for the penalized likelihood is considerably more complex than the information matrix $A$, due to the additional higher-order terms arising from the log-likelihood penalty (see details in Reference [19]). These higher-order terms disappear as the sample size increases, but in small samples $\mathrm{Var}^*(\hat{\beta}^*_{jp})$, obtained from the inverse of $A^*$ evaluated at the PLEs, is generally smaller than $\mathrm{Var}(\hat{\beta}^*_{jp})$ obtained from $A$; this is a strict inequality in the case of a single binary covariate. Although a symmetric Wald-type CI for a single parameter $\beta_{jp}$ can be constructed using $\mathrm{Var}^{*1/2}(\hat{\beta}^*_{jp})$, its performance is expected to be poor in situations where separation is likely to occur.

3. APPLICATIONS

In data described by Albert and Harris [20], the outcome variable represents three groups of patients with differing hepatitis diagnoses. We specify two regression functions in the multinomial logistic model: one regression compares patients with acute viral hepatitis to those with persistent chronic disease; the other compares patients with aggressive chronic disease to those with persistent chronic disease. Here, we focus on two of the quantitative liver function analytes measured to discriminate among the groups, using standardized versions of the natural logs of AST and GlDH as covariates. To illustrate the consequences of near separation on parameter estimates and CIs when the covariates are continuous, we have deleted 10 observations from the original sample of 141. While the acute viral and persistent chronic groups do not overlap in the two-dimensional covariate space, the aggressive chronic group does overlap both of the others, so that the multinomial model estimation based on this data set does not suffer from partial separation as defined by Lesaffre and Albert [2]. However, if a binomial model is fit to the acute viral and persistent chronic groups alone, data separation leads to infinite estimates for both covariates.

In the multinomial model fitting (Table I), we observe that in comparison to the MLEs, the PLEs are smaller in magnitude and their CIs are shorter in length. The MLE results are questionable for the acute versus persistent group comparison, and finding the MLE profile CI endpoints can be difficult in that convergence of the algorithm is sensitive to the starting values and to the variable transformation applied to the AST and GlDH analyte covariates. For both MLEs and PLEs, there are rather large differences between the multinomial and binomial model estimates (Table I). In the case of separation, as seen in the binomial model for the acute versus persistent group comparison, the PLE Wald∗ and profile CIs can be quite different, including zero in the Wald∗ interval but excluding it in the profile interval.

As a second illustration, we return to a data set examined previously [5], and apply the CI methods to a trinomial outcome from a multicentre trial of a population intervention designed to prevent post-transfusion hepatitis [21]. The preventive intervention (treatment factor) under evaluation was the screening of donor units for two surrogate markers of non-A, non-B hepatitis infection. Blood transfusion recipients were randomized to receive units from one of the two sources: from the general blood supply or from a supply that excluded units positive for the surrogate markers. In addition, while the trial was on-going, there was a change in national blood screening policy whereby a new test was introduced to screen all units for hepatitis C antibodies (time factor). This had the effect of decreasing the incidence of hepatitis C. To evaluate whether the intervention was equally effective before and after the


Table I. Conventional and penalized maximum likelihood estimates (95 per cent CIs) for the differential diagnosis data.

                  Multinomial                        Binomial
                  AST analyte∗     GlDH analyte∗     AST analyte∗      GlDH analyte∗

(a) Acute viral hepatitis (n = 54) versus persistent chronic hepatitis (n = 37)
MLE               6.58             2.06              +∞                +∞
  Wald CI         (3.20, 9.96)     (−0.03, 4.15)     (−∞, +∞)          (−∞, +∞)
  Profile CI      (3.88, 10.85)    (0.29, 2.06)      (153.7, +∞)       (54.69, +∞)
PLE               5.51             1.64              22.45             7.71
  Wald∗ CI        (2.87, 8.16)     (−0.08, 3.36)     (−5.45, 50.35)    (−1.61, 17.04)
  Profile CI      (3.23, 9.14)     (0.08, 3.91)      (3.67, 74.50)     (1.15, 25.78)

(b) Aggressive chronic hepatitis (n = 40) versus persistent chronic hepatitis (n = 37)
MLE               5.45             3.75              5.06              2.49
  Wald CI         (2.08, 8.82)     (1.59, 5.90)      (1.70, 8.42)      (0.52, 4.46)
  Profile CI      (2.75, 8.60)     (1.91, 3.75)      (2.39, 9.38)      (0.85, 4.97)
PLE               4.44             3.24              4.08              2.03
  Wald∗ CI        (1.81, 7.08)     (1.45, 5.03)      (1.49, 6.67)      (0.46, 3.60)
  Profile CI      (2.17, 8.07)     (1.61, 5.55)      (1.87, 7.71)      (0.62, 4.14)

∗ ln of analyte standardized by covariate mean and variance.

Table II. Hepatitis prevention trial data.

                      Hepatitis outcome
                      C        non-ABC    No disease
Time 1   Treated      0        2          400
         Untreated    5        3          389
Time 2   Treated      3        10         1896
         Untreated    5        11         1864

change in screening policy, a test for interaction between the treatment and time factors was of interest. Although the total sample size is large, the disease outcomes are rare, producing an empty cell in one subgroup (Table II). As a result, in the model with an interaction between time and treatment, the usual logistic regression MLEs are infinite for two of the parameters (Table III). The corresponding Wald CIs, which are undefined, are set to be the entire real line. The profile-likelihood CIs for the infinite MLEs have one infinite endpoint, indicating that we cannot rule out a coefficient of $\beta = +\infty$ for the interaction, and $\beta = -\infty$ for the treatment effect. This corresponds to the possibility that a person with a given covariate value can be affected or unaffected with certainty, which is implausible in most cases, and in this study more likely reflects the low frequency of the hepatitis C outcome. The PLEs, however, can be obtained for all parameters, and in contrast to the MLEs, the Wald∗ and the profile-likelihood CIs for the PLEs have two finite endpoints. As demonstrated graphically in


Table III. Conventional and penalized maximum likelihood estimates (95 per cent CIs) for the prevention trial data.

                  Treatment          Time period         Treatment by time

(a) Hepatitis C outcome
MLE               −∞                 −1.57               +∞
  Wald CI         (−∞, +∞)           (−2.81, −0.32)      (−∞, +∞)
  Profile CI      (−∞, −0.78)        (−2.85, −0.28)      (−0.15, +∞)
PLE               −2.43              −1.57               1.96
  Wald∗ CI        (−5.33, 0.48)      (−2.75, −0.38)      (−1.24, 5.15)
  Profile CI      (−7.30, −0.24)     (−2.79, −0.34)      (−0.70, 6.95)

(b) Hepatitis non-ABC outcome
MLE               −0.43              −0.27               0.32
  Wald CI         (−2.23, 1.36)      (−1.55, 1.01)       (−1.67, 2.31)
  Profile CI      (−2.46, 1.37)      (−1.44, 1.22)       (−1.67, 2.50)
PLE               −0.36              −0.38               0.26
  Wald∗ CI        (−1.99, 1.27)      (−1.58, 0.83)       (−1.58, 2.09)
  Profile CI      (−2.16, 1.28)      (−1.49, 0.99)       (−1.58, 2.22)

Figure 1. Depiction of the MLE and PLE profile likelihoods and the PLE quadratic approximation for the treatment effect parameter for the hepatitis C outcome. The scale on the vertical axis in this figure corresponds to $\{l^*(\mathbf{B}) - (l^*(\hat{\mathbf{B}}^*) - \tfrac{1}{2}q)\}$ for the PLE profile likelihood (as defined in Section 2.2), where $q$ is the $(1 - \alpha)$ percentile of a $\chi^2$ (1 df) distribution, with $\alpha = 0.05$. The intersection with the horizontal axis thus yields the 95 per cent interval endpoints. The endpoints for the 95 per cent PLE Wald∗ interval are the solution to $\{(\hat{\beta}^* - \beta)^2/\mathrm{Var}^*(\hat{\beta}^*)\} - q = 0$, based on a quadratic approximation to the log-likelihood. The MLE profile likelihood is plotted similarly, but intersects the horizontal axis only at the upper CI endpoint.

Figure 1, the PLE profile CI for the simple treatment effect for hepatitis C in the first time period excludes $\beta = 0$, while the corresponding Wald∗ CI based on a quadratic approximation does not, consistent with our expectation that these CIs have different coverage properties in sparse data. Simulation results reported in the following section confirm that the PLE profile interval is more accurate than the PLE Wald∗ interval, and find that the MLE profile and Wald intervals can be unsatisfactory in sparse data.
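The infinite MLEs in Table III can be traced directly to the empty cell in Table II. The model with treatment, time and their interaction is saturated for the four time-by-treatment subgroups, so each MLE is a log odds ratio of cell counts. The Python sketch below computes these, assuming (this coding is not stated explicitly in the text) that untreated recipients in time period 1 form the reference group; the names `counts`, `log_odds` and `saturated_mle` are introduced here for illustration only.

```python
import math

# Table II counts: (time, treatment) -> (hepatitis C, non-ABC, no disease)
counts = {
    ("time1", "treated"): (0, 2, 400),
    ("time1", "untreated"): (5, 3, 389),
    ("time2", "treated"): (3, 10, 1896),
    ("time2", "untreated"): (5, 11, 1864),
}

def log_odds(cases, no_disease):
    """Log odds of an outcome relative to no disease; minus infinity for an empty cell."""
    return -math.inf if cases == 0 else math.log(cases / no_disease)

def saturated_mle(outcome_index):
    """MLEs (treatment, time, interaction) for one outcome under the assumed reference coding."""
    lo = {cell: log_odds(c[outcome_index], c[2]) for cell, c in counts.items()}
    ref = lo[("time1", "untreated")]
    treatment = lo[("time1", "treated")] - ref       # treatment effect in time period 1
    time = lo[("time2", "untreated")] - ref          # time effect among the untreated
    interaction = (lo[("time2", "treated")] - lo[("time2", "untreated")]) - treatment
    return treatment, time, interaction

hepatitis_c = saturated_mle(0)
non_abc = saturated_mle(1)
```

Under this coding the hepatitis C treatment effect is $-\infty$ and the interaction is $+\infty$, while the non-ABC values agree with the finite MLEs reported in Table III to two decimal places.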

4. A MONTE CARLO SIMULATION STUDY

4.1. Design

The purpose of the simulation study was to evaluate the properties of the profile and Wald∗ CIs for the PLEs, and to compare their coverage and length to the CIs for the usual MLEs in multinomial logistic regressions that included correlated binary and continuous covariates. We programmed the simulations in the matrix language GAUSS. To detect separations, we applied a general algorithm adapted from others and used in previous studies [2, 5, 7].

A series of simulations was conducted over a range of sample sizes and parameter values for regressions with one binary and one continuous covariate under a cross-sectional cohort design. We generated the response category by comparing the probabilities calculated from the linear predictor(s), $\boldsymbol{\beta}_j^T \mathbf{x}_i$, to a uniform random number. To generate a covariate vector $\mathbf{x}_i$ for each observation in a data set, we first generated variates from a bivariate normal distribution with zero means, unit variances, and a correlation of 0.8, followed by dichotomization at zero to produce a binary covariate. The binary-continuous covariate correlation induced by dichotomization was estimated to be 0.6 from the simulated covariate vectors. Data sets generated from a model with positively correlated covariates are more likely to show separation than those generated from the same model but with uncorrelated covariates.

In a preliminary set of simulations, based on previously studied trinomial logistic regression models [5, Table 6], we examined a range of slope parameter values (data not shown). We tabulated coverage for two-sided 95 per cent CIs, and for upper and lower one-sided 97.5 per cent CIs, as well as the median length of the two-sided CIs. These quantities were also tabulated in the subset of data sets in which all the MLEs were finite, in order to observe differences in the treatment of data sets with separation.
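The data-generating steps described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the GAUSS code used for the study: the function name `simulate_dataset`, the choice of which normal variate is dichotomized, and the use of cumulative probabilities to map the uniform number to a category are all assumptions of the sketch.

```python
import numpy as np

def simulate_dataset(n, B, rng, latent_corr=0.8):
    """Simulate one data set; B has rows (intercept, binary slope, continuous slope) per category."""
    cov = [[1.0, latent_corr], [latent_corr, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # zero means, unit variances
    x1 = (z[:, 0] > 0.0).astype(float)                     # dichotomize the first variate at zero
    x2 = z[:, 1]                                           # keep the second variate continuous
    X = np.column_stack([np.ones(n), x1, x2])

    # Category probabilities from the linear predictors; category 0 is the reference.
    eta = np.column_stack([np.zeros(n), X @ B.T])
    prob = np.exp(eta - eta.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)

    # Compare a uniform random number with the cumulative probabilities.
    u = rng.uniform(size=n)
    y = (u[:, None] > np.cumsum(prob, axis=1)[:, :-1]).sum(axis=1)
    return X, y

rng = np.random.default_rng(2006)
model_with_zero_slopes = np.array([[-1.4, 0.0, 0.0], [-1.4, 0.0, 0.0]])
X, y = simulate_dataset(50, model_with_zero_slopes, rng)
```

With a latent correlation of 0.8, the correlation between the dichotomized and continuous covariates in large simulated samples comes out near 0.64, in line with the value of about 0.6 reported above.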
To further explore the relationship between coverage and length, as well as the distributions of estimates, their standard errors, CI endpoints, and CI length, we conducted a second set of larger simulations, each with 10 000 replications that focussed on two selected trinomial regression models ($J = 2$ with one binary ($x_1$), one normal covariate ($x_2$) and correlation$(x_1, x_2) = 0.6$). In Model 1, the slope parameters were set to zero in both regressions: $\boldsymbol{\beta}_1^T = (-1.4, 0, 0)$, $\boldsymbol{\beta}_2^T = (-1.4, 0, 0)$. In Model 2, large non-zero slope parameter values of 2.0 and 1.0 were specified for the binary and continuous covariates, respectively: $\boldsymbol{\beta}_1^T = (-1.4, 2.0, 1.0)$, $\boldsymbol{\beta}_2^T = (-1.4, 2.0, 1.0)$. We examined sample sizes $n = 25, 50, 100, 200$. With 10 000 replications, the precision of the coverage estimates is such that values greater than 95.44 per cent (less than 94.56 per cent) are significantly different from the nominal 95 per cent, leading to a conclusion of over (under) coverage on average. For 8700 replications, the corresponding values are 95.46 per cent (94.53 per cent).

We used exact tests for marginal homogeneity [22, 23] to compare the coverage properties of the PLE profile interval to the PLE Wald∗ and the MLE profile intervals, assuming ordered categories corresponding to the true parameter being less than the lower CI endpoint, within the CI, or greater than the upper CI endpoint. To provide additional insight into situations in which discrepancies occurred among the CI methods, we stratified the results for each parameter into deciles of 1000 replicates sorted by the value of the PLE, and tabulated summary statistics within each decile. Because it was usually one of the binary covariate parameter estimates that produced separation in a data set, the summaries presented mainly focus on results for the binary covariate estimated in the presence of a correlated normally distributed covariate.

4.2. Results

To illustrate the general patterns observed in the simulations, we present detailed results for a sample size of 50 in Models 1 and 2 (Table IV(a) and (b), respectively) with density plots of distributions across replications for Model 2 (Figure 2). Under Model 1, the true slope parameter values are all zero, while under Model 2, they are all non-zero. Inspection of these and companion tables reveals that:

(1) As expected, the MLEs and PLEs are both unbiased when the slope parameter is zero, but only the PLEs are unbiased when the parameter is large (Table IV). In the latter case (Figure 2), a proportion of the data sets with MLEs, although not meeting criteria for separation, have log-odds-ratio estimates greater than 15 (and correspondingly extreme standard errors). Data sets with separations tend to yield PLEs with large values compared to data sets with finite MLEs (Figure 2).

(2) Coverage is greater than nominal for the MLE Wald and the PLE Wald∗ intervals (Table IV). When the parameter is zero and the log-likelihoods are roughly symmetric, the proportion of data sets in which the 95 per cent CI excludes the parameter value is roughly equal at the lower and upper CI endpoints, but less than 2.5 per cent. When the parameter is large and positive, it is excluded close to or greater than 2.5 per cent of the time at the upper endpoint, but at the lower endpoint the proportion can be less than 0.5 per cent. Both the MLE and PLE Wald-type intervals therefore fail to exclude small parameter values as being consistent with the observed data.

(3) The PLE profile interval yields nominal or slightly higher than nominal 95 per cent coverage, with closest agreement when all data sets are considered (Table IV). When the parameter is zero, the proportion of data sets in which the 95 per cent CI excludes the parameter value is roughly equal at the lower and upper CI endpoints, and close to 2.5 per cent. When the parameter is large and positive, it is excluded close to 2.5 per cent of the time at the upper endpoint, but somewhat less than 2 per cent of the time at the lower endpoint. Systematic underexclusion of the parameter at the lower endpoint of the Wald∗ interval is quite severe (test for marginal homogeneity p = 0.001, see Reference [19] for details), consistent with failure of the log-likelihood quadratic approximation.

(4) Coverage is generally close to nominal for the MLE profile interval, but can fall below nominal in some cases, particularly for the correlated normal covariate when the true parameters for both covariates are large (Table IV(b)).

(5) When the true parameter for the binary covariate is zero, the coverage of the PLE profile interval does not differ significantly from that of the corresponding MLE profile interval (marginal homogeneity p = 0.44). However, when the true parameter is large, the MLE profile interval overestimates the parameter at the upper endpoint (marginal homogeneity p = 0.0001, see Reference [19]). This appears to be at least partly due to use of an open-ended interval when the MLE is infinite. Most disagreements occurred


Table IV. Simulation results (10 000 replicates) for binary and normal covariate slope parameters in $\boldsymbol{\beta}_1^T$ (n = 50).

(a) Model 1 with parameters $\boldsymbol{\beta}_1^T = (-1.4, 0, 0)$, $\boldsymbol{\beta}_2^T = (-1.4, 0, 0)$

Binary covariate ($\beta = 0.0$)

Data sets                                Method        Mean $\hat{\beta}$ (Median)   Per cent CI coverage   % < Lower   % > Upper   Median length
All data sets (n = 10 000)               MLE Wald      — (−0.001)                    96.90                  1.65        1.45        4.34
                                         MLE Profile                                 95.19                  2.67        2.14        4.47
                                         PLE Wald∗     0.006 (−0.009)                97.29                  1.47        1.24        4.03
                                         PLE Profile                                 95.61                  2.39        2.00        4.14
Data sets with finite MLEs (n = 9467)    MLE Wald      0.006 (−0.002)                96.86                  1.67        1.47        4.32
                                         MLE Profile                                 95.06                  2.70        2.24        4.44
                                         PLE Wald∗     −0.004 (−0.011)               97.63                  1.27        1.10        4.02
                                         PLE Profile                                 96.30                  1.99        1.71        4.12

Normal covariate ($\beta = 0.0$)

Data sets                                Method        Mean $\hat{\beta}$ (Median)   Per cent CI coverage   % < Lower   % > Upper   Median length
All data sets (n = 10 000)               MLE Wald      — (0.008)                     96.32                  1.98        1.70        2.19
                                         MLE Profile                                 94.63                  2.83        2.54        2.26
                                         PLE Wald∗     0.001 (0.010)                 97.22                  1.48        1.30        2.04
                                         PLE Profile                                 95.85                  2.15        2.00        2.09
Data sets with finite MLEs (n = 9467)    MLE Wald      0.001 (0.007)                 96.23                  2.03        1.74        2.18
                                         MLE Profile                                 94.39                  2.93        2.68        2.23
                                         PLE Wald∗     0.001 (0.009)                 97.17                  1.51        1.32        2.03
                                         PLE Profile                                 95.82                  2.20        1.98        2.09

(b) Model 2 with parameters $\boldsymbol{\beta}_1^T = (-1.4, 2, 1)$, $\boldsymbol{\beta}_2^T = (-1.4, 2, 1)$

Binary covariate ($\beta = 2.0$)

Data sets                                Method        Mean $\hat{\beta}$ (Median)   Per cent CI coverage   % < Lower   % > Upper   Median length
All data sets (n = 10 000)               MLE Wald      — (2.18)                      97.45                  0.33        2.22        4.39
                                         MLE Profile                                 96.22                  1.71        2.07        4.60
                                         PLE Wald∗     2.00 (1.94)                   97.17                  0.26        2.57        4.05
                                         PLE Profile                                 95.78                  1.60        2.62        4.15
Data sets with finite MLEs (n = 8699)    MLE Wald      2.10 (2.07)                   97.26                  0.38        2.36        4.32
                                         MLE Profile                                 95.78                  1.84        2.38        4.44
                                         PLE Wald∗     1.85 (1.85)                   97.20                  0.07        2.74        3.99
                                         PLE Profile                                 96.64                  0.57        2.78        4.08

Normal covariate ($\beta = 1.0$)

Data sets                                Method        Mean $\hat{\beta}$ (Median)   Per cent CI coverage   % < Lower   % > Upper   Median length
All data sets (n = 10 000)               MLE Wald      — (1.10)                      96.01                  1.77        2.22        2.71
                                         MLE Profile                                 94.20                  4.03        1.77        2.84
                                         PLE Wald∗     1.00 (0.94)                   96.00                  0.74        3.26        2.45
                                         PLE Profile                                 95.45                  1.90        2.65        2.56
Data sets with finite MLEs (n = 8699)    MLE Wald      1.19 (1.10)                   95.92                  1.83        2.25        2.67
                                         MLE Profile                                 93.39                  4.63        1.98        2.73
                                         PLE Wald∗     1.00 (0.94)                   95.98                  0.78        3.24        2.42
                                         PLE Profile                                 95.40                  1.94        2.66        2.52
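The coverage and endpoint columns of Table IV are tallies over the simulated data sets, and in each row the three percentages sum to 100. A minimal Python sketch of such a tally follows. It assumes that "% < Lower" counts intervals whose lower endpoint lies above the true value and "% > Upper" counts intervals whose upper endpoint lies below it; that reading, and the function name `tally_coverage`, are illustrative assumptions rather than details taken from the paper.

```python
import math


def tally_coverage(intervals, true_value):
    """Summarise a list of (lower, upper) CIs against the true parameter value.

    Returns percentages: intervals covering the truth, intervals that exclude
    it at the lower endpoint (truth below lower), and at the upper endpoint.
    """
    n = len(intervals)
    below = sum(1 for lo, hi in intervals if true_value < lo)
    above = sum(1 for lo, hi in intervals if true_value > hi)
    return {
        "coverage": 100.0 * (n - below - above) / n,
        "lower": 100.0 * below / n,
        "upper": 100.0 * above / n,
    }


# Hypothetical intervals for a true value of 0; the last one is open-ended,
# as can happen for an MLE profile interval when the estimate is infinite.
example = [(-1.0, 1.0), (0.5, 2.0), (-3.0, -0.2), (-0.5, math.inf)]
print(tally_coverage(example, 0.0))
```

An open-ended interval can never exclude the parameter on its infinite side, which is one way such intervals can shift the balance between the two endpoint columns.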


Figure 2. Distributions of estimates, with density estimation by kernel smoothing, for the binary covariate parameter in $\boldsymbol{\beta}_1^T$, and the corresponding Wald and Profile CI lengths, over 10 000 replicates of data sets of size n = 50 from Model 2 with parameters $\boldsymbol{\beta}_1^T = (-1.4, 2, 1)$, $\boldsymbol{\beta}_2^T = (-1.4, 2, 1)$ for: (a) MLEs in 9467 data sets without separations (all MLEs are finite); (b) PLEs in 9467 data sets without separations (all MLEs are finite); and (c) PLEs in all 10 000 data sets.

in either the first or the last decile of the PLE estimates, i.e. when the estimate observed in a particular data set was far from the true parameter value.

(6) The median length of the PLE interval is shorter than that of the MLEs (Table IV), although the median interval length for all methods increases when data sets with separations are included, due to the CIs being longer (or infinite) for large (or infinite) estimates. The secondary mode in the distribution of the PLE CI length corresponds to the data sets with separation (Figure 2), and reflects the abrupt transition from


lack of separation to complete or partial separation associated with the discreteness of the response frequencies. The PLE profile interval was shorter than the MLE profile interval in all replicates.

As the sample size increases from 25 through 50 and 100 to 200, coverage improves and differences among the methods diminish (see Figure 3). The performance of the Wald-type intervals is particularly sensitive to the sample size and the PLE Wald interval is clearly inadequate whether the standard error is obtained from A∗ or approximated by A. Attempting to estimate six parameters with a sample size of only 25 yields conservative but relatively uninformative CIs, and more than 50 per cent of the data sets for the model with non-zero parameters have separation, producing distortion in the CI coverage among the data sets without separations. Even for moderate sample sizes of 100, the MLE profile interval tends to exhibit less than nominal coverage, particularly at the lower endpoint when the underlying parameter is non-zero. In contrast, the PLE profile interval provides close to nominal or modest overcoverage in all data sets for both zero and non-zero parameters.

Figure 3. Proportions of data sets (per cent) in which the 95 per cent CI for the binary covariate parameter in $\boldsymbol{\beta}_1^T$ excludes the true parameter value at the upper endpoint (above) and the lower endpoint (below), over 10 000 replicates of data sets of size n = 25, 50, 100, and 200 generated from Model 1 or 2. The proportions at the upper endpoint are plotted as (1 − the proportion) to show departures from the nominal 97.5 per cent. Model 1 with parameters $\boldsymbol{\beta}_1^T = (-1.4, 0, 0)$, $\boldsymbol{\beta}_2^T = (-1.4, 0, 0)$ for Upper left: replicates without separations: all MLEs are finite in 6852, 9467, 9991, and 10 000 data sets of size 25, 50, 100, 200, respectively; Upper right: all 10 000 data sets. Model 2 with parameters $\boldsymbol{\beta}_1^T = (-1.4, 2, 1)$, $\boldsymbol{\beta}_2^T = (-1.4, 2, 1)$ for Lower left: replicates without separations: all MLEs are finite in 4850, 8699, 9909, and 9999 data sets of size 25, 50, 100, 200, respectively; Lower right: all 10 000 data sets.

5. DISCUSSION

In this report, we present methods to construct CIs for multinomial logistic regression parameters that perform better than standard methods in sparse data sets. Our investigations demonstrate several advantages for the penalized maximum likelihood estimator. This method always yields finite estimates and CI endpoints for logistic regression parameters. In addition to having smaller bias and MSE than conventional MLEs [5], the profile CIs for the PLEs are generally more accurate with shorter length and achieve close to or greater than nominal coverage. Although profile CIs (with one finite endpoint) can be defined for conventional MLEs when an estimate is non-finite, these intervals have less than nominal coverage in some cases, and the absence of a finite point estimate of a parameter will usually be unsatisfactory.

Even in data sets without separation, in which all parameter estimates are finite, penalized likelihood profile CIs have comparable coverage and shorter length than CIs based on conventional ML estimation. Clogg et al. [24], Firth [3], and others have noted the equivalence between a Bayesian estimator based on the Jeffreys prior and the correction that adds 1/2 to each cell in a 2 × 2 table [25]. In the simple case of a multinomial response and a binary covariate, which can be summarized in a (J + 1) × 2 contingency table, the PLEs correspond to the usual log odds ratios calculated from cross products from a table in which 1/2 has been added to each of the table cells. The penalized likelihood estimator examined here is a generalization of Haldane’s


estimator to a general multinomial logistic regression setting with multiple covariates. The classical Woolf and Gart logit intervals studied in the 2 × 2 table by Agresti [22] are thus equivalent to the MLE Wald and PLE Wald∗ intervals, respectively, in a binomial logistic model with one binary covariate. Our results suggest that a PLE profile interval corresponding to Haldane’s estimate would perform better than the Gart Wald interval, and similarly or better than the MLE profile interval, at least for moderate effect sizes. Rather than smoothing cell probabilities toward equiprobability as in the Gart interval, Agresti recommended an alternative logit interval based on smoothing toward independence. For binomial logistic regression with multiple categorical covariates, Clogg et al. proposed a general class of shrinkage estimators, with Wald-type CIs for the regression parameters, including both equiprobability and independence smoothing as special cases. The former involves shrinking all regression parameters toward zero, as in the method investigated here, while the latter shrinks the slope parameters toward zero and the intercept toward the marginal response distribution. While Clogg et al. found that independence smoothing performed better for prediction when the response distribution was skewed, and gave sensible inferences in sparse data regressions where maximum likelihood failed, adjustment in this way does not fully remove the bias in the slope parameter estimates (see, for example, the simulations reported in Reference [7]), which will usually be unsatisfactory when estimates of association via the log odds ratio are of interest. As the sample size increases, the penalized estimates become equivalent to the usual MLEs, so routine application of penalized estimation appears to bear only the cost of implementation and additional computation [26, 27].
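The contrast between the Gart Wald interval and a profile interval based on Haldane’s estimate can be illustrated in the simplest setting, a binomial response with one binary covariate. The Python sketch below is an illustration under stated assumptions, not the authors’ implementation: it relies on the fact that, for this saturated 2 × 2 model, the Jeffreys penalty amounts to adding 1/2 to each cell, and it locates the profile endpoints by stepping out from the estimate and then bisecting, in the spirit of the bracketing idea in Appendix A. All function names are hypothetical.

```python
import math

Z_975 = 1.959963984540054         # 0.975 quantile of the standard normal
CHI2_95_1DF = 3.841458820694124   # 0.95 quantile of chi-square with 1 df


def expit(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))


def augment(y0, m0, y1, m1, add):
    """Add `add` to each cell; y_i = successes, m_i = failures at x = i."""
    cells = tuple(c + add for c in (y0, m0, y1, m1))
    if min(cells) <= 0:
        raise ValueError("all (augmented) cells must be positive")
    return cells


def wald_ci(y0, m0, y1, m1, add=0.5):
    """Logit interval: add=0.5 gives the Gart form, add=0 the Woolf form."""
    y0, m0, y1, m1 = augment(y0, m0, y1, m1, add)
    est = math.log(y1 * m0 / (y0 * m1))
    se = math.sqrt(1 / y0 + 1 / m0 + 1 / y1 + 1 / m1)
    return est, est - Z_975 * se, est + Z_975 * se


def profile_loglik(beta, y0, m0, y1, m1):
    """Maximise the log-likelihood over the intercept for a fixed slope beta."""
    n0, n1 = y0 + m0, y1 + m1
    lo, hi = -30.0 - abs(beta), 30.0 + abs(beta)
    for _ in range(200):  # bisection on the intercept score, which is decreasing
        alpha = 0.5 * (lo + hi)
        score = y0 + y1 - n0 * expit(alpha) - n1 * expit(alpha + beta)
        if score > 0:
            lo = alpha
        else:
            hi = alpha
    alpha = 0.5 * (lo + hi)
    return (y0 * alpha - n0 * log1pexp(alpha)
            + y1 * (alpha + beta) - n1 * log1pexp(alpha + beta))


def profile_ci(y0, m0, y1, m1, add=0.5, step=0.5, tol=1e-8):
    """95 per cent profile interval for the log odds ratio of the augmented table."""
    cells = augment(y0, m0, y1, m1, add)
    y0, m0, y1, m1 = cells
    est = math.log(y1 * m0 / (y0 * m1))
    l0 = profile_loglik(est, *cells) - 0.5 * CHI2_95_1DF

    def endpoint(direction):
        inside, outside = est, est + direction * step
        while profile_loglik(outside, *cells) >= l0:  # step out until below l0
            inside, outside = outside, outside + direction * step
        while abs(outside - inside) > tol:            # bisect the bracket
            mid = 0.5 * (inside + outside)
            if profile_loglik(mid, *cells) >= l0:
                inside = mid
            else:
                outside = mid
        return 0.5 * (inside + outside)

    return est, endpoint(-1.0), endpoint(1.0)


# Hypothetical table with an empty cell: 10/0 at x = 0 and 5/5 at x = 1.
print(wald_ci(10, 0, 5, 5))
print(profile_ci(10, 0, 5, 5))
```

Both intervals share the same finite point estimate even though one cell is empty; the Wald interval is symmetric about it by construction, whereas the profile interval need not be.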
We concur with the conclusion of Heinze and Schemper [8] that the penalized likelihood estimation procedure originally developed by Firth [3] to reduce finite sample bias of maximum likelihood estimates is an attractive solution to the problem of separation, and conclude in addition that it may be especially valuable in multinomial logistic regression models where the number of parameters can be large relative to the sample size. We recommend that profile CIs be used routinely for the PLEs, and emphasize that profile CIs based on the standard likelihood MLEs are not appropriate in sparse data sets when finite sample bias or data separation is likely to occur.

APPENDIX A: COMPUTATION OF PROFILE LIKELIHOOD CONFIDENCE INTERVALS

The endpoints of profile likelihood $\alpha$-level confidence intervals for $\beta_{jp}$ are given by the values $s \in \mathbb{R}$ defining the intersection of the ‘profile’ curve $\boldsymbol{\beta}(s)$ with the ‘level curve’ $l(\boldsymbol{\beta}) = l_0$, where $l_0 = l(\hat{\boldsymbol{\beta}}) - \frac{1}{2} q_{1-\alpha}$ and $q_{1-\alpha}$ is the $1 - \alpha$ quantile of a $\chi^2$ distribution with one df. As described in the documentation for SAS PROC LOGISTIC [17], to obtain an iterative algorithm for computing the points of intersection of the profile and the level curves, the log-likelihood (or log-penalized likelihood) function is approximated in a neighbourhood of $\boldsymbol{\beta}$ by the quadratic function

$$l_q(\boldsymbol{\beta} + \boldsymbol{\delta}) = l(\boldsymbol{\beta}) + \boldsymbol{\delta}^T U + \tfrac{1}{2} \boldsymbol{\delta}^T A \boldsymbol{\delta} \tag{A1}$$

where $U = U(\boldsymbol{\beta})$ and $A = A(\boldsymbol{\beta})$. The increment $\boldsymbol{\delta}$ for the next iteration is obtained by solving the equations

$$\frac{d}{d\boldsymbol{\delta}} \left\{ l_q(\boldsymbol{\beta} + \boldsymbol{\delta}) + \lambda \left( e_{jp}^T (\boldsymbol{\beta} + \boldsymbol{\delta}) - s \right) \right\} = 0 \tag{A2}$$

where $\lambda$ is a Lagrange multiplier, $e_{jp}$ is the unit vector that extracts the element corresponding to $\beta_{jp}$, and $s$ is an unknown parameter determined by the condition $l_q(\boldsymbol{\beta} + \boldsymbol{\delta}) = l_0$. Although iteration from $\hat{\boldsymbol{\beta}}$ is recommended [17], in our experience starting at the MLE (PLE) does not always guarantee convergence. A better starting point can be obtained by ‘bracketing’ the solution of

$$\{ \boldsymbol{\beta}(s) : s \in \mathbb{R}^1 \} \cap \{ l(\boldsymbol{\beta}) = l_0 \} \tag{A3}$$

using bisection. For this we find a point $R^1 = \boldsymbol{\beta}(s_1)$ such that $l(R^1) < l_0$. Because $l(\hat{\boldsymbol{\beta}}) = l(\boldsymbol{\beta}(\hat{\beta}_{jp})) > l_0$, there is a solution to (A3) given by $\boldsymbol{\beta}(t)$, with $t \in (s_1, \hat{\beta}_{jp})$. The point $R^2 = \boldsymbol{\beta}((s_1 + \hat{\beta}_{jp})/2)$ is now closer to such a solution. Bisection can continue in this fashion for a few iterations to find a better starting point for (A2) or it can continue until convergence is attained. The latter provides a slower but more robust way to find the endpoints of profile CIs.

To find $R^1$ we move in small steps $t > 0$ in the (positive or negative) direction $e_{jp}$, i.e. parallel to the $\beta_{jp}$ axis and starting at $\hat{\boldsymbol{\beta}}$. Setting $S^0 = \hat{\boldsymbol{\beta}}$ and $S^n = \boldsymbol{\beta}(S^{n-1}_{jp} + t)$, we iterate until $l(S^n) < l_0$, in which case we set $R^1 = S^n$. This procedure requires computing $\boldsymbol{\beta}(s)$ for $s \in \mathbb{R}^1$. Since the curve $\boldsymbol{\beta}: \mathbb{R}^1 \to \mathbb{R}^{JP}$ is defined for each $s \in \mathbb{R}^1$ as the solution to the maximization problem $\boldsymbol{\beta}(s) = \operatorname{argmax}\, l(\boldsymbol{\beta})$ subject to $\beta_{jp} = s$, this can be done using an iterative algorithm similar to (A2). With $s$ fixed, setting the derivative in (A2) to zero gives $U + A\boldsymbol{\delta} + \lambda e_{jp} = 0$, so the solution is

$$\boldsymbol{\delta} = -A^{-1}(U + \lambda e_{jp}) \quad \text{with} \quad \lambda = -\frac{s - \beta_{jp} + e_{jp}^T A^{-1} U}{e_{jp}^T A^{-1} e_{jp}},$$

where $\lambda$ follows from the constraint $e_{jp}^T(\boldsymbol{\beta} + \boldsymbol{\delta}) = s$. Iteration proceeds until convergence is achieved, with step halving performed at each step to ensure the procedure is going uphill in $l$.

ACKNOWLEDGEMENTS

We thank the reviewers for their careful reading and constructive comments. This research was supported by the Natural Sciences and Engineering Research Council of Canada and the Network for Centres of Excellence in Mathematics (Canada). SBB holds a Senior Investigator Award from the Canadian Institutes of Health Research.

REFERENCES

1. Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984; 71:1–10.
2. Lesaffre E, Albert A. Partial separation in logistic discrimination. Journal of the Royal Statistical Society, Series B 1989; 51:109–116.
3. Firth D. Bias reduction of maximum likelihood estimates. Biometrika 1993; 80:27–38.
4. Ibrahim JG, Laud PW. On Bayesian analysis of generalized linear models using Jeffreys’s prior. Journal of the American Statistical Association 1991; 86:981–986.
5. Bull SB, Mak C, Greenwood CMT. A modified score function estimator for multinomial logistic regression in small samples. Computational Statistics and Data Analysis 2002; 39:57–74.
6. Begg CB, Gray R. Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika 1984; 71:11–18.
7. Bull SB, Greenwood CMT, Hauck WW. Jackknife bias reduction for polychotomous logistic regression. Statistics in Medicine 1997; 16:545–560; Statistics in Medicine 1997; 16:2928 (Correction).
8. Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Statistics in Medicine 2002; 21:2409–2419.
9. Jennings DE. Judging inference adequacy in logistic regression. Journal of the American Statistical Association 1986; 81:471–476.
10. Hauck WW, Donner A. Wald’s test as applied to hypothesis testing in logit analysis. Journal of the American Statistical Association 1977; 81:471–476.


11. Alho JM. On the computation of likelihood and score test based confidence intervals in generalized linear models. Statistics in Medicine 1992; 11:923–990.
12. Heinze G. The application of Firth’s procedure to Cox and logistic regression. Technical Report 10, Department of Medical Computer Sciences, Section of Clinical Biometrics, University of Vienna, 1999 (updated 2001).
13. Bull SB, Lewinger JP. Confidence intervals for logistic regression in sparse data. Proceedings of the 23rd European Meeting of Statisticians, Tecnopolo Funchal, Madeira, Portugal, Revista de Estatistica (Statistical Review) 2001; 2:69–70.
14. Cox DR, Snell EJ. A general definition of residuals. Journal of the Royal Statistical Society, Series B 1968; 30:248–275.
15. Greenland S. Small-sample bias and corrections for conditional maximum-likelihood odds-ratio estimators. Biostatistics 2000; 1:113–122.
16. Agresti A. On logit confidence intervals for the odds ratio with small samples. Biometrics 1999; 55:597–602.
17. SAS Institute Inc. SAS OnlineDoc, Version 8, The LOGISTIC Procedure, Confidence Intervals for Parameters, Chapter 39, Section 26, Cary, NC: 1999.
18. Aptech Systems Incorporated. The GAUSS System. Version 5.0.26 (June 2003), Kent, Washington, 1990.
19. Bull SB, Lewinger JP, Lee SSF. Penalized maximum likelihood estimation for multinomial logistic regression using the Jeffreys prior. Technical Report No. 0505, Department of Statistics, University of Toronto, 2005.
20. Albert A, Harris EK. Multivariate Interpretation of Clinical Laboratory Data. Dekker: New York, 1984.
21. Blajchman MA, Bull SB, Feinman SV, for the Canadian post-transfusion hepatitis prevention Study Group. Post-transfusion hepatitis: impact of the non-A non-B hepatitis surrogate tests. The Lancet 1995; 345:21–25.
22. Agresti A. Categorical Data Analysis, Chapter 10. Wiley: New York, 1990.
23. Cytel Software Corporation. StatXact-5 for windows, statistical software for exact nonparametric inference. User Manual, Version 5, Chapter 8. Cambridge, MA; 2001.
24. Clogg CC, Rubin DB, Schenker N, Schultz B, Weidman L. Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. Journal of the American Statistical Association 1991; 86:68–78.
25. Haldane JBS. The estimation and significance of the logarithm of a ratio of frequencies. Annals of Human Genetics 1956; 20:309–311.
26. Heinze G, Ploner M. Fixing the nonconvergence bug in logistic regression with SPLUS and SAS. Computer Methods and Programs in Biomedicine 2003; 71:181–187.
27. Cytel Software Corporation. LogXact7, discrete regression software featuring exact methods. User Manual, Chapter 7, November 2005. Cambridge, MA.
