A COMPARATIVE STUDY ON THE PERFORMANCE OF FREQUENTIST AND BAYESIAN ESTIMATION METHODS UNDER SEPARATION IN LOGISTIC REGRESSION

Separation is one of the most commonly encountered estimation problems in logistic regression, and it often occurs with small and medium sample sizes. The method of maximum likelihood (MLE; [8]) provides spuriously high parameter estimates and standard errors under separation in logistic regression. Many researchers in the social sciences rely on simple but ad-hoc solutions to overcome this issue, such as the "doing nothing" strategy, removing variable(s) from the model, or combining the levels of the categorical variable causing separation. The limitations of these basic solutions have motivated researchers to use more appropriate and innovative estimation techniques to deal with the problem. However, the performance of these techniques has not been fully investigated and compared yet. The main goal of this paper is to close this research gap by comparing the performance of frequentist and Bayesian estimation methods for coping with separation. A simulation study is performed to investigate the performance of asymptotic, bootstrap-based, and Bayesian estimation techniques with respect to bias, precision, and accuracy measures under separation. In line with the simulation study, a real-data example illustrates how these methods can be used to handle separation in logistic regression.


Introduction
The logistic regression is a well-founded analysis technique that can be utilized to determine the relationship between a dichotomous outcome and a set of categorical and/or continuous predictors. Although researchers in the social sciences often do not encounter challenges in applying this technique to their data sets, complications may arise when a linear combination of predictors perfectly allocates the values of the outcome, which is called the separation problem [1]. To illustrate the separation problem in logistic regression, consider the simplest scenario in which a dichotomous response is predicted by a single continuous predictor. Suppose that the outcome has the values $R = \{0, 0, 0, 0, 0, 1, 1, 1, 1, 1\}$ and the predictor has the values $P = \{2, 7, 3, 5, 6, 9, 14, 10, 12, 16\}$. In this case, the values of the response are zero when the values of the predictor are smaller than 8, and the values of the response are one when the values of the predictor are greater than 8. This implies that the probability of observing a zero or a one is perfectly predicted (known as complete separation) and there is nothing left to be estimated. When separation occurs, the method of maximum likelihood (MLE; [8]) does not provide a reliable set of parameter estimates and standard errors, which in turn yields undependable test statistics. Many researchers benefit from basic (but ad-hoc) solutions to overcome separation in logistic regression.
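As a quick illustration (a minimal sketch using the toy values above), fitting this model with R's built-in glm() triggers the familiar warning that fitted probabilities numerically 0 or 1 occurred, and the estimate and standard error for the predictor become extreme:

    # Toy data exhibiting complete separation
    y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
    x <- c(2, 7, 3, 5, 6, 9, 14, 10, 12, 16)
    fit <- glm(y ~ x, family = binomial(link = "logit"))
    # Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    summary(fit)  # spuriously large estimate and standard error for x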
Since separation does not necessarily have a negative influence on all parameters in the model, some researchers do not pay special attention to this issue and simply report their results in terms of chi-square test statistics, although these statistics are only correct for the non-problematic variables in the data. However, these variables often interact with problematic ones, and thus the estimates and standard errors of these interactions should not be trusted either. Moreover, if the variable causing separation is categorical, the estimates obtained for other variables in the model are not interpretable, since they are determined relative to the reference level of this categorical variable. Some researchers avoid these issues by removing the problematic variable(s) from the model. However, this approach has two main drawbacks. First, discarding an important variable may result in an inappropriate model specification, and consequently a set of biased estimates of the model parameters, which is known as the omitted variable bias [24]. Second, even if a predictor causing separation has an insignificant (or weakly significant) effect on the outcome, caution should be taken when eliminating this variable from the model, since it can be a confounder. That is, the relationship between this variable and the outcome may influence the outcome's associations with other variables in the model. Another common way of coping with this issue is combining the levels of the variable causing separation, which is only applicable when this variable is categorical. This approach is not recommended either, not only because collapsing categories alters the research question at hand, but also because it may cause a loss of information in the data [1].
In response to these challenges, many researchers focus on more complicated but powerful data analysis techniques to deal with separation in logistic regression. Heinze and Schemper [14] compare the performance of Firth's penalized maximum likelihood estimation (PMLE; [7]) against the method of maximum likelihood [8], an imputation method using Bayesian logistic regression [3], and exact logistic regression [22]. This study is limited in the sense that it investigates the performance of only these four methods, and only with respect to bias measures. In the discussion of their study, they suggest the use of Firth's method to cope with separation in logistic regression. Moreover, they state that the separation problem may not only occur in the original sample, but may also occur in bootstrap samples. However, they do not inspect the performance of Firth's method in the context of bootstrapping. Ohkura and Kamakura [28] utilized nonparametric bootstrapping in conjunction with Firth's method to compare the performance of their bootstrap-based test against the Wald and Firth's tests under separation. However, the performance of Firth's method with nonparametric bootstrapping has not been compared against Bayesian estimation methods and the usual Firth's method with respect to bias, precision, and accuracy measures. This study aims at filling this gap by investigating and comparing the performance of frequentist and Bayesian estimation methods with respect to these three measures. Here, the frequentist way of coping with separation is implemented using Firth's method [7] and its counterpart with nonparametric bootstrapping [6]. The choice of prior distribution is crucial when solving separation in logistic regression with Bayesian methods. Thus, Markov chain Monte Carlo (MCMC) algorithms are utilized as Bayesian solutions to separation using seven different priors.
The outline of the paper is as follows. In Sections 2 and 3, logistic regression and the separation problem in logistic regression are elaborated, respectively. In Section 4, three methods used to obtain the estimates of model parameters and their standard errors under separation are described. In Section 5, a simulation study is performed to investigate and compare the performance of these methods with respect to bias, precision, and accuracy measures. In Section 6, a real-life example is presented to exemplify how to deal with separation using these estimation techniques in logistic regression. The paper concludes with a brief discussion.

Logistic Regression Modeling
The logistic regression is one of the most commonly used analysis techniques to predict a binary outcome (containing zeros and ones) in the context of generalized linear models [21]. The logistic regression model is defined as:

$$f(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta}, \qquad i = 1, 2, \ldots, N,$$

where $\mu_i = E(y_i)$ is the expected value of the binary outcome for the $i$th observation, $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_{P-1})^T \in \mathbb{R}^{P \times 1}$ is the vector of model parameters, and the rows $\mathbf{x}_i^T = (1, x_{i1}, x_{i2}, \ldots, x_{i(P-1)})$ form the design matrix in $\mathbb{R}^{N \times P}$, containing ones in the first column as the coefficients of the intercept, $\beta_0$, and the values of the explanatory variables in the data, respectively. The logit link function,

$$f(\mu_i) = \log\left(\frac{\mu_i}{1 - \mu_i}\right),$$

relates the expected value of the outcome, which is also known as the conditional probability of success, to the linear predictor $\mathbf{x}_i^T \boldsymbol{\beta}$.
Since the outcome containing 0's and 1's has a Bernoulli distribution with probability of success $\mu_i$ for the $i$th observation, the likelihood function of the data can be defined as follows:

$$L(\boldsymbol{\beta} \mid y_1, y_2, \ldots, y_N) = \prod_{i=1}^{N} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i},$$

where $y_i \in \{0, 1\}$ for $i = 1, 2, \ldots, N$. The likelihood function above is not easy to differentiate, and thus it is transformed from the original scale to the log scale:

$$\log L(\boldsymbol{\beta} \mid y_1, y_2, \ldots, y_N) = \sum_{i=1}^{N} \left[ y_i \log(\mu_i) + (1 - y_i) \log(1 - \mu_i) \right].$$

The $\beta$'s are estimated by maximizing the log likelihood function above using the method of maximum likelihood [8], so that the data at hand have the highest probability of being observed. This is done by differentiating the log likelihood function with respect to the $\beta$'s, setting the resulting functions to zero, and solving the equations for each of the $\beta$'s, respectively.
Since the maximum likelihood estimates of the model parameters, the $\hat\beta$'s, and their standard errors do not have closed-form solutions, they are obtained numerically. This can be achieved quickly and conveniently by utilizing computer-intensive iterative methods such as the Newton-Raphson algorithm [27], sketched below. However, there are situations in which even the numerical methods fail to provide parameter estimates and their standard errors.
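To make the iterative scheme concrete, the following minimal sketch implements Newton-Raphson iterations for the logistic regression log likelihood above (all function and variable names are illustrative, not from the paper):

    # Newton-Raphson for logistic regression MLE (illustrative sketch)
    newton_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
      beta <- rep(0, ncol(X))                 # starting values
      for (iter in 1:max_iter) {
        mu <- 1 / (1 + exp(-X %*% beta))      # fitted probabilities
        score <- t(X) %*% (y - mu)            # gradient of the log likelihood
        W <- diag(as.vector(mu * (1 - mu)))   # Bernoulli variance weights
        info <- t(X) %*% W %*% X              # Fisher information matrix
        step <- solve(info, score)            # Newton step
        beta <- beta + step
        if (max(abs(step)) < tol) break       # convergence check
      }
      list(beta = as.vector(beta), se = sqrt(diag(solve(info))))
    }

Under separation, the information matrix becomes nearly singular and the steps keep growing, so the estimates diverge instead of converging. In the next section, this situation, called the separation problem, will be elaborated.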

Separation Problem
The logistic regression cannot always be easily used to predict a dichotomous outcome containing zeros and ones. One common issue that arises when estimating model parameters and their standard errors in logistic regression is the (nearly) perfect allocation of the values of the outcome in the data at hand, which is called the (quasi-)complete separation problem [1]. In a regular situation in which there is no (quasi-)complete separation, the expected probabilities of the outcome for a logistic regression model take values between 0 and 1. In complete separation, since a linear function of the predictor(s) perfectly predicts the outcome, the expected probabilities are either 0 or 1 (and not between these values). Similarly, in quasi-complete separation, since the values of the outcome are almost perfectly predicted, almost all expected probabilities (but not all of them) are either 0 or 1. Figure 1 is created based on two empirical data sets given in the study of [33, p. 276]; it shows the scatter plot of the values of the outcome against those of a linear predictor in the presence of complete and quasi-complete separation. As can be seen in the left panel of the figure for the first data set, the values of the linear predictor perfectly separate the values of the outcome. Thus, only by observing the plot, we can make a perfect inference about the predicted values of the outcome. That is, the predicted values of the outcome take the value of zero when the linear predictor is smaller than zero and the value of one when the linear predictor is larger than zero. Similarly, as can be seen in the right panel of the figure for the second data set, the values of the linear predictor nearly perfectly separate the values of the outcome, which is a sign of quasi-complete separation. In this case, the predicted values of the outcome take the value of zero, a value between zero and one (only for three observations), and the value of one, when the linear predictor is smaller than zero, equal to zero, and larger than zero, respectively. Next, the estimation methods that can be used to deal with separation will be elaborated.

Estimation Methods
Separation [1] often occurs with small and medium sample sizes when estimating model parameters and their standard errors in logistic regression. The Newton-Raphson algorithm used to obtain the MLEs does not converge for (some of) the model parameters when the data suffer from separation. This nonconvergence causes spuriously high parameter estimates and standard errors [33, pp. 282-283] and results in unreliable test statistics and hypothesis tests. In response to this challenge, researchers have been paying attention to estimation techniques that are more appropriate than MLE for overcoming separation in logistic regression. In the sequel, three such advanced estimation methods will be elaborated, respectively.
Firth's method: Firth [7] proposed a method to improve the parameter estimates in logistic regression by reducing the bias that occurs with small samples when using the method of maximum likelihood for estimation. Since Firth's method incorporates a penalizing factor into the log likelihood in (4), it is also known as the method of penalized maximum likelihood estimation. Firth's penalized log likelihood function is defined as:

$$L^*(\boldsymbol{\beta} \mid y_1, y_2, \ldots, y_N) = L(\boldsymbol{\beta} \mid y_1, y_2, \ldots, y_N) + \frac{1}{2} \log |I(\boldsymbol{\beta})|,$$

where $L(\cdot)$ is the log likelihood in (4) and $I(\boldsymbol{\beta})$ is the Fisher information matrix. Heinze and Schemper [14] have adopted the penalized log likelihood function above to overcome separation in the analysis of two cancer studies. Firth's method is flexible in the sense that it can be incorporated into nonparametric resampling techniques when estimating model parameters and their standard errors.
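In R, Firth's penalized maximum likelihood estimates can be obtained, for example, with the logistf package (a minimal sketch; the data frame dat and its variables are hypothetical):

    # Firth's penalized maximum likelihood estimation
    # install.packages("logistf")
    library(logistf)
    fit_firth <- logistf(y ~ x1 + x2, data = dat)  # penalized log likelihood
    summary(fit_firth)  # finite estimates and standard errors even under separation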
Firth's method with nonparametric bootstrapping: Nonparametric bootstrapping [6] is a resampling (with replacement) technique that can be used as an alternative to the method of maximum likelihood to obtain MLEs and their standard errors when model assumptions are not satisfied (see [34], [15, p. 44]). Nonparametric bootstrapping uses the information given in the original sample to generate, for example, B = 1000 bootstrap samples, in each of which the model parameters are estimated using the method of maximum likelihood. Subsequently, it calculates the averages and standard deviations of the bootstrap estimates across these samples to obtain the overall parameter estimates and their standard errors.
The usual nonparametric bootstrapping, using the method of maximum likelihood for estimation in each bootstrap sample, assumes that the original sample adequately represents the population of interest, which is often not a reasonable assumption for small samples. Thus, since separation usually occurs with small and medium samples, it is not recommended to use nonparametric bootstrapping in conjunction with MLEs under separation. Nonparametric bootstrapping can still be used for a small or medium sample in the context of logistic regression when the data suffer from separation. This can be done by replacing the MLEs with the PMLEs obtained using Firth's method in each bootstrap sample, as sketched below. The method of maximum likelihood and nonparametric bootstrapping with MLEs produce biased estimates with small samples [15], and thus they should not be used to overcome separation in logistic regression. Bayesian methods are good alternatives to Firth's method and nonparametric bootstrapping with PMLEs for dealing with separation in logistic regression.
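The following is a minimal sketch of this combination, resampling rows with replacement and applying Firth's method in each bootstrap sample (the data frame dat is again hypothetical):

    # Nonparametric bootstrap with Firth's PMLE in each bootstrap sample
    library(logistf)
    B <- 1000
    boot_est <- replicate(B, {
      idx <- sample(nrow(dat), replace = TRUE)       # resample rows
      coef(logistf(y ~ x1 + x2, data = dat[idx, ]))  # PMLE in this sample
    })
    rowMeans(boot_est)        # overall parameter estimates
    apply(boot_est, 1, sd)    # bootstrap standard errors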
Bayesian approach using MCMC algorithms: Bayesian estimation using Markov chain Monte Carlo (MCMC) algorithms benefits from prior knowledge on the distribution of the model parameters and from the information in the data at hand to generate posterior samples, which are in turn utilized to obtain parameter estimates and their standard errors. Metropolis-Hastings [13,23], Gibbs sampling [10], and Hamiltonian Monte Carlo (HMC; [2,5,26]) are three of the best known MCMC algorithms that can be used to obtain the estimates of model parameters and their standard errors for small samples in logistic regression. The HMC (also known as Hybrid Monte Carlo) and Gibbs sampling algorithms are used for Bayesian estimation in this paper via the R packages "rstanarm" [12], "runjags" [4], and "bayesreg" [19].
Rainey [29] suggests utilizing two priors when estimating model parameters using Bayesian approaches under separation in logistic regression: Jeffreys' invariant prior [16], [35] and a weakly informative Cauchy(0, 2.5) prior [9]. The Bayesian approach using Jeffreys' prior is equivalent to Firth's penalized maximum likelihood estimation method, since the penalty part of the log likelihood function in (5), $\frac{1}{2} \log |I(\boldsymbol{\beta})|$, is equal to the log of Jeffreys' prior in logistic regression [29]. Moreover, using the weakly informative Cauchy(0, 2.5) prior to cope with separation in logistic regression is highly controversial. Ghosh, Li and Mitra [11] state that the Cauchy(0, 2.5) prior incorporates insufficient information into the analysis to overcome separation in logistic regression. They show that using the Cauchy(0, 2.5) prior may cause spuriously high posterior means for parameters in the presence of separation and may not even enable researchers to obtain these means. Their results suggest using weakly informative priors with lighter tails than the Cauchy(0, 2.5) prior, such as Normal and Student-t (df = 7) priors. Thus, in addition to the Cauchy(0, 2.5) prior, a weakly informative Normal(0, 2.5) prior (the default prior for regression coefficients in rstanarm) and a Student-t(0, 2.5, df = 7) prior will be utilized to obtain parameter estimates and their standard errors, as illustrated below.
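These weakly informative priors can be specified in rstanarm as follows (a minimal sketch; the data frame dat and its variables are hypothetical, and rstanarm parameterizes the Student-t prior by degrees of freedom, location, and scale):

    # Bayesian logistic regression via HMC with weakly informative priors
    library(rstanarm)
    fit_normal  <- stan_glm(y ~ x1 + x2, data = dat,
                            family = binomial(link = "logit"),
                            prior = normal(0, 2.5))
    fit_cauchy  <- stan_glm(y ~ x1 + x2, data = dat,
                            family = binomial(link = "logit"),
                            prior = cauchy(0, 2.5))
    fit_student <- stan_glm(y ~ x1 + x2, data = dat,
                            family = binomial(link = "logit"),
                            prior = student_t(df = 7, 0, 2.5))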
Mansournia, Geroldinger, Greenland, and Heinze [20] utilize Firth's method [7], Ridge logistic regression [31], Lasso logistic regression [17], [30], and Bayesian estimation using weakly informative priors. The difference between the current study and the study of Mansournia et al. [20] is threefold. First, Mansournia et al. [20] utilize Bayesian estimation using only Cauchy(0, 2.5) and Log-F(1, 1) priors. As will be shown later in this paper, Bayesian estimation using these priors does not necessarily perform well in logistic regression under separation. Thus, the current study also uses Bayesian estimation via Normal(0, 2.5), Student-t(0, 2.5, df = 7), and Log-F(2, 2) priors. Second, Mansournia et al. [20] do not perform a simulation study to inspect the performance of the methods used in their study, while the current study compares the performance of both frequentist and Bayesian estimation methods with respect to bias, precision, and accuracy measures. Third, Mansournia et al. [20] investigate the frequentist Ridge and Lasso logistic regressions to cope with separation. Researchers often need to determine the value of a penalizing parameter ($\lambda \geq 0$; also called the tuning or shrinkage parameter, applied to all regression coefficients besides the intercept) using, for example, cross-validation in order to employ these techniques. However, obtaining the tuning parameter is often a complicated and cumbersome task in logistic regression under separation. In many cases where the data suffer from separation, the tuning parameter is estimated as very close to zero, which means that the penalized estimates are very close to the usual MLEs. To remedy this, the current study does not inspect the usual Ridge and Lasso logistic regressions, but instead utilizes their Bayesian counterparts, that is, Bayesian Ridge and Bayesian Lasso logistic regressions. Note that the tuning parameter is set to 1 in Bayesian Ridge logistic regression and is drawn from an Exp(1) distribution in Bayesian Lasso logistic regression for each regression coefficient in the model (see [19, p. 7]).
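Bayesian Ridge and Lasso logistic regressions can be fitted, for example, with the bayesreg package (a minimal sketch under the same hypothetical data; the argument names follow the package but should be checked against its documentation):

    # Bayesian Ridge and Lasso logistic regression via MCMC
    library(bayesreg)
    dat$y <- factor(dat$y)               # outcome may need to be a two-level factor
    fit_ridge <- bayesreg(y ~ x1 + x2, data = dat,
                          model = "logistic", prior = "ridge")
    fit_lasso <- bayesreg(y ~ x1 + x2, data = dat,
                          model = "logistic", prior = "lasso")
    summary(fit_ridge)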

Simulation
Steps. In this section, the performance of the methods in estimating the model parameters is compared for data sets containing separation in the context of logistic regression. The model used in the simulation is:

$$f(\mu_i) = \beta_0 + \beta_1 I_i + \beta_2 x_{i1} + \beta_3 x_{i2},$$

where $f(\mu_i)$ is the logit link function in (2), $\beta_0$ is the intercept, $\beta_1$ is the coefficient of a dummy variable $I_i$, and $\beta_2$ and $\beta_3$ are the coefficients of two continuous variables $x_{i1}$ and $x_{i2}$, respectively, for $i = 1, 2, \ldots, N$. The simulation comprises the following steps:

(1) Set the entries in the vector of model parameters, $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3)^T$, equal to 1.

(10) Calculate the values of the bias, precision, and accuracy measures for each method using the estimates obtained for these samples.
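As a rough illustration of the data-generating side of these steps (a minimal sketch under assumed settings, not the paper's exact design), a simulated data set with a dummy and two continuous predictors can be generated and screened for separation as follows:

    # Generate one simulated data set under beta = (1, 1, 1, 1)
    set.seed(1)
    N <- 50
    I_dummy <- rbinom(N, 1, 0.5)               # dummy predictor (assumed p = 0.5)
    x1 <- rnorm(N); x2 <- rnorm(N)             # continuous predictors (assumed N(0, 1))
    eta <- 1 + 1 * I_dummy + 1 * x1 + 1 * x2   # linear predictor
    y <- rbinom(N, 1, plogis(eta))             # Bernoulli outcome
    dat <- data.frame(y, I_dummy, x1, x2)
    # Flag quasi-separation, e.g., an empty cell in the dummy-by-outcome table
    any(table(dat$I_dummy, dat$y) == 0)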

Bias, precision, and accuracy measures for evaluating performance. The performance of the methods is compared using the measures of bias, precision, and accuracy given in Walther and Moore [32]. These measures are defined as:

$$\mathrm{Bias}_p = \frac{1}{S} \sum_{s=1}^{S} (\hat\beta_{sp} - \beta_p), \quad \mathrm{Precision}_p = \frac{1}{S} \sum_{s=1}^{S} (\hat\beta_{sp} - \bar\beta_p)^2, \quad \mathrm{Accuracy}_p = \frac{1}{S} \sum_{s=1}^{S} (\hat\beta_{sp} - \beta_p)^2,$$

where $\bar\beta_p = \frac{1}{S} \sum_{s=1}^{S} \hat\beta_{sp}$ and $\beta_p = 1$ for $s = 1, 2, \ldots, 1000$ and $p = 0, 1, 2, 3$. The $\mathrm{Bias}_p$ is the mean of the differences between parameter $\beta_p$ and its estimates across the $S = 1000$ samples. Similarly, $\mathrm{Precision}_p$ is the mean of the squared differences between an estimate and its expected value (i.e., $\bar\beta_p$) across the $S = 1000$ samples, calculated for each parameter separately. The measure of accuracy for the $p$th parameter, $\mathrm{Accuracy}_p$, is the mean of the squared differences between parameter $\beta_p$ and its estimates across the $S = 1000$ samples, which combines $\mathrm{Bias}_p$ and $\mathrm{Precision}_p$. Note that the term "bias" is directly related, and the terms "precision" and "accuracy" are inversely related, to their corresponding equations in (7). That is, a small value of $\mathrm{Bias}_p$ means a low bias, while small values of $\mathrm{Precision}_p$ and $\mathrm{Accuracy}_p$ imply high precision and accuracy when estimating the model parameters.
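A small helper function (an illustrative sketch consistent with the definitions above) computes these measures from an S-by-P matrix of estimates:

    # est: S x P matrix of estimates; truth: length-P vector of true values
    performance_measures <- function(est, truth) {
      est_bar   <- colMeans(est)                        # mean estimate per parameter
      bias      <- colMeans(sweep(est, 2, truth))       # mean of (estimate - truth)
      precision <- colMeans(sweep(est, 2, est_bar)^2)   # spread around the mean estimate
      accuracy  <- colMeans(sweep(est, 2, truth)^2)     # mean squared difference from truth
      mse       <- rowMeans(sweep(est, 2, truth)^2)     # per-sample MSE (defined next)
      list(bias = bias, precision = precision,
           accuracy = accuracy, mse = mse)
    }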
Another accuracy measure that can be used to investigate the performance of the methods is the mean squared error (MSE), representing the estimation error for each sample in the simulation. The MSE is the total mean squared error between all parameters and their estimates:

$$\mathrm{MSE}_s = \frac{1}{P} \sum_{p=0}^{P-1} (\hat\beta_{sp} - \beta_p)^2,$$

where $P = 4$ is the number of parameters in the model.

Table 1 displays the $\mathrm{Bias}_p$, $\mathrm{Precision}_p$, and $\mathrm{Accuracy}_p$ values obtained from 1000 simulated data sets, each of which contains the separation problem. The table shows that the estimate of parameter $\beta_1$ often has a higher bias and a lower precision and accuracy than those of parameters $\beta_2$ and $\beta_3$, since dummy variables are more prone to suffer from separation than continuous variables. For the same reason, although increasing the sample size increases the precision when estimating each parameter, it reduces the bias and improves the accuracy only for parameters $\beta_0$, $\beta_2$, and $\beta_3$, but not for parameter $\beta_1$. Firth's penalized maximum likelihood estimation and Bayesian estimation using the Log-F(2, 2) prior provide smaller biases and higher precision and accuracy than the other estimation methods. Similarly, these methods have smaller MSE values (higher overall accuracy) than the other methods (see Table 2). Moreover, both tables show that Bayesian estimation may not perform well with the Ridge prior, since the corresponding estimates may have spuriously high $\mathrm{Precision}_p$ and $\mathrm{Accuracy}_p$ values (indicating low precision and accuracy for these estimates). However, the values in these tables are point estimates, and thus a set of graphical visualizations is designed to facilitate the interpretation of the simulation results.

[Figure: Boxplots used to interpret the MSE values.]

The boxplots show that Bayesian estimation using the Cauchy(0, 2.5) and Lasso priors yields higher standard errors than the other methods under investigation, and that Bayesian estimation using the Log-F(2, 2) prior involves a smaller amount of bias and higher precision in estimating the model parameters than the other methods. Note that the figures in the paper do not show the results for Bayesian estimation using the Ridge prior, since this method produces spuriously high parameter estimates and standard errors. Figures 4 and 5 show the squared differences and the sums of squared differences between the estimates and the parameters for varying sample sizes, which are utilized to obtain the $\mathrm{Accuracy}_p$ and MSE values, respectively. Increasing the sample size improves the accuracy for each parameter, and thus the total accuracy, for each method. The estimates obtained using Bayesian estimation with the Log-F(2, 2) prior often have higher (total) accuracy, and thus lower $\mathrm{Accuracy}_p$ and MSE values, than the other methods. Since nonparametric bootstrapping assumes an original sample that adequately represents the population of interest, the performance of Firth's method and of Firth's method with nonparametric bootstrapping resemble each other more closely for large sample sizes (e.g., when N = 100). The Bayesian approach with the weakly informative Normal(0, 2.5) prior performs better than those with the Student-t(0, 2.5, df = 7) or Log-F(1, 1) priors, which in turn perform better than that with the Cauchy(0, 2.5) prior. This result is in line with the suggestions made in Ghosh et al. [11], who state that the Cauchy(0, 2.5) prior provides deficient information, and thus the Normal(0, 2.5) and Student-t(0, 2.5, df = 7) priors should be used instead when dealing with separation in logistic regression.

An example: Endometrial cancer data
A study in Heinze and Schemper [14] is used to illustrate how to analyze data under separation in logistic regression. In the study, the dichotomous outcome histology grade (HG: 0 = grade 0-II, 1 = grade III-IV) represents the histology of the endometrium for endometrial cancer patients (N = 79), modeled by commonly accepted risk factors. This outcome is predicted by the categorical variable neovasculization (NV: 0 = absent, 1 = present) and two continuous variables, the pulsatility index of the arteria uterina (PI) and the endometrium height (EH). The logistic regression model used to analyze the endometrial cancer data is:

$$f(\mu_i) = \beta_0 + \beta_1 \mathrm{NV}_i + \beta_2 \mathrm{PI}_i + \beta_3 \mathrm{EH}_i,$$

where $f(\cdot)$ is the logit link function, $\beta_0$ is the intercept, and $\beta_1$, $\beta_2$, and $\beta_3$ are the regression coefficients of the variables NV, PI, and EH, respectively, for $i = 1, 2, \ldots, 79$.
Since there is no observation in the endometrial cancer data with NV = 1 and HG = 0, the data suffer from quasi-complete separation, which has a detrimental effect on the estimate of parameter $\beta_1$ and its standard error when the estimation is performed using the usual method of maximum likelihood. Therefore, Firth's method, Firth's method with nonparametric bootstrapping, and the Bayesian approach using Normal(0, 2.5), Cauchy(0, 2.5), Student-t(0, 2.5, df = 7), Log-F(1, 1), Log-F(2, 2), Ridge, and Lasso priors are used to obtain the parameter estimates and their standard errors (see Table 3). The estimates of parameters $\beta_2$ and $\beta_3$ are reasonably close to each other across the methods, while the estimates of parameters $\beta_0$ and $\beta_1$ may differ across the methods. Figure 6 shows that the predicted probabilities of the outcome histology for some of the observations in the data are exactly equal to 1 (in the upper right corner of the plot) when using the method of maximum likelihood for estimation, which is a sign of the quasi-complete separation problem. The Bayesian approach using the MCMC algorithm with the Cauchy(0, 2.5) prior does not provide a convincing solution to the separation in the endometrial cancer data, since some of the predicted probabilities of the outcome are (almost) equal to 1. The plots for the other methods more closely resemble the regular logistic regression plot, in which the predicted probabilities lie strictly between 0 and 1.
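For readers who wish to reproduce this kind of analysis, the endometrial data set is distributed with, for example, the brglm2 R package (an assumption worth verifying; the variable names there are HG, NV, PI, and EH). A minimal sketch of two of the fits is:

    # Endometrial cancer data: Firth's method and one Bayesian fit
    data("endometrial", package = "brglm2")    # assumed data location
    library(logistf); library(rstanarm)
    fit_firth <- logistf(HG ~ NV + PI + EH, data = endometrial)
    fit_bayes <- stan_glm(HG ~ NV + PI + EH, data = endometrial,
                          family = binomial(link = "logit"),
                          prior = normal(0, 2.5))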
Here, several diagnostics are introduced to inspect whether the MCMC algorithm produces adequate posterior samples for the parameters when using the weakly informative Normal(0, 2.5), Cauchy(0, 2.5), and Student-t(0, 2.5, df = 7) priors. The potential scale reduction factor ($\hat{R}$) and effective sample size (ESS) statistics for each parameter are used to determine whether the MCMC algorithm converges properly with high estimation accuracy. These statistics are obtained by inspecting multiple chains and the dissimilarities between them (the default number of chains is often 4). The $\hat{R}$ statistic shows whether the chains converge to the same area by comparing the within-chain and between-chain variability [25]. Table 4 displays the values of the $\hat{R}$ and ESS statistics obtained for each parameter when the MCMC algorithm is used with the Normal(0, 2.5), Cauchy(0, 2.5), and Student-t(0, 2.5, df = 7) priors, respectively. The use of the MCMC algorithm with the Normal(0, 2.5) and Student-t(0, 2.5, df = 7) priors results in good convergence of the chains (i.e., $\hat{R} = 1$ for each parameter) with low autocorrelation, and consequently high estimation accuracy (i.e., ESS > 1000 for each parameter). Although the MCMC algorithm with the Cauchy(0, 2.5) prior produces good convergence of the chains for each parameter, there is high autocorrelation and low estimation accuracy within the parameter samples, especially for the relationship between the outcome and the dichotomous predictor NV (i.e., ESS = 103 for parameter $\beta_1$). Thus, the focus from now on will be particularly on parameter $\beta_1$, to visually inspect the difference between the MCMC algorithm with the weakly informative Normal(0, 2.5), Cauchy(0, 2.5), and Student-t(0, 2.5, df = 7) priors.

[Figure 6. The values of the linear predictor against the predicted probabilities.]

Figure 7 shows the histograms of the marginal posterior distribution, the trace plots (chains shown separately), the autocorrelation plots (chains combined), and the log posterior for parameter $\beta_1$ under the three priors, respectively. A marginal posterior distribution is obtained for one single parameter by integrating out the other parameters in the model. The histograms show that the marginal posterior distribution of parameter $\beta_1$ is normal when using the normal prior and close to normal when using the Student-t prior with df = 7 degrees of freedom, for which the mean (solid line) and the median nearly coincide. The marginal posterior of parameter $\beta_1$ under the Cauchy prior has a right-skewed distribution (i.e., the mean lies to the right of the median). By default, the MCMC algorithm in rstanarm draws 2000 posterior samples of parameter $\beta_1$ for each chain (i.e., 8000 samples in total), half of which are used in a warm-up phase and discarded before diagnostics and inference. Thus, each of the four trace plots under the three priors is created using 1000 posterior samples of parameter $\beta_1$. Based on these plots, the chains display adequate mixing under the Normal(0, 2.5) and Student-t(0, 2.5, df = 7) priors, but they exhibit consecutive excursions in the positive direction under the Cauchy(0, 2.5) prior. In the autocorrelation plots, independently of the prior distribution of parameter $\beta_1$, the autocorrelation at lag zero is one, since it represents the correlation of the chain with itself. The height of the spikes quickly drops to zero (and fluctuates around zero afterwards) with increasing lags under the Normal(0, 2.5) and Student-t(0, 2.5, df = 7) priors, which is a sign against autocorrelation.
However, when using the Cauchy(0, 2.5) prior for parameter $\beta_1$, the decrease in the height of the spikes is relatively slow (and the autocorrelation does not fluctuate closely around zero) compared to the Normal(0, 2.5) and Student-t(0, 2.5, df = 7) priors, which is a sign of positive autocorrelation.
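These diagnostics can be reproduced for an rstanarm fit with the bayesplot package (a minimal sketch; fit_cauchy is a hypothetical fit of the endometrial model with the Cauchy(0, 2.5) prior, as in the earlier snippets):

    # Convergence and mixing diagnostics for parameter beta_1 (variable NV)
    library(bayesplot)
    posterior <- as.array(fit_cauchy)    # iterations x chains x parameters
    mcmc_hist(posterior, pars = "NV")    # marginal posterior histogram
    mcmc_trace(posterior, pars = "NV")   # trace plot, chains separate
    mcmc_acf(posterior, pars = "NV")     # autocorrelation plot
    summary(fit_cauchy)                  # includes Rhat and n_eff (ESS)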
The marginal posterior distribution of $\beta_1$ is highly curved when using the MCMC algorithm with the Cauchy(0, 2.5) prior. This causes many divergent transitions in the MCMC algorithm, shown by the red points in the log posterior scatter plot. This is evidence that the step size in the MCMC algorithm is too large under the Cauchy(0, 2.5) prior, in which case the results of the MCMC algorithm should not be trusted. The MCMC algorithm needs a smaller step size to avoid divergent transitions and to draw plausible samples from the marginal posterior distribution of $\beta_1$, which can easily be arranged by increasing the default value of the adapt_delta parameter in rstanarm (e.g., from 0.95 to 0.99). Table 5 shows the estimates of the parameters and their standard errors, and the values of the $\hat{R}$ and ESS statistics, when using the Cauchy(0, 2.5) prior with divergent (adapt_delta = 0.95) and non-divergent (adapt_delta = 0.99) transitions, respectively. Based on this table, decreasing the step size in the MCMC algorithm by increasing adapt_delta from 0.95 to 0.99 does not have much influence on the parameter estimates and their standard errors. Moreover, increasing adapt_delta results in non-convergence (i.e., $\hat{R} = 1.1$ for parameter $\beta_1$) and a decrease in estimation accuracy (i.e., ESS is only 35 for parameter $\beta_1$). Therefore, it is not recommended to use this prior to overcome separation in the endometrial cancer data.
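For reference, the non-divergent refit compared in Table 5 corresponds to a one-argument change in rstanarm (a sketch using the hypothetical endometrial fit from above):

    # Decrease the step size by raising the target acceptance probability
    fit_cauchy_99 <- stan_glm(HG ~ NV + PI + EH, data = endometrial,
                              family = binomial(link = "logit"),
                              prior = cauchy(0, 2.5), adapt_delta = 0.99)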

Discussion
Researchers in the social sciences commonly use simple data manipulation techniques to overcome separation in logistic regression. These solutions are often unsatisfactory and do not meet researchers' expectations. Thus, many researchers have been paying attention to more convenient approaches for estimation, such as asymptotic and bootstrap-based bias reduction methods and Bayesian methods using weakly informative priors. However, the performance of these methods has not yet been fully investigated with respect to bias, precision, and accuracy measures in the context of logistic regression.
In the simulation, three methods were used to obtain the estimates of the model parameters and their standard errors: Firth's penalized maximum likelihood estimation, Firth's method with nonparametric bootstrapping, and the Bayesian approach with seven different priors. In a concrete real-life example, parameter estimation was performed using these three methods on the endometrial cancer data. The supplementary material contains the relevant R code for obtaining the estimates of the model parameters and their standard errors for each estimation method presented in this paper. The results of the simulation study and of the analysis of the endometrial cancer data have shown that, although most of the methods cope well with the consequences of the separation problem in logistic regression, Bayesian estimation with the Log-F(2, 2) prior performs better than the other methods.
The choice of prior distribution in the Bayesian approach plays an essential role in overcoming separation in logistic regression. It was shown, both in the simulation and in the real-life example, that the Bayesian approach with the Cauchy(0, 2.5) or Ridge prior does not provide a reliable solution to separation in logistic regression, since these priors incorporate detrimental information into the analysis. A more coherent weakly informative prior, such as the Normal(0, 2.5), Student-t(0, 2.5, df = 7), Log-F(1, 1), Log-F(2, 2), or Lasso prior, should be utilized in place of the Cauchy(0, 2.5) prior when dealing with separation in the data.