29 June 2023

Mastering Logistic Regression: Interpretation and Fitting

A logistic function would make an amazing shape for a sledding hill. I think the optimal slope parameter would be -1.5.

Introduction

At its core, a logistic regression is nothing more than fitting a special function, one that takes values between 0 and 1, to data. The most common use cases are when the response of interest is a dichotomous variable, meaning that the variable we’re modeling can take exactly one of two values. For example, a coin flip can be modeled as a dichotomous variable, where heads is 1 and tails is 0. Whether someone had breakfast or not today is also dichotomous. Sometimes, dichotomous variables are created from continuous variables. For example, high blood pressure can be modeled as dichotomous by creating a cutoff value for what is considered high blood pressure (\geq 130 mm Hg systolic). The “regression” part of the name comes in when you want to associate the response with some explanatory variables. For example, if we want to know the relationship between the number of hours you study (explanatory) and passing/failing an exam (response), we can use a logistic regression to model the strength of that relationship.

Logistic regression has found use across many disciplines, including classification in machine learning, ecological models of constrained growth, and estimation in retrospective studies, so it’s worth understanding it from first statistical principles in order to fully appreciate both the utility and the limitations of the method.

The Logistic Function

The magic of this method starts with the shape of the function itself, with a specific parameterization1 for regression purposes. Since this function is the core of the method, we’ll spend some time developing some intuition for how this function behaves.

p(x) = \frac{1}{1 + e^{-(\beta_0 + x\beta_1)}}

There are a few things I'd like you to note about this function (a small numerical check follows the list).

  • The range of this function is bounded between 0 and 1, while x can take on any value on the real line in the domain. This also means that \beta_0, \beta_1 are free to take on any value.
  • When we set \beta_0 = 0 and vary \beta_1, the p(x)-intercept always stays fixed at 0.5. In fact, \beta_0 by itself determines the p(x)-intercept, which is why we call \beta_0 the intercept parameter.
  • Similarly, fixing \beta_1 = 0 and varying \beta_0 gives a completely flat line that shifts up and down, implying x is not useful information. Notice also that we get diminishing movement for values further from 0: the intercept shifts more between \beta_0 = 0 \rightarrow 1 than from \beta_0 = 1 \rightarrow 2, etc.
  • The function has a symmetry about the inflection point, meaning the rate the function increases starting from the bottom is the same as the rate it decreases starting from the top. This implies the inflection point always occurs at p(x) = 0.5. A step-by-step proof of this can be found on Socratic Q&A.
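Here is a minimal sketch in Python that checks these properties numerically; the specific values of \beta_0 and \beta_1 are made up purely for illustration.

```python
import numpy as np

def logistic(x, b0, b1):
    """p(x) = 1 / (1 + exp(-(b0 + x*b1)))."""
    return 1.0 / (1.0 + np.exp(-(b0 + x * b1)))

# with b0 = 0 the curve crosses p(0) = 0.5 no matter what b1 is
print(logistic(0.0, b0=0.0, b1=-1.5), logistic(0.0, b0=0.0, b1=3.0))      # 0.5 0.5

# diminishing movement of the p(x)-intercept as b0 moves away from 0
print([round(logistic(0.0, b0=b, b1=1.0), 3) for b in (0, 1, 2, 3)])      # [0.5, 0.731, 0.881, 0.953]

# b1 = 0 gives a flat line: x carries no information
print(logistic(-10.0, b0=1.0, b1=0.0) == logistic(10.0, b0=1.0, b1=0.0))  # True
```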

Now with Probability!

Having a function that maps values between 0 and 1 sets the stage perfectly for introducing probability, which will also take values between 0 and 1! To make this into a concrete example, suppose I’m interested in modeling how many fruit snacks one can eat (explanatory variable, x) before getting a stomach ache (response variable, y). It’s reasonable to think about the probability of such an event since I could probably have 100 fruit snacks one day and get a stomach ache and do the same another day and not get a stomach ache. It is a little ambitious to think that there are no other variables that might affect this probability, and that the entire population will have the same tolerance for fruit snacks. For example, if kids can tolerate more fruit snacks than adults, we’d likely need to model more covariates, but we’ll just keep it simple for now.

Introducing the mathematical description, we’re modeling a single data point as a dichotomous (Bernoulli) random variable Y with parameter p. The value of Y is either Y = 1 for stomach ache or Y = 0 for no stomach ache.2 Here p is the probability that we observe a stomach ache. In this case, p will depend on the number of fruit snacks someone eats, so we should instead write p(x), where x is the number of fruit snacks. The probabilities are related to x by the logistic function p(x) defined above. If we’re making multiple observations Y_i given some values x_i, we’ll need a subscript i that indexes the observations and write p(x_i), since the probability is different for each observation. Finally, the coefficients for the logistic regression are \beta_0 and \beta_1, which (for now) we assume are given to us. We can include the dependence explicitly in p(x_i; \beta_0, \beta_1), using a “;” to separate our observations from our parameters. In symbols, this is all summarized by the statement

Y_i \mid x_i \sim \text{Bernoulli}(p(x_i; \beta_0, \beta_1))

If you’re curious, I read this as “Y sub i conditional on x sub i is distributed as a Bernoulli random variable with parameter p of x sub i given beta 0 and beta 1”. Often we’ll just remember that p depends on our parameters and data, so we can shorten the notation to p_i = p(x_i; \beta_0, \beta_1). I’ll shorten this further to Y_i \sim \text{Bernoulli}(p_i), since we’re thinking of the x_i as fixed values as well, sacrificing some precision for clarity.
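To make the data model concrete, here is a minimal simulation sketch. The coefficient values (\beta_0 = -4, \beta_1 = 0.08) and the range of fruit snack counts are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = -4.0, 0.08                        # hypothetical "true" parameters
x = rng.integers(0, 120, size=200)               # fruit snacks eaten, x_i
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))   # p_i = p(x_i; beta0, beta1)
y = rng.binomial(1, p)                           # Y_i | x_i ~ Bernoulli(p_i)
```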

This is our data model. We have now attached a probabilistic nature to the data events that we observe. Our data model provides a description of the probability of observing each event, known as a probability mass function (pmf). For the Bernoulli distribution, \text{Bernoulli}(p_i), the pmf is given by:

\begin{aligned} Pr(Y_i = y_i ; p_i) &= p_i^{y_i}(1 - p_i)^{1 - y_i} \\ &= \begin{cases} p_i & \text{if } y_i = 1 \\ 1-p_i & \text{if } y_i = 0 \end{cases} \end{aligned}

This is a notationally heavy way to say that the probability of getting a stomach ache (Y_i = 1) is p_i, and the probability of not getting one is 1 minus that. The precision of such a statement takes some getting used to, but trace the function through to make sure you can map the English back to the mathematical statement (a tiny numeric version of the pmf follows the notes below).

  • Y_i = y_i is a common point of confusion. Just think of Y_i as before we observe the data, so we can talk about the probability of such a random variable. y_i is the actual value and has no probability associated with it; you either had the stomach ache or not. For interpretation, you can always think about “plugging in” the value of the lower case variable, and we can read Pr(Y_i = 1) as “the probability of getting a stomach ache”.
  • Pr(Y_i = y_i ; p_i) again separates data from parameters, and sometimes the parameters of the pmf are left off. We need to be given our parameter p_i (which in turn is a function of the number of fruit snacks and \beta_0, \beta_1) in order to state the probability of observing an event Y_i.
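As a tiny sanity check of the notation, here is the pmf as a short Python function, evaluated at a made-up value of p_i = 0.3.

```python
def bernoulli_pmf(y, p):
    """Pr(Y = y; p) for y in {0, 1}."""
    return p**y * (1 - p)**(1 - y)

print(bernoulli_pmf(1, 0.3))  # 0.3, the probability of a stomach ache
print(bernoulli_pmf(0, 0.3))  # 0.7, the probability of no stomach ache
```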

Interpretation of Coefficients

With p(x) understood as the probability of an event, we can work out what \beta_0 and \beta_1 mean for the probability of an event. We need to invert the logistic function to get:

\beta_0 + x\beta_1 = \underbrace{\log\left(\frac{p(x)}{1 - p(x)}\right)}_{\text{log odds}}
Steps
\begin{aligned} p(x) &= \frac{1}{1 + \exp(-(\beta_0 + x\beta_1))} \\ \frac{1}{p(x)} &= 1 + \exp(-(\beta_0 + x\beta_1)) \\ \frac{1}{p(x)} - 1 &= \exp(-(\beta_0 + x\beta_1)) \\ \log\left(\frac{1 - p(x)}{p(x)}\right) &= -(\beta_0 + x\beta_1) \\ \log\left(\frac{p(x)}{1 - p(x)}\right) &= \beta_0 + x\beta_1 \end{aligned}

The term appearing on the right hand side has a name: we call it the log odds. The term “odds” is used the same way you would colloquially, as in, if you had a 1/3 chance of winning a game, your odds are 1:2. If you plug p(x) = 1/3 into the formula above, the log odds on the right hand side is \log(\frac{1}{2}). Thus, a 1 unit increase in \beta_0 implies a 1 unit increase in the log odds of the event happening. Similarly, a 1 unit increase in x implies a \beta_1 unit increase on the log odds scale. If you’re a “show me the math” type of person, suppose we have two observations: someone eating 55 fruit snacks and someone else eating 56, x_1 = 55, x_2 = 56 (a one unit difference in x). If we take the difference of these two scenarios,

\begin{aligned} \beta_0 + x_2\beta_1 &= \log\left(\frac{p(x_2)}{1 - p(x_2)}\right) \\ -\qquad \beta_0 + x_1\beta_1 &= \log\left(\frac{p(x_1)}{1 - p(x_1)}\right) \\ \hline \\ (x_2 - x_1)\beta_1 &= \text{log odds}(x_2) - \text{log odds}(x_1) \\ \beta_1 &= \frac{\text{log odds}(x_2) - \text{log odds}(x_1)}{x_2 - x_1} \end{aligned}

We say that \beta_1 is the slope on the log odds scale, and \beta_0 is the intercept, i.e. the log odds when x = 0. In our scenario of x_1 = 55, x_2 = 56, a one fruit snack difference (so the denominator is one), \beta_1 is directly interpreted as the difference in log odds.
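If you’d rather check this numerically than algebraically, here is a short sketch, reusing the made-up coefficients from the simulation above, confirming that a one snack increase moves the log odds by exactly \beta_1.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def logistic(x, b0, b1):
    return 1.0 / (1.0 + np.exp(-(b0 + x * b1)))

b0, b1 = -4.0, 0.08                 # same hypothetical values as before
x1, x2 = 55, 56
diff = logit(logistic(x2, b0, b1)) - logit(logistic(x1, b0, b1))
print(np.isclose(diff, b1))         # True: one more snack adds b1 to the log odds
```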

To be frank, interpreting the coefficients on a log odds scale is quite futile: as the logistic function above shows, the effect of a change on the log odds scale depends on where you are on that scale, with larger changes in probability when you are close to 0. About the most I take from it is that if the log odds are positive, the probability is greater than one half.

Even exponentiating both sides to get something on the “odds” scale is quite dubious. I suspect the reason odds are used so often in the gambling industry is that they prey on the fact that odds are fraught with misinterpretations. Seeing odds of 1:2, you’ll likely jump to thinking about a probability of 1/2, yet there’s quite a large difference between winning probabilities of 1/3 and 1/2. Conversely, when talking about “payouts”, for example on a craps table, you’ll commonly see that an Any 7 bet pays “5 for 1”, whose true meaning is a 4:1 payout: you get $4 for every $1 you put down. When it comes to taking your money, the casino will make it seem less likely you’ll lose money (odds), but for paying you out, they’ll make it seem that you get more money (proportion)! Mixing these usages is intentionally confusing, and that’s before we even discuss ratios of odds, which is what \exp(\beta_1) would represent, and which are still difficult to interpret. Avoid odds if you can.3

My recommendation is to use the function to translate things onto the probability scale; you’ll likely have a more intuitive understanding of the values (better yet, plot the entire probability curve as a function of your variable). However, one of my professors pointed me toward a Car Talk riddle about potatoes that highlights some misconceptions about the probability scale as well. Be careful with interpretation, it’s very easy to get tripped up!
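In practice that translation is a single line of code. Here is a small sketch with a hypothetical helper, prob_from_log_odds, which makes the 1:2 odds versus 1/2 probability confusion obvious.

```python
import numpy as np

def prob_from_log_odds(log_odds):
    """Invert the log odds back onto the probability scale."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# odds of 1:2 correspond to a probability of 1/3, not 1/2
print(prob_from_log_odds(np.log(1 / 2)))   # ~0.333
```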

Fitting the Model

The fitted logistic regression model is shown in the equation above. As you drag the dots around, do you agree with how the function is fitting? Is there anything surprising about the fit? Take note of situations where the function looks linear. Also note that something embarrassing happens when you completely separate the data points, which we’ll discuss later.

Now that we know the general form and interpretation of the logistic function, we can take our data and find the best fit model. Our observations take the form of either 1 or 0, so the drag points are constrained. A major question is how we decide what “best fit” means. There are various criteria for determining that, but most center around the idea of likelihood.

The Likelihood Principle

Likelihood is a frequently encountered principle for estimating parameters given some data. It is different from the concept of “probability”, even though we commonly use those words interchangeably. In fact, the fundamental idea of likelihood is that it’s kind of the “opposite” of probability. If we think about our data model as the statistical world, and our observations as the real world, the probability function is how we move from our statistical model to the observations. In turn, the likelihood function uses what we observe in the real world to inform us of better parameters for our data model in the statistical world. We feed parameters into our probability function, and we feed data into our likelihood function.

\begin{CD} \text{Data Model} @>\text{parameters}>> \text{Probability Function} \\ @A\text{estimate}AA @VV\text{generate}V \\ \text{Likelihood Function} @<\text{data}<< \text{Observations} \end{CD}

The probability function allows us to think how observations in the real world come from our data model in the statistical world. The likelihood function allows us to move in the other direction and think about “likely” data models given what we have observed in the real world.

Let’s skip thinking about the regression p(x_i) for now, and just think about observations from a data model with a single parameter p. The definition of likelihood for a single observation from a Bernoulli distribution is:

\mathcal{L}(p; y) = Pr(Y = y; p)

For multiple independent observations, we multiply the probabilities,

\mathcal{L}(p; y_1, y_2, \dots, y_n) = \prod_{i=1}^n Pr(Y_i = y_i; p)

What?! Likelihood is defined by the probability function?! So does this mean likelihood can be interpreted as the probability that p is a particular value? Unfortunately no, and I get why this is confusing at first. In frequentist statistics, parameters are thought of as fixed values of the population, i.e. there is no randomness associated with parameters. Hence, it does not make sense to discuss the probability that p takes a certain value.4 If you pause and think about what the definition says though, it will make sense, because there must be a correspondence between the data model and our observations. It’s a two way road; it’s just a matter of perspective which direction you’re currently moving in.

Substituting in our pmf for a Bernoulli distribution above, we get

\mathcal{L}(p; y_1, y_2, \dots, y_n) = p^{\left(\sum_i^n y_i\right)}(1 - p)^{\left(n - \sum_i^n y_i\right)}

Higher values of \mathcal{L}(p) mean that value of p is more likely to have generated the data we fed in. Relative values of likelihood are also meaningful; in fact, this forms the basis of the likelihood ratio test. To find the most likely value of p given our data, we maximize the function. We call the value of p that maximizes this function the maximum likelihood estimate (MLE) of p. To denote estimates of parameters, you’ll commonly see little hats on the parameter, \widehat p_{MLE}. Since the objective is to maximize this function, we normally take the log of the likelihood, which doesn’t change the location of the maximum since log is a monotonic function. The log likelihood lets us avoid dealing with the exponents, and the expression is,

\ell(p; y_1, y_2, \dots, y_n) = \left(\sum_i^n y_i\right) \log p + \left(n - \sum_i^n y_i\right) \log (1 - p)
  • I color coded the sliders, but prop is more a proxy for controlling the number of observed y_i. The observations are actually calculated as \sum_i^n y_i = \mathrm{round}(n \times \text{prop}). The MLE will bounce around because of that round function, since our observations need to be a whole number.
  • The maximum of the function (MLE) tends to follow the proportion of 1's observed in the data, as expected. If we saw a baseball player hit 20 / 100 pitches, we'd instinctively say they have a 20% chance of hitting pitches.
  • Notice that changing n does not change the location of the maximum (other than rounding), but it does change the curvature of the likelihood around the maximum. The curvature is intimately related to the standard error, so it makes sense that as we observe more data with the same proportion of 1's, we become more certain about our estimate of p. (A short numerical check of these points follows below.)
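Here is a minimal numerical check of these points, evaluating the log likelihood above on a grid of p values for made-up data with 20 ones out of 100 observations.

```python
import numpy as np

def bernoulli_loglik(p, y):
    """Log likelihood of p given 0/1 observations y."""
    s, n = y.sum(), y.size
    return s * np.log(p) + (n - s) * np.log(1 - p)

y = np.array([1] * 20 + [0] * 80)         # e.g. 20 hits out of 100 pitches
grid = np.linspace(0.001, 0.999, 999)
p_hat = grid[np.argmax(bernoulli_loglik(grid, y))]
print(p_hat, y.mean())                    # both ~0.20: the MLE is the sample proportion
```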

In this situation, it’s not difficult to calculate a closed form solution for the maximum likelihood estimate. We take the derivative, set it equal to 0, and solve for p to get:

\widehat p_{MLE} = \frac{\sum_i^n y_i}{n}
proof
\begin{aligned} \frac{\partial \ell(p; y_1, y_2, \dots, y_n)}{\partial p} &= \frac{\sum_i^n y_i}{p} - \frac{n - \sum_i^n y_i}{1 - p} = 0 \\ \frac{\sum_i^n y_i}{p} &= \frac{n - \sum_i^n y_i}{1 - p} \\ \left(\sum_i^n y_i\right)(1 - p) &= \left(n - \sum_i^n y_i\right)p \\ \sum_i^n y_i - p\sum_i^n y_i &= np - p\sum_i^n y_i \\ \sum_i^n y_i &= np \\ \widehat p_{MLE} &= \frac{\sum_i^n y_i}{n} \end{aligned}

There’s a good reason the likelihood principle is so prominent: the likelihood function is very useful for guiding us toward good estimates of the parameters in our model. However, finding an estimate for \beta_0 and \beta_1 is not as straightforward when we try to associate data with our logistic function, and unlike the simplified case, no closed form solution exists.

Likelihood for Regression

The likelihood function for logistic regression is instead a function of the parameters \beta_0, \beta_1. Instead of just the number of 1’s and 0’s we observe in the data \bm y = y_1, y_2, \ldots, y_n, we must also take into consideration the values \bm x = x_1, x_2, \ldots, x_n that we observe. In a sense, the likelihood function carries over to the next stage of modeling, since p_i is itself a function of \beta_0, \beta_1, and x_i. The likelihood is given by substituting the logistic function for p in the Bernoulli likelihood function. If we simplify the expression, we get

\ell(\beta_0, \beta_1; \bm x, \bm y) = -\sum_i^n \log\left(1 + e^{\beta_0 + \beta_1 x_i}\right) + \sum_i^n y_i(\beta_0 + \beta_1 x_i)
proof

For a single x_i, y_i, we have:

\begin{aligned} \ell(\beta_0, \beta_1; x_i, y_i) &= y_i \log p_i + (1 - y_i) \log (1 - p_i) \\ &= y_i \underbrace{\log \left(\frac{p_i}{1 - p_i}\right)}_{\text{log odds}} + \log (1 - p_i) \end{aligned}

We found in the interpretation section that the log odds are the same as the linear predictor, \log \frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_i, and substituting the logistic function for p_i,

\begin{aligned} \ldots &= y_i (\beta_0 + \beta_1 x_i) + \log (1 - p_i) \\ &= y_i (\beta_0 + \beta_1 x_i) + \log\left(1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}\right) \\ &= y_i (\beta_0 + \beta_1 x_i) + \log\left(\frac{\left(1 + e^{-(\beta_0 + \beta_1 x_i)}\right) - 1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}\right) \\ &= y_i (\beta_0 + \beta_1 x_i) + \log\left(\frac{e^{-(\beta_0 + \beta_1 x_i)}}{1 + e^{-(\beta_0 + \beta_1 x_i)}} \cdot \frac{e^{\beta_0 + \beta_1 x_i}}{e^{\beta_0 + \beta_1 x_i}}\right) \\ &= y_i (\beta_0 + \beta_1 x_i) - \log\left(1 + e^{\beta_0 + \beta_1 x_i}\right) \end{aligned}

For multiple observations, we just sum the log-likelihoods, and we get the expression shown.

It’s certainly not a friendly expression, and we are now dealing with two parameters \beta_0, \beta_1 instead of just a single p, so we are optimizing over a surface instead of just a curve, but we can still visualize the likelihood.
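Even without visualizing the surface, we can hand this log likelihood to a general purpose optimizer. Below is a minimal sketch, assuming data simulated as in the earlier example; it minimizes the negative of the simplified expression above and recovers coefficients near the generating values.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, x, y):
    """Negative log likelihood of (beta0, beta1) in the simplified form above."""
    eta = beta[0] + beta[1] * x
    return np.sum(np.log1p(np.exp(eta))) - np.sum(y * eta)

# made-up data, generated as in the simulation sketch earlier
rng = np.random.default_rng(0)
x = rng.integers(0, 120, size=200).astype(float)
p = 1.0 / (1.0 + np.exp(-(-4.0 + 0.08 * x)))
y = rng.binomial(1, p)

fit = minimize(neg_loglik, x0=np.zeros(2), args=(x, y), method="BFGS")
print(fit.x)   # estimates of (beta0, beta1), roughly near (-4, 0.08)
```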

Hauck-Donner Effect

Above we saw that the logistic function “disappears” when we separate the data. Can you rationalize why this happens as you play around with the likelihood plot and the logistic regression function? The maximum likelihood estimate no longer exists (or is not numerically stable) in these situations because the likelihood surface becomes very flat and the coefficients escape off to infinity. We can make the magnitude of \beta_1 arbitrarily large, because the logistic function simply becomes sharper, and there are then many values \beta_0 can take on. This is the problem of complete separation, and it goes hand in hand with the Hauck-Donner effect, where the Wald standard errors blow up along with the coefficients; it happens often in logistic regression, especially in higher dimensional spaces. It’s not the only way this can happen, but it’s important to keep in mind as you diagnose possible issues with your model. One sign is that the standard errors of your \beta coefficients will be very large.

[Figure: logistic curves p(x) against x, bounded between 0 and 1.]

When \beta_1 \rightarrow \infty, the logistic function starts to look like a vertical line separating the points, while \beta_0 still controls the location of the separation. The reason the likelihood fails to give estimates for perfectly separated data is that it can’t choose between any of these possible values.
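A quick way to see this numerically: for a perfectly separated toy data set, the log likelihood keeps creeping toward 0 as \beta_1 grows (with \beta_0 rescaled so the jump stays between the two groups), so no finite maximizer exists. A minimal sketch with made-up data:

```python
import numpy as np

def loglik(b0, b1, x, y):
    eta = b0 + b1 * x
    return np.sum(y * eta) - np.sum(np.log1p(np.exp(eta)))

# perfectly separated toy data: every y = 0 has a smaller x than every y = 1
x = np.array([1., 2., 3., 4., 6., 7., 8., 9.])
y = np.array([0,  0,  0,  0,  1,  1,  1,  1 ])

# keep the jump centered at x = 5 by setting b0 = -5 * b1
for b1 in (1, 2, 4, 8):
    print(b1, loglik(-5 * b1, b1, x, y))   # increases toward 0, never reaching a maximum
```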

There are various ways of dealing with this issue, many of them philosophical in nature:

  • If you’re using logistic regression for classification, you can choose any of the values and get the same classification. You can also consider using a simpler model, like a threshold on x.
  • Change your model by including different variables or excluding some. It’s possible that with this separation, you were asking an ill-posed question in the first place, and a different formulation should be considered.
  • You can also adjust the likelihood function with a penalty for large coefficient values, commonly a ridge or lasso penalty. This is the approach of Python’s scikit-learn implementation.
  • Change to a Bayesian framework: if we add a prior distribution on the coefficients, we can get a well defined posterior distribution. The bayesglm function in the R package arm uses this approach.
  • Other solutions are listed on this Stack Exchange post.

Optimization Topics

Given that this is a relatively standard nonlinear (convex) optimization problem, you have a choice of many methods that will give you the same optimal answer. I won’t get too much into the nitty-gritty, as chances are this will mostly be done for you by a computing program. Probably the most common implementations are:

  • Gradient Descent — A general purpose descent algorithm that is normally used for simplicity. It’s not as efficient as other methods.
  • Newton-Raphson — A faster algorithm that uses the second derivative of the likelihood function to find the optimal solution. However, looking at the surface of the likelihood function, a quadratic approximation is not always reliable, especially as the likelihood function becomes flatter.
  • Iteratively Reweighted Least Squares — Used for generalized linear models because each iteration solves a weighted least squares problem, which also provides useful statistical information by default. This method is the same as “Fisher scoring”, which is similar to Newton-Raphson but uses the expected value of the second derivative of the likelihood instead. This is the method used in R’s glm.fit function; a minimal sketch of the idea follows below.
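For the curious, here is a bare-bones IRLS sketch in Python. This is not R’s actual glm.fit, just the same idea under simplifying assumptions: one covariate, no separation or convergence diagnostics.

```python
import numpy as np

def fit_logistic_irls(x, y, n_iter=25, tol=1e-10):
    """Fit p(x) = 1 / (1 + exp(-(b0 + b1*x))) by iteratively reweighted least squares."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])   # design matrix [1, x_i]
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = X @ beta                          # linear predictor b0 + b1 * x_i
        p = 1.0 / (1.0 + np.exp(-eta))          # current fitted probabilities p_i
        w = p * (1.0 - p)                       # IRLS weights (Bernoulli variance)
        z = eta + (y - p) / w                   # working response
        # weighted least squares step: solve (X^T W X) beta = X^T W z
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# on the simulated fruit snack data from earlier, this lands near (-4, 0.08)
```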

If you come from a machine learning or engineering discipline, it’s common to think about minimizing loss functions rather than maximizing likelihood functions. There is a correspondence: in the case of logistic regression, maximizing the likelihood is equivalent to minimizing the cross-entropy between the fitted probabilities of our model and the data values. There is an additional factor of \frac{1}{n} that does not change the result of the optimization.
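To see the correspondence numerically, here is a small sketch comparing scikit-learn’s log_loss (the mean cross-entropy) with the negative log likelihood divided by n, for made-up probabilities and 0/1 outcomes.

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=50)              # made-up 0/1 outcomes
p = rng.uniform(0.05, 0.95, size=50)           # some made-up fitted probabilities

cross_entropy = log_loss(y, p, labels=[0, 1])                           # mean over observations
neg_loglik = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))          # -(1/n) * log likelihood
print(np.isclose(cross_entropy, neg_loglik))   # True
```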

Footnotes

  1. Note there are other related parameterizations floating around, but the one I give is the most common in statistics because the parameters enter linearly through the term \beta_0 + x\beta_1. This allows the function to fit into the framework of Generalized Linear Model theory.

  2. As is convention in statistics, a capital Y means we don’t know the value yet (it is random, or not yet observed), while a lower case y means that it’s known (fixed, or already observed).

  3. Odds do have a tremendous advantage in (mostly) epidemiological contexts: if you are trying to estimate the odds ratio of a disease conditional on some sort of exposure, an important property is that the odds ratio is symmetric with respect to the roles of disease and exposure. This implies you can estimate the odds of disease conditional on exposure the same way you would estimate the odds of exposure conditional on disease, i.e. you can use retrospective studies to estimate quantities you’d normally be interested in from prospective studies! Note that odds ratios are NOT the same as relative risk, another common quantity in epidemiology.

  4. Likelihood also differs from posterior probability in Bayesian statistics. In a Bayesian framework, we do think about parameters as random values, and thus we are able to make probabilistic statements about them. Bayes’ theorem describes the relationship between the prior probability, the posterior probability, and the likelihood function.