29 June 2023

Mastering Logistic Regression: Interpretation and Fitting

A logistic function would make an amazing shape for a sledding hill. I think the optimal slope parameter would be -1.5.

Introduction

At its core, a logistic regression is nothing more than fitting a special function, one that takes values between 0 and 1, to data. The most common use cases are when the response of interest is a dichotomous variable, meaning that the variable we’re modeling can take exactly one of two values. For example, a coin flip can be modeled as a dichotomous variable, where heads is 1 and tails is 0. Whether someone had breakfast or not today is also dichotomous. Sometimes, dichotomous variables are created from continuous variables. For example, high blood pressure can be modeled as dichotomous by creating a cutoff value for what is considered high blood pressure (\geq 130 mm Hg systolic). The “regression” part of the name comes in when you want to associate the response with some explanatory variables. For example, if we want to know the relationship between the number of hours you study (explanatory) and passing/failing an exam (response), we can use a logistic regression to model the strength of that relationship.

Logistic regression has found use across many disciplines, including classification in machine learning, ecological models of constrained growth, and estimation in retrospective studies, so it’s worth understanding it from first statistical principles in order to fully appreciate both the utility and the limitations of the method.

The Logistic Function

The magic of this method starts with the shape of the function itself, with a specific parameterization1 for regression purposes. Since this function is the core of the method, we’ll spend some time developing some intuition for how this function behaves.

p(x) = \frac{1}{1 + e^{-(\beta_0 + x\beta_1)}}

There are a few things I'd like you to note about this function (a small numerical check follows the list).

  • The range of this function is bounded between 0 and 1, while x can take on any value on the real line in the domain. This also means that \beta_0, \beta_1 are free to take on any value.
  • When we set \beta_0 = 0 and vary \beta_1, the p(x)-intercept always stays fixed at 0.5. In fact, \beta_0 by itself determines the p(x)-intercept, which is why we call \beta_0 the intercept parameter.
  • Similarly, fixing \beta_1 = 0 and varying \beta_0 gives a completely flat line that shifts up and down, implying x is not useful information. Notice also that we get diminishing movement for values further from 0: the intercept shifts more between \beta_0 = 0 \rightarrow 1 than from \beta_0 = 1 \rightarrow 2, etc.
  • The function has a symmetry about the inflection point, meaning the rate the function increases starting from the bottom is the same as the rate it decreases starting from the top. This implies the inflection point always occurs at p(x) = 0.5. A step-by-step proof of this can be found on Socratic Q&A.
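Here is a minimal sketch in Python that checks these properties numerically; the specific values of \beta_0 and \beta_1 are made up purely for illustration.

```python
import numpy as np

def logistic(x, b0, b1):
    """p(x) = 1 / (1 + exp(-(b0 + x*b1)))."""
    return 1.0 / (1.0 + np.exp(-(b0 + x * b1)))

# with b0 = 0 the curve crosses p(0) = 0.5 no matter what b1 is
print(logistic(0.0, b0=0.0, b1=-1.5), logistic(0.0, b0=0.0, b1=3.0))      # 0.5 0.5

# diminishing movement of the p(x)-intercept as b0 moves away from 0
print([round(logistic(0.0, b0=b, b1=1.0), 3) for b in (0, 1, 2, 3)])      # [0.5, 0.731, 0.881, 0.953]

# b1 = 0 gives a flat line: x carries no information
print(logistic(-10.0, b0=1.0, b1=0.0) == logistic(10.0, b0=1.0, b1=0.0))  # True
```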

Now with Probability!

Having a function that maps values between 0 and 1 sets the stage perfectly for introducing probability, which will also take values between 0 and 1! To make this into a concrete example, suppose I’m interested in modeling how many fruit snacks one can eat (explanatory variable, x) before getting a stomach ache (response variable, y). It’s reasonable to think about the probability of such an event since I could probably have 100 fruit snacks one day and get a stomach ache and do the same another day and not get a stomach ache. It is a little ambitious to think that there are no other variables that might affect this probability, and that the entire population will have the same tolerance for fruit snacks. For example, if kids can tolerate more fruit snacks than adults, we’d likely need to model more covariates, but we’ll just keep it simple for now.

Introducing the mathematical description, we’re modeling a single data point as a dichotomous (Bernoulli) random variable Y with parameter p. The value of Y is either Y = 1 for stomach ache or Y = 0 for no stomach ache.2 Here p is the probability that we observe a stomach ache. In this case, p will depend on the number of fruit snacks someone eats, so we should instead write p(x), where x is the number of fruit snacks. The probabilities are related to x by the logistic function p(x) defined above. If we’re making multiple observations Y_i given some values x_i, we’ll need a subscript i that indexes the observations and write p(x_i), since the probability is different for each observation. Finally, the coefficients for the logistic regression are \beta_0 and \beta_1, which (for now) we assume are given to us. We can include the dependence explicitly in p(x_i; \beta_0, \beta_1), using a “;” to separate our observations from our parameters. In symbols, this is all summarized by the statement

Y_i \mid x_i \sim \text{Bernoulli}(p(x_i; \beta_0, \beta_1))

If you’re curious, I read this as “Y sub i conditional on x sub i is distributed as a Bernoulli random variable with parameter p of x sub i given beta 0 and beta 1”. Often we’ll just remember that p depends on our parameters and data, so we can shorten the notation to p_i = p(x_i; \beta_0, \beta_1). I’ll shorten this further to Y_i \sim \text{Bernoulli}(p_i), since we’re thinking of the x_i as fixed values as well, sacrificing some precision for clarity.
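To make the data model concrete, here is a minimal simulation sketch. The coefficient values (\beta_0 = -4, \beta_1 = 0.08) and the range of fruit snack counts are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = -4.0, 0.08                        # hypothetical "true" parameters
x = rng.integers(0, 120, size=200)               # fruit snacks eaten, x_i
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))   # p_i = p(x_i; beta0, beta1)
y = rng.binomial(1, p)                           # Y_i | x_i ~ Bernoulli(p_i)
```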

This is our data model. We have now attached a probabilistic nature to the data events that we observe. Our data model provides a description of the probability of observing each event, known as a probability mass function (pmf). For the Bernoulli distribution, \text{Bernoulli}(p_i), the pmf is given by:

\begin{aligned} Pr(Y_i = y_i ; p_i) &= p_i^{y_i}(1 - p_i)^{1 - y_i} \\ &= \begin{cases} p_i & \text{if } y_i = 1 \\ 1-p_i & \text{if } y_i = 0 \end{cases} \end{aligned}

This is a notationally heavy way to say that the probability of getting a stomach ache (Y_i = 1) is p_i, and the probability of not getting one is 1 minus that. The precision of such a statement takes some getting used to, but trace the function through to make sure you can map the English back to the mathematical statement (a tiny numeric version of the pmf follows the notes below).

  • Y_i = y_i is a common point of confusion. Just think of Y_i as before we observe the data, so we can talk about the probability of such a random variable. y_i is the actual value and has no probability associated with it; you either had the stomach ache or not. For interpretation, you can always think about “plugging in” the value of the lower case variable, and we can read Pr(Y_i = 1) as “the probability of getting a stomach ache”.
  • Pr(Y_i = y_i ; p_i) again separates data from parameters, and sometimes the parameters of the pmf are left off. We need to be given our parameter p_i (which in turn is a function of the number of fruit snacks and \beta_0, \beta_1) in order to state the probability of observing an event Y_i.
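As a tiny sanity check of the notation, here is the pmf as a short Python function, evaluated at a made-up value of p_i = 0.3.

```python
def bernoulli_pmf(y, p):
    """Pr(Y = y; p) for y in {0, 1}."""
    return p**y * (1 - p)**(1 - y)

print(bernoulli_pmf(1, 0.3))  # 0.3, the probability of a stomach ache
print(bernoulli_pmf(0, 0.3))  # 0.7, the probability of no stomach ache
```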

Interpretation of Coefficients

With p(x) understood as the probability of an event, we can work out what \beta_0 and \beta_1 mean for the probability of an event. We need to invert the logistic function to get:

\beta_0 + x\beta_1 = \underbrace{\log\left(\frac{p(x)}{1 - p(x)}\right)}_{\text{log odds}}
Steps
\begin{aligned} p(x) &= \frac{1}{1 + \exp(-(\beta_0 + x\beta_1))} \\ \frac{1}{p(x)} &= 1 + \exp(-(\beta_0 + x\beta_1)) \\ \frac{1}{p(x)} - 1 &= \exp(-(\beta_0 + x\beta_1)) \\ \log\left(\frac{1 - p(x)}{p(x)}\right) &= -(\beta_0 + x\beta_1) \\ \log\left(\frac{p(x)}{1 - p(x)}\right) &= \beta_0 + x\beta_1 \end{aligned}

The term appearing on the right hand side has a name: we call it the log odds. The term “odds” is used the same way you would colloquially, as in, if you had a 1/3 chance of winning a game, your odds are 1:2. If you plug p(x) = 1/3 into the formula above, the log odds on the right hand side is \log(\frac{1}{2}). Thus, a 1 unit increase in \beta_0 implies a 1 unit increase in the log odds of the event happening. Similarly, a 1 unit increase in x implies a \beta_1 unit increase on the log odds scale. If you’re a “show me the math” type of person, suppose we have two observations: someone eating 55 fruit snacks and someone else eating 56, x_1 = 55, x_2 = 56 (a one unit difference in x). If we take the difference of these two scenarios,

\begin{aligned} \beta_0 + x_2\beta_1 &= \log\left(\frac{p(x_2)}{1 - p(x_2)}\right) \\ -\qquad \beta_0 + x_1\beta_1 &= \log\left(\frac{p(x_1)}{1 - p(x_1)}\right) \\ \hline \\ (x_2 - x_1)\beta_1 &= \text{log odds}(x_2) - \text{log odds}(x_1) \\ \beta_1 &= \frac{\text{log odds}(x_2) - \text{log odds}(x_1)}{x_2 - x_1} \end{aligned}

We say that \beta_1 is the slope on the log odds scale, and \beta_0 is the intercept, i.e. the log odds when x = 0. In our scenario of x_1 = 55, x_2 = 56, a one fruit snack difference (so the denominator is one), \beta_1 is directly interpreted as the difference in log odds.
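If you’d rather check this numerically than algebraically, here is a short sketch, reusing the made-up coefficients from the simulation above, confirming that a one snack increase moves the log odds by exactly \beta_1.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def logistic(x, b0, b1):
    return 1.0 / (1.0 + np.exp(-(b0 + x * b1)))

b0, b1 = -4.0, 0.08                 # same hypothetical values as before
x1, x2 = 55, 56
diff = logit(logistic(x2, b0, b1)) - logit(logistic(x1, b0, b1))
print(np.isclose(diff, b1))         # True: one more snack adds b1 to the log odds
```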

To be frank, interpreting the coefficients on a log odds scale is quite futile: as the logistic function above shows, the effect of a change on the log odds scale depends on where you are on that scale, with larger changes in probability when you are close to 0. About the most I take from it is that if the log odds are positive, the probability is greater than one half.

Even exponentiating both sides to get something on the “odds” scale is quite dubious. I suspect the reason odds are used so often in the gambling industry is that they prey on the fact that odds are fraught with misinterpretations. Seeing odds of 1:2, you’ll likely jump to thinking about a probability of 1/2, yet there’s quite a large difference between winning probabilities of 1/3 and 1/2. Conversely, when talking about “payouts”, for example on a craps table, you’ll commonly see that an Any 7 bet pays “5 for 1”, whose true meaning is a 4:1 payout: you get $4 for every $1 you put down. When it comes to taking your money, the casino will make it seem less likely you’ll lose money (odds), but for paying you out, they’ll make it seem that you get more money (proportion)! Mixing these usages is intentionally confusing, and that’s before we even discuss ratios of odds, which is what \exp(\beta_1) would represent, and which are still difficult to interpret. Avoid odds if you can.3

My recommendation is to use the function to translate things onto the probability scale; you’ll likely have a more intuitive understanding of the values (better yet, plot the entire probability curve as a function of your variable). However, one of my professors pointed me toward a Car Talk riddle about potatoes that highlights some misconceptions about the probability scale as well. Be careful with interpretation, it’s very easy to get tripped up!
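In practice that translation is a single line of code. Here is a small sketch with a hypothetical helper, prob_from_log_odds, which makes the 1:2 odds versus 1/2 probability confusion obvious.

```python
import numpy as np

def prob_from_log_odds(log_odds):
    """Invert the log odds back onto the probability scale."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# odds of 1:2 correspond to a probability of 1/3, not 1/2
print(prob_from_log_odds(np.log(1 / 2)))   # ~0.333
```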

Fitting the Model

The fitted logistic regression model is shown in the equation above. As you drag the dots around, do you agree with how the function is fitting? Is there anything surprising about the fit? Take note of situations where the function looks linear. Also note that something embarrassing happens when you completely separate the data points, which we’ll discuss later.

Now that we know the general form and interpretation of the logistic function, we can take our data and find the best fit model. Our observations take the form of either 1 or 0, so the drag points are constrained. A major question is how we decide what “best fit” means. There are various criteria for determining that, but most center around the idea of likelihood.

The Likelihood Principle

Likelihood is a frequently encountered principle for estimating parameters given some data. It is different from the concept of “probability”, even though we commonly use those words interchangeably. In fact, the fundamental idea of likelihood is that it’s kind of the “opposite” of probability. If we think about our data model as the statistical world, and our observations as the real world, the probability function is how we move from our statistical model to the observations. In turn, the likelihood function uses what we observe in the real world to inform us of better parameters for our data model in the statistical world. We feed parameters into our probability function, and we feed data into our likelihood function.

\begin{CD} \text{Data Model} @>\text{parameters}>> \text{Probability Function} \\ @A\text{estimate}AA @VV\text{generate}V \\ \text{Likelihood Function} @<\text{data}<< \text{Observations} \end{CD}

The probability function allows us to think how observations in the real world come from our data model in the statistical world. The likelihood function allows us to move in the other direction and think about “likely” data models given what we have observed in the real world.

Let’s skip thinking about the regression p(x_i) for now, and just think about observations from a data model with a single parameter p. The definition of likelihood for a single observation from a Bernoulli distribution is:

\mathcal{L}(p; y) = Pr(Y = y; p)

For multiple independent observations, we multiply the probabilities,

\mathcal{L}(p; y_1, y_2, \dots, y_n) = \prod_{i=1}^n Pr(Y_i = y_i; p)

What?! Likelihood is defined by the probability function?! So does this mean likelihood can be interpreted as the probability that p is a particular value? Unfortunately no, and I get why this is confusing at first. In frequentist statistics, parameters are thought of as fixed values of the population, i.e. there is no randomness associated with parameters. Hence, it does not make sense to discuss the probability that p takes a certain value.4 If you pause and think about what the definition says though, it will make sense, because there must be a correspondence between the data model and our observations. It’s a two way road; it’s just a matter of perspective which direction you’re currently moving in.

Substituting in our pmf for a Bernoulli distribution above, we get

\mathcal{L}(p; y_1, y_2, \dots, y_n) = p^{\left(\sum_i^n y_i\right)}(1 - p)^{\left(n - \sum_i^n y_i\right)}

Higher values of \mathcal{L}(p) mean that value of p is more likely to have generated the data we fed in. Relative values of likelihood are also meaningful; in fact, this forms the basis of the likelihood ratio test. To find the most likely value of p given our data, we maximize the function. We call the value of p that maximizes this function the maximum likelihood estimate (MLE) of p. To denote estimates of parameters, you’ll commonly see little hats on the parameter, \widehat p_{MLE}. Since the objective is to maximize this function, we normally take the log of the likelihood, which doesn’t change the location of the maximum since log is a monotonic function. The log likelihood lets us avoid dealing with the exponents, and the expression is,

\ell(p; y_1, y_2, \dots, y_n) = \left(\sum_i^n y_i\right) \log p + \left(n - \sum_i^n y_i\right) \log (1 - p)
  • I color coded the sliders, but prop is more a proxy for controlling the number of observed y_i. The observations are actually calculated as \sum_i^n y_i = \mathrm{round}(n \times \text{prop}). The MLE will bounce around because of that round function, since our observations need to be a whole number.
  • The maximum of the function (MLE) tends to follow the proportion of 1's observed in the data, as expected. If we saw a baseball player hit 20 / 100 pitches, we'd instinctively say they have a 20% chance of hitting pitches.
  • Notice that changing n does not change the location of the maximum (other than rounding), but it does change the curvature of the likelihood around the maximum. The curvature is intimately related to the standard error, so it makes sense that as we observe more data with the same proportion of 1's, we become more certain about our estimate of p. (A short numerical check of these points follows below.)
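Here is a minimal numerical check of these points, evaluating the log likelihood above on a grid of p values for made-up data with 20 ones out of 100 observations.

```python
import numpy as np

def bernoulli_loglik(p, y):
    """Log likelihood of p given 0/1 observations y."""
    s, n = y.sum(), y.size
    return s * np.log(p) + (n - s) * np.log(1 - p)

y = np.array([1] * 20 + [0] * 80)         # e.g. 20 hits out of 100 pitches
grid = np.linspace(0.001, 0.999, 999)
p_hat = grid[np.argmax(bernoulli_loglik(grid, y))]
print(p_hat, y.mean())                    # both ~0.20: the MLE is the sample proportion
```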

In this situation, it’s not difficult to calculate a closed form solution for the maximum likelihood estimate. We take the derivative, set it equal to 0, and solve for p to get:

\widehat p_{MLE} = \frac{\sum_i^n y_i}{n}
proof
\begin{aligned} \frac{\partial \ell(p; y_1, y_2, \dots, y_n)}{\partial p} &= \frac{\sum_i^n y_i}{p} - \frac{n - \sum_i^n y_i}{1 - p} = 0 \\ \frac{\sum_i^n y_i}{p} &= \frac{n - \sum_i^n y_i}{1 - p} \\ \left(\sum_i^n y_i\right)(1 - p) &= \left(n - \sum_i^n y_i\right)p \\ \sum_i^n y_i - p\sum_i^n y_i &= np - p\sum_i^n y_i \\ \sum_i^n y_i &= np \\ \widehat p_{MLE} &= \frac{\sum_i^n y_i}{n} \end{aligned}

There’s a good reason the likelihood principle is so prominent: the likelihood function is very useful for guiding us toward good estimates of the parameters in our model. However, finding an estimate for \beta_0 and \beta_1 is not as straightforward when we try to associate data with our logistic function, and unlike the simplified case, no closed form solution exists.

Likelihood for Regression

The likelihood function for logistic regression is instead a function of the parameters \beta_0, \beta_1. Instead of just the number of 1’s and 0’s we observe in the data \bm y = y_1, y_2, \ldots, y_n, we must also take into consideration the values \bm x = x_1, x_2, \ldots, x_n that we observe. In a sense, the likelihood function carries over to the next stage of modeling, since p_i is itself a function of \beta_0, \beta_1, and x_i. The likelihood is given by substituting the logistic function for p in the Bernoulli likelihood function. If we simplify the expression, we get

\ell(\beta_0, \beta_1; \bm x, \bm y) = -\sum_i^n \log\left(1 + e^{\beta_0 + \beta_1 x_i}\right) + \sum_i^n y_i(\beta_0 + \beta_1 x_i)
proof

For a single x_i, y_i, we have:

\begin{aligned} \ell(\beta_0, \beta_1; x_i, y_i) &= y_i \log p_i + (1 - y_i) \log (1 - p_i) \\ &= y_i \underbrace{\log \left(\frac{p_i}{1 - p_i}\right)}_{\text{log odds}} + \log (1 - p_i) \end{aligned}

We found in the interpretation section that the log odds are the same as the linear predictor, \log \frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_i, and substituting the logistic function for p_i,

\begin{aligned} \ldots &= y_i (\beta_0 + \beta_1 x_i) + \log (1 - p_i) \\ &= y_i (\beta_0 + \beta_1 x_i) + \log\left(1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}\right) \\ &= y_i (\beta_0 + \beta_1 x_i) + \log\left(\frac{\left(1 + e^{-(\beta_0 + \beta_1 x_i)}\right) - 1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}\right) \\ &= y_i (\beta_0 + \beta_1 x_i) + \log\left(\frac{e^{-(\beta_0 + \beta_1 x_i)}}{1 + e^{-(\beta_0 + \beta_1 x_i)}} \cdot \frac{e^{\beta_0 + \beta_1 x_i}}{e^{\beta_0 + \beta_1 x_i}}\right) \\ &= y_i (\beta_0 + \beta_1 x_i) - \log\left(1 + e^{\beta_0 + \beta_1 x_i}\right) \end{aligned}

For multiple observations, we just sum the log-likelihoods, and we get the expression shown.

It’s certainly not a friendly expression, and we are now dealing with two parameters \beta_0, \beta_1 instead of just a single p, so we are optimizing over a surface instead of just a curve, but we can still visualize the likelihood.
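Even without visualizing the surface, we can hand this log likelihood to a general purpose optimizer. Below is a minimal sketch, assuming data simulated as in the earlier example; it minimizes the negative of the simplified expression above and recovers coefficients near the generating values.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, x, y):
    """Negative log likelihood of (beta0, beta1) in the simplified form above."""
    eta = beta[0] + beta[1] * x
    return np.sum(np.log1p(np.exp(eta))) - np.sum(y * eta)

# made-up data, generated as in the simulation sketch earlier
rng = np.random.default_rng(0)
x = rng.integers(0, 120, size=200).astype(float)
p = 1.0 / (1.0 + np.exp(-(-4.0 + 0.08 * x)))
y = rng.binomial(1, p)

fit = minimize(neg_loglik, x0=np.zeros(2), args=(x, y), method="BFGS")
print(fit.x)   # estimates of (beta0, beta1), roughly near (-4, 0.08)
```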

Hauck-Donner Effect

Above we saw that the logistic function “disappears” when we separate the data. Can you rationalize why this happens as you play around with the likelihood plot and the logistic regression function? The maximum likelihood estimate no longer exists (or is not numerically stable) in these situations because the likelihood surface becomes very flat and the coefficients escape off to infinity. We can make the magnitude of \beta_1 arbitrarily large, because the logistic function simply becomes sharper, and there are then many values \beta_0 can take on. This is the problem of complete separation, and it goes hand in hand with the Hauck-Donner effect, where the Wald standard errors blow up along with the coefficients; it happens often in logistic regression, especially in higher dimensional spaces. It’s not the only way this can happen, but it’s important to keep in mind as you diagnose possible issues with your model. One sign is that the standard errors of your \beta coefficients will be very large.

[Figure: logistic curves p(x) against x, bounded between 0 and 1.]

When \beta_1 \rightarrow \infty, the logistic function starts to look like a vertical line separating the points, while \beta_0 still controls the location of the separation. The reason the likelihood fails to give estimates for perfectly separated data is that it can’t choose between any of these possible values.
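A quick way to see this numerically: for a perfectly separated toy data set, the log likelihood keeps creeping toward 0 as \beta_1 grows (with \beta_0 rescaled so the jump stays between the two groups), so no finite maximizer exists. A minimal sketch with made-up data:

```python
import numpy as np

def loglik(b0, b1, x, y):
    eta = b0 + b1 * x
    return np.sum(y * eta) - np.sum(np.log1p(np.exp(eta)))

# perfectly separated toy data: every y = 0 has a smaller x than every y = 1
x = np.array([1., 2., 3., 4., 6., 7., 8., 9.])
y = np.array([0,  0,  0,  0,  1,  1,  1,  1 ])

# keep the jump centered at x = 5 by setting b0 = -5 * b1
for b1 in (1, 2, 4, 8):
    print(b1, loglik(-5 * b1, b1, x, y))   # increases toward 0, never reaching a maximum
```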

There are various ways of dealing with this issue, many of them philosophical in nature:

  • If you’re using logistic regression for classification, you can choose any of the values and get the same classification. You can also consider using a simpler model, like a threshold on x.
  • Change your model by including different variables or excluding some. It’s possible that with this separation, you were asking an ill-posed question in the first place, and a different formulation should be considered.
  • You can also adjust the likelihood function with a penalty for large coefficient values, commonly a ridge or lasso penalty. This is the approach of Python’s scikit-learn implementation.
  • Change to a Bayesian framework: if we add a prior distribution on the coefficients, we can get a well defined posterior distribution. The bayesglm function in the R package arm uses this approach.
  • Other solutions are listed on this Stack Exchange post.

Optimization Topics

Given that this is a relatively standard nonlinear (convex) optimization problem, you have a choice of many methods that will give you the same optimal answer. I won’t get too much into the nitty-gritty, as chances are this will mostly be done for you by a computing program. Probably the most common implementations are:

  • Gradient Descent — A general purpose descent algorithm that is normally used for simplicity. It’s not as efficient as other methods.
  • Newton-Raphson — A faster algorithm that uses the second derivative of the likelihood function to find the optimal solution. However, looking at the surface of the likelihood function, a quadratic approximation is not always reliable, especially as the likelihood function becomes flatter.
  • Iteratively Reweighted Least Squares — Used for generalized linear models because each iteration solves a weighted least squares problem, which also provides useful statistical information by default. This method is the same as “Fisher scoring”, which is similar to Newton-Raphson but uses the expected value of the second derivative of the likelihood instead. This is the method used in R’s glm.fit function; a minimal sketch of the idea follows below.
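For the curious, here is a bare-bones IRLS sketch in Python. This is not R’s actual glm.fit, just the same idea under simplifying assumptions: one covariate, no separation or convergence diagnostics.

```python
import numpy as np

def fit_logistic_irls(x, y, n_iter=25, tol=1e-10):
    """Fit p(x) = 1 / (1 + exp(-(b0 + b1*x))) by iteratively reweighted least squares."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])   # design matrix [1, x_i]
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = X @ beta                          # linear predictor b0 + b1 * x_i
        p = 1.0 / (1.0 + np.exp(-eta))          # current fitted probabilities p_i
        w = p * (1.0 - p)                       # IRLS weights (Bernoulli variance)
        z = eta + (y - p) / w                   # working response
        # weighted least squares step: solve (X^T W X) beta = X^T W z
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# on the simulated fruit snack data from earlier, this lands near (-4, 0.08)
```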

If you come from a machine learning or engineering discipline, it’s common to think about minimizing loss functions rather than maximizing likelihood functions. There is a correspondence: in the case of logistic regression, maximizing the likelihood is equivalent to minimizing the cross-entropy between the fitted probabilities of our model and the data values. There is an additional factor of \frac{1}{n} that does not change the result of the optimization.
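To see the correspondence numerically, here is a small sketch comparing scikit-learn’s log_loss (the mean cross-entropy) with the negative log likelihood divided by n, for made-up probabilities and 0/1 outcomes.

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=50)              # made-up 0/1 outcomes
p = rng.uniform(0.05, 0.95, size=50)           # some made-up fitted probabilities

cross_entropy = log_loss(y, p, labels=[0, 1])                           # mean over observations
neg_loglik = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))          # -(1/n) * log likelihood
print(np.isclose(cross_entropy, neg_loglik))   # True
```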

Footnotes

  1. Note there are other related parameterizations floating around, but the one I give is the most common in statistics because the parameters enter linearly through the term \beta_0 + x\beta_1. This allows the function to fit into the framework of Generalized Linear Model theory.

  2. As is convention in statistics, a capital Y means we don’t know the value yet (it is random, or not yet observed), while a lower case y means that it’s known (fixed, or already observed).

  3. Odds do have a tremendous advantage in (mostly) epidemiological contexts: if you are trying to estimate the odds ratio of a disease conditional on some sort of exposure, an important property is that the odds ratio is symmetric with respect to the roles of disease and exposure. This implies you can estimate the odds of disease conditional on exposure the same way you would estimate the odds of exposure conditional on disease, i.e. you can use retrospective studies to estimate quantities you’d normally be interested in from prospective studies! Note that odds ratios are NOT the same as relative risk, another common quantity in epidemiology.

  4. Likelihood also differs from posterior probability in Bayesian statistics. In a Bayesian framework, we do think about parameters as random values, and thus we are able to make probabilistic statements about them. Bayes’ theorem describes the relationship between the prior probability, the posterior probability, and the likelihood function.