Filippo Gambarota
University of Padova
2023
Last Update: 2023-11-29
We want to measure the impact of watching tv-shows on the probability of passing the statistics exam.
exam
: passing the exam (1 = “passed”, 0 = “failed”)

tv_shows
: watching tv-shows regularly (1 = “yes”, 0 = “no”)

#> # A tibble: 9 × 2
#> tv_shows exam
#> <chr> <chr>
#> 1 1 0
#> 2 1 0
#> 3 1 1
#> 4 1 1
#> 5 ... ...
#> 6 0 0
#> 7 0 0
#> 8 0 1
#> 9 0 1
We can create the contingency table:
xtabs(~exam + tv_shows, data = dat) |>
addmargins()
#> tv_shows
#> exam 0 1 Sum
#> 0 35 22 57
#> 1 15 28 43
#> Sum 50 50 100
Each cell probability \(p_{ij}\) is computed as \(n_{ij}/n\), i.e., the cell count divided by the total number of observations \(n\):
(xtabs(~exam + tv_shows, data = dat)/n) |>
addmargins()
#> tv_shows
#> exam 0 1 Sum
#> 0 0.35 0.22 0.57
#> 1 0.15 0.28 0.43
#> Sum 0.50 0.50 1.00
The most common way to analyze a 2x2 contingency table is using the odds ratio (OR). First, let’s define the odds of success as:
\[\begin{align*} odds = \frac{p}{1 - p} \\ p = \frac{odds}{odds + 1} \end{align*}\]
For the exam example, the odds of passing can be computed separately for the two tv_shows groups (see the sketch below the OR formula).
The OR is a ratio of odds:
\[ OR = \frac{\frac{p_1}{1 - p_1}}{\frac{p_2}{1 - p_2}} \]
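As a minimal sketch, we can compute the two odds and their ratio directly from the counts in the contingency table above:

p1 <- 28 / 50           # p(pass | tv_shows = 1)
p0 <- 15 / 50           # p(pass | tv_shows = 0)
odds1 <- p1 / (1 - p1)  # odds of passing when watching tv-shows
odds0 <- p0 / (1 - p0)  # odds of passing when not watching tv-shows
odds1 / odds0           # odds ratio, ~2.97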
The odds have an interesting property when taking the logarithm: we can express a probability \(p\) on a scale ranging over \((-\infty, +\infty)\).
We consider a study conducted by the University of Padova (TEDDY Child Study, 2020). Within the study, researchers asked the participants (mothers of a young child) about the presence of post-partum depression and measured parental stress using the PSI (Parenting Stress Index).
ID | Parental_stress | Depression_pp |
---|---|---|
1 | 75 | No |
2 | 51 | No |
3 | 76 | No |
4 | 88 | No |
... | ... | ... |
376 | 67 | No |
377 | 71 | No |
378 | 63 | No |
379 | 70 | No |
We want to see if parental stress increases the probability of having post-partum depression:
Let’s start by fitting a linear model Depression_pp ~ Parental_stress. We consider “Yes” as 1 and “No” as 0.
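A sketch of the code that could produce the output below; the 0/1 recoding into Depression_pp01 and the object name fit_lm are assumptions:

teddy$Depression_pp01 <- ifelse(teddy$Depression_pp == "Yes", 1, 0) # assumed recoding
fit_lm <- lm(Depression_pp01 ~ Parental_stress, data = teddy)
summary(fit_lm)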
#>
#> Call:
#> lm(formula = Depression_pp01 ~ Parental_stress, data = teddy)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.42473 -0.13768 -0.10003 -0.05768 0.94702
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.172900 0.077561 -2.229 0.026389 *
#> Parental_stress 0.004706 0.001201 3.919 0.000105 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3239 on 377 degrees of freedom
#> Multiple R-squared: 0.03915, Adjusted R-squared: 0.0366
#> F-statistic: 15.36 on 1 and 377 DF, p-value: 0.0001054
Let’s add the fitted line to our plot:
… and check the residuals, pretty bad right?
As for the exam example, we could compute a sort of contingency table, even though Parental_stress is a numerical variable, by creating some discrete categories (just for exploratory analysis):
table(teddy$Depression_pp, teddy$Parental_stress_c) |>
prop.table(margin = 2) |>
round(2)
#>
#> < 40 40-60 60-80 80-100 > 100
#> No 0.92 0.87 0.79 0.60
#> Yes 0.08 0.13 0.21 0.40
Ideally, we could compute the increase in the odds of having post-partum depression as parental stress increases. In fact, as we are going to see, the Binomial GLM is able to estimate the non-linear increase in the probability.
The logit link is the most common link function when using a binomial GLM:
\[ log \left(\frac{p}{1 - p}\right) = \beta_0 + \beta_{1}X_{1} + ...\beta_pX_p \]
The inverse of the logit maps the linear predictor back into the \([0, 1]\) probability range:
\[ p = \frac{e^{\beta_0 + \beta_{1}X_{1} + ...\beta_pX_p}}{1 + e^{\beta_0 + \beta_{1}X_{1} + ...\beta_pX_p}} \]
Thus, with a single numerical predictor \(x\), the relationship between \(x\) and \(p\) is non-linear on the probability scale but linear on the logit scale.
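A minimal sketch (with made-up coefficients) showing the same linear predictor on the two scales:

b0 <- -2; b1 <- 0.5   # hypothetical coefficients
xs <- seq(-5, 10, 0.1)
eta <- b0 + b1 * xs   # linear on the logit scale
p <- plogis(eta)      # S-shaped on the probability scale
op <- par(mfrow = c(1, 2))
plot(xs, eta, type = "l", xlab = "x", ylab = "logit(p)")
plot(xs, p, type = "l", xlab = "x", ylab = "p")
par(op)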
The problem is that the effects are non-linear on the probability scale, thus it is more difficult to interpret and report model results.
We can model the contingency table presented before. We put data in binary form:
#> tv_shows
#> exam 0 1
#> 0 35 22
#> 1 15 28
#> # A tibble: 9 × 2
#> tv_shows exam
#> <chr> <chr>
#> 1 1 0
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 ... ...
#> 6 0 0
#> 7 0 1
#> 8 0 0
#> 9 0 0
Let’s start from the simplest model (often called null model) where there are no predictors:
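A sketch of the call reproducing the output below (the object name fit0 is an assumption):

fit0 <- glm(exam ~ 1, data = dat, family = binomial(link = "logit"))
summary(fit0)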
#>
#> Call:
#> glm(formula = exam ~ 1, family = binomial(link = "logit"), data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.060 -1.060 -1.060 1.299 1.299
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.2819 0.2020 -1.395 0.163
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 136.66 on 99 degrees of freedom
#> Residual deviance: 136.66 on 99 degrees of freedom
#> AIC: 138.66
#>
#> Number of Fisher Scoring iterations: 4
When fitting an intercept-only model, the parameter is the average value of the y variable:
\[\begin{align*} \log(\frac{p}{1 - p}) = \beta_0 \\ p = \frac{e^{\beta_0}}{1 + e^{\beta_0}} \end{align*}\]
In R, the \(logit(p)\) is computed using qlogis(), i.e., the q + logis combination following R's naming scheme for probability distribution functions. The \(logit^{-1}\), i.e., the inverse of the logit function, is plogis():
If you are not sure about how to transform using the link function, you can directly access the family object in R, which contains the appropriate link function and the corresponding inverse:
bin <- binomial(link = "logit")
bin$linkfun(0.7)          # the same as qlogis(0.7)
bin$linkinv(qlogis(0.7))  # the same as plogis(), back to 0.7
Now we can add the tv_shows predictor, so the model has two coefficients. Given that tv_shows is a binary variable, the intercept is the average y when tv_shows is 0, and the tv_shows coefficient is the increase in y for a unit increase in tv_shows:
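A sketch of the call reproducing the output below (the object name fit_tv is an assumption):

fit_tv <- glm(exam ~ tv_shows, data = dat, family = binomial(link = "logit"))
summary(fit_tv)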
#>
#> Call:
#> glm(formula = exam ~ tv_shows, family = binomial(link = "logit"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2814 -0.8446 -0.8446 1.0769 1.5518
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.8473 0.3086 -2.746 0.00604 **
#> tv_shows 1.0885 0.4200 2.592 0.00956 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 136.66 on 99 degrees of freedom
#> Residual deviance: 129.68 on 98 degrees of freedom
#> AIC: 133.68
#>
#> Number of Fisher Scoring iterations: 4
Thinking about our data, the (Intercept) is (on the logit scale) the probability of passing the exam without watching tv-shows. The tv_shows coefficient should be the difference in the probability of passing the exam between people who watched or did not watch tv-shows, BUT the coefficients are on the logit (log odds) scale. Taking the exponential of the tv_shows coefficient we obtain the ratio of the odds of passing the exam when watching vs not watching tv-shows. Do you remember something? The exponentiated tv_shows coefficient is exactly the odds ratio that we calculated on the contingency table:
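A minimal sketch of this check, using the fit_tv object assumed above and the counts from the contingency table:

exp(coef(fit_tv)["tv_shows"])  # exp(1.0885), ~2.97
(28 / 22) / (15 / 35)          # odds ratio computed from the table, ~2.97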
Given the non-linearity and the link function, parameter interpretation is not easy for GLMs. In the case of the Binomial GLM we will see:
Model coefficients are interpreted in the same way as in standard regression models. The big difference concerns:
Using contrast coding and centering/standardizing we can make model coefficients easier to interpret.
When we use a categorical predictor with \(p\) levels, the model will estimate \(p - 1\) parameters. The interpretation of these parameters is controlled by the contrast coding. In R the default is the treatment coding (or dummy coding). Essentially R creates \(p - 1\) dummy variables (0 and 1) where 0 is the reference level (usually the first category) and 1 is the current level. We can see the coding scheme using the model.matrix() function that returns the \(\boldsymbol{X}\) matrix:
#> # A tibble: 9 × 2
#> X.Intercept. tv_shows
#> <chr> <chr>
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 ... ...
#> 6 1 0
#> 7 1 0
#> 8 1 0
#> 9 1 0
In the simple case of the exam dataset, the intercept (\(\beta_0\)) is the value at the reference level (by default the first level, coded as 0) and \(\beta_1\) is the difference between the other level and the reference level. If we change the order of the levels we change the intercept value while \(\beta_1\) stays the same (up to its sign). As an example we could use the so-called sum-to-zero coding where instead of assigning 0 and 1 we use different values. For example, assigning -0.5 and 0.5 will make the intercept the grand mean:
With numerical predictors the idea is the same as for categorical predictors. In fact, categorical predictors are converted into numbers (i.e., 0 and 1, or -0.5 and 0.5). The only caveat is that the effects are linear only on the logit scale. Thus \(\beta_1\) is interpreted in the same way as in standard linear models only on the link-function scale. For the binomial GLM, \(\beta_1\) is the increase in the \(\log(odds(p))\) for a unit increase in \(x\). On the response scale, \(e^{\beta_1}\) is the multiplicative increase in the odds of \(y = 1\) for a unit increase in the predictor.
With numerical predictors we can mean-center and/or standardize the predictors. With centering (similarly to the categorical example) we change the interpretation of the intercept. Standardizing is helpful to have more meaningful \(\beta\) values: the \(\beta_i\) of a standardized predictor is the increase in \(y\) for an increase of one standard deviation in \(x\).
\[ x_{cen} = x_i - \bar{x} \]
\[ x_{z} = \frac{x_i - \bar{x}}{\sigma_x} \]
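A minimal sketch for the teddy data (the new column names are assumptions):

teddy$Parental_stress_cen <- teddy$Parental_stress - mean(teddy$Parental_stress) # centered
teddy$Parental_stress_z <- scale(teddy$Parental_stress)[, 1]                     # standardized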
Let’s return to our teddy child example and fit the proper model:
fit_glm <- glm(Depression_pp01 ~ Parental_stress, data = teddy, family = binomial(link = "logit"))
summary(fit_glm)
#>
#> Call:
#> glm(formula = Depression_pp01 ~ Parental_stress, family = binomial(link = "logit"),
#> data = teddy)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2852 -0.5165 -0.4509 -0.3861 2.3096
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.323906 0.690689 -6.260 3.84e-10 ***
#> Parental_stress 0.036015 0.009838 3.661 0.000251 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 284.13 on 378 degrees of freedom
#> Residual deviance: 271.23 on 377 degrees of freedom
#> AIC: 275.23
#>
#> Number of Fisher Scoring iterations: 5
The (Intercept) (\(\beta_0\)), once back-transformed with the inverse logit, is the probability of having post-partum depression for a mother with a parental stress of zero (maybe centering would be better?):
\[ p(yes|x = 0) = g^{-1}(\beta_0) \]
The Parental_stress coefficient (\(\beta_1\)) is the increase in the \(\log(odds)\) of having post-partum depression for a unit increase in the Parental Stress Index. If we take the exponential of \(\beta_1\) we obtain the multiplicative increase in the \(odds\) of having post-partum depression for a unit increase in the Parental Stress Index.
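A minimal sketch of these back-transformations for the model above:

plogis(coef(fit_glm)[1]) # p(depression) when Parental_stress = 0, ~0.013
exp(coef(fit_glm)[2])    # odds ratio for a 1-point increase in stress, ~1.037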
The problem is that, as shown before, the effects are non-linear on the probability scale while they are linear on the logit scale. On the logit scale, all differences are constant:
While on the probability scale, the differences are not the same. This is problematic when interpreting the results of a Binomial GLM with a numerical predictor.
The divide by 4 rule is a very easy way to evaluate the effect of a continuous predictor.
Given the non-linearity, the derivative of the logistic function (i.e., the slope) is maximal for probabilities around ~\(0.5\).
In fact, \(\beta_i p (1 - p)\) is maximized when \(p = 0.5\) turning into \(\beta_i 0.25\) (i.e., dividing by 4).
Dividing \(\beta_i\) by 4 we obtain the maximal slope, thus the maximal difference in probability associated with a unit increase in \(x\).
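For the teddy model, this gives an upper bound on the change in probability per unit of parental stress:

coef(fit_glm)["Parental_stress"] / 4 # ~0.009, i.e., at most ~0.9 percentage points per unit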
In a similar way, we can use the inverse logit function to find the predicted probability for specific values of \(x\). For example, the difference \(p(y = 1|x = 2.5) - p(y = 1|x = 5)\) can be calculated using the model equation:
In R we can directly use the predict() function with the argument type = "response" to return the predicted probabilities instead of the logits:
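A sketch for the teddy model at two assumed stress levels:

nd <- data.frame(Parental_stress = c(50, 100))
predict(fit_glm, newdata = nd, type = "response") # predicted probabilities of depression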
I have written the epredict() function that extends the predict() function, giving some useful messages when computing predictions. You can use it with every model and also with multiple predictors.
A marginal effect measures the association between a change in a predictor (\(x\)), and a change in the response \(y\).
The slope can be evaluated for any level of \(x\). In fact, the divide-by-4 rule is the maximal slope evaluated at a specific \(x\).
The divide-by-4 rule gives the maximal marginal effect. The average marginal effect is the average of all slopes. This can be easily computed with the marginaleffects package:
library(marginaleffects)
# all marginal effects
slopes(fit)
#> # A tibble: 100 × 14
#> rowid term estimate std.error statistic p.value s.value conf.low conf.high
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 x 0.00403 0.00354 1.14 2.55e-1 1.97 -2.91e-3 0.0110
#> 2 2 x 0.137 0.0277 4.95 7.47e-7 20.4 8.28e-2 0.191
#> 3 3 x 0.0489 0.0213 2.30 2.15e-2 5.54 7.23e-3 0.0906
#> 4 4 x 0.0383 0.0188 2.04 4.15e-2 4.59 1.48e-3 0.0750
#> 5 5 x 0.222 0.0407 5.45 4.91e-8 24.3 1.42e-1 0.302
#> 6 6 x 0.0526 0.0193 2.72 6.48e-3 7.27 1.47e-2 0.0905
#> 7 7 x 0.0811 0.0229 3.54 3.98e-4 11.3 3.62e-2 0.126
#> 8 8 x 0.199 0.0386 5.16 2.52e-7 21.9 1.23e-1 0.275
#> 9 9 x 0.00353 0.00319 1.11 2.68e-1 1.90 -2.72e-3 0.00979
#> 10 10 x 0.0270 0.0136 1.99 4.68e-2 4.42 3.82e-4 0.0537
#> # ℹ 90 more rows
#> # ℹ 5 more variables: predicted_lo <dbl>, predicted_hi <dbl>, predicted <dbl>,
#> # y <int>, x <dbl>
# marginal effects when x = 5
slopes(fit, newdata = data.frame(x = 5))
#> # A tibble: 1 × 14
#> rowid term estimate std.error statistic p.value s.value conf.low conf.high
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 x 0.177 0.0338 5.22 1.76e-7 22.4 0.110 0.243
#> # ℹ 5 more variables: predicted_lo <dbl>, predicted_hi <dbl>, predicted <dbl>,
#> # x <dbl>, y <int>
# average marginal effect
avg_slopes(fit)
#> # A tibble: 1 × 8
#> term estimate std.error statistic p.value s.value conf.low conf.high
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 x 0.0934 0.00312 29.9 5.94e-197 652. 0.0873 0.0995
With multiple variables, marginal effects are also useful to see the effect of one predictor fixing the level of others. Let’s simulate a model with two predictors:
We can see the marginal effects (unit-level, averaged, or at specific values) for \(x1\) while fixing \(x2\) at its mean:
slopes(fit, newdata = data.frame(x1 = dat$x1, x2 = mean(dat$x2)))
#> # A tibble: 200 × 15
#> rowid term estimate std.error statistic p.value s.value conf.low conf.high
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 x1 0.0991 0.0624 1.59 0.112 3.16 -0.0232 0.221
#> 2 2 x1 0.191 0.195 0.979 0.327 1.61 -0.191 0.573
#> 3 3 x1 0.0805 0.0419 1.92 0.0546 4.20 -0.00157 0.163
#> 4 4 x1 0.0555 0.0227 2.45 0.0145 6.11 0.0110 0.100
#> 5 5 x1 0.0499 0.0204 2.45 0.0145 6.11 0.00990 0.0899
#> 6 6 x1 0.0998 0.0632 1.58 0.114 3.13 -0.0241 0.224
#> 7 7 x1 0.135 0.110 1.23 0.219 2.19 -0.0803 0.350
#> 8 8 x1 0.0605 0.0254 2.38 0.0174 5.84 0.0106 0.110
#> 9 9 x1 0.220 0.241 0.913 0.361 1.47 -0.252 0.691
#> 10 10 x1 0.152 0.134 1.13 0.258 1.95 -0.111 0.414
#> # ℹ 190 more rows
#> # ℹ 6 more variables: predicted_lo <dbl>, predicted_hi <dbl>, predicted <dbl>,
#> # x1 <dbl>, x2 <dbl>, y <int>
Sometimes it is useful to do what is called inverse estimation, that is, predicting the \(x\) level associated with a certain \(y\); in this case, the \(x\) producing a certain probability \(p\) of response.
Sometimes this is called median effective dose when finding the \(x\) level producing \(50\%\) of correct responses.
We can use the MASS::dose.p() function:
In Psychophysics it is common to use inverse estimation to estimate the detection or performance threshold, for example estimating the \(x\) level associated with a certain probability of success, e.g., 50%.
Let’s simulate an ideal observer that responds to stimuli with different contrast levels:
th <- 0.7 # -(b0/b1)
slope <- 1/8
b1 <- 1/slope
b0 <- -th*b1
x <- rep(seq(0, 1, 0.1), each = 20)
p <- plogis(b0 + b1 * x)
y <- rbinom(length(x), 1, p)
yp <- tapply(y, x, mean)
xc <- unique(x)
plot(xc, yp, type = "b", ylim = c(0, 1), xlab = "Contrast", ylab = "Probability of Detection")
curve(plogis(x, th, slope), add = TRUE, col = "firebrick", lwd = 2)
#>
#> Call:
#> glm(formula = y ~ x, family = binomial(link = "logit"))
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.0471 -0.6306 -0.2320 0.5124 2.6938
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.9927 0.6644 -7.515 5.69e-14 ***
#> x 6.9569 0.9410 7.393 1.44e-13 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 273.67 on 219 degrees of freedom
#> Residual deviance: 163.71 on 218 degrees of freedom
#> AIC: 167.71
#>
#> Number of Fisher Scoring iterations: 6
We can estimate the threshold and slope of the psychometric function using the equations from Knoblauch & Maloney (2012).
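A minimal sketch, assuming the model from the output above is stored in an object called fit_pf:

fit_pf <- glm(y ~ x, family = binomial(link = "logit"))
# threshold: the contrast giving p = 0.5, i.e., -(b0/b1); slope: 1/b1
-coef(fit_pf)[1] / coef(fit_pf)[2] # ~0.72, close to the true value of 0.7
1 / coef(fit_pf)[2]                # ~0.14, close to the true value of 1/8
MASS::dose.p(fit_pf, p = 0.5)      # the same threshold, with a standard error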
The Odds Ratio can be considered an effect size measure. We can transform the OR into other effect size measures (Borenstein et al., 2009; Sánchez-Meca et al., 2003).
\[ d = \log(OR) \frac{\sqrt{3}}{\pi} \]
# in R with the effectsize package
or <- 1.5
effectsize::logoddsratio_to_d(log(or))
#> [1] 0.2235446
# or
effectsize::oddsratio_to_d(or)
#> [1] 0.2235446
The basic approach when doing inference with GLMs is interpreting the Wald test of each model coefficient. The Wald test is calculated as follows:
\[ z = \frac{\beta_i - \beta_0}{\sigma_{\beta_i}} \]
Where \(\beta_i\) is the regression coefficient, \(\beta_0\) is the value under the null hypothesis (generally 0) and \(\sigma_{\beta_i}\) is the standard error. P-values and confidence intervals can be calculated based on a standard normal distribution.
Wald-type confidence interval (directly from the model summary), where \(\Phi^{-1}\) is the quantile function of the standard normal distribution (qnorm() in R):
\[ 95\%\ CI = \hat \beta \pm \Phi^{-1}(1 - \alpha/2) \; SE_{\beta} \]
(summ <- data.frame(summary(fit)$coefficients))
#> # A tibble: 2 × 4
#> Estimate Std..Error z.value Pr...z..
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.575 0.295 -1.95 0.0508
#> 2 1.42 0.427 3.33 0.000855
#> [1] -1.152824 2.124464
Profile likelihood confidence intervals (the default of confint() for GLMs) are computed a little differently and they are not always symmetric:
# profile likelihood, different from wald type
confint(fit)
#> 2.5 % 97.5 %
#> (Intercept) -1.1726621 -0.00947434
#> tv_shows 0.6034638 2.28247024
You can use the plot_param() function (quite overkill):
plot_param(fit, "tv_shows")
When calculating confidence intervals it is important to consider the link function. In the same way as we apply the inverse logit to the parameter value, we can also back-transform the confidence interval limits. IMPORTANT: do not apply the inverse logit to the standard error and then compute the confidence interval.
fits <- broom::tidy(fit) # extract parameters as dataframe
fits
#> # A tibble: 2 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -1.15 0.331 -3.48 0.000500
#> 2 tv_shows 1.73 0.443 3.90 0.0000967
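A minimal sketch: compute the Wald interval for the tv_shows coefficient on the logit scale and then exponentiate it to obtain a confidence interval for the odds ratio:

b1 <- fits$estimate[2]
se1 <- fits$std.error[2]
ci_logit <- b1 + qnorm(c(0.025, 0.975)) * se1
exp(ci_logit) # confidence interval for the odds ratio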
The same principle holds for predicted probabilities: first compute the intervals on the logit scale and then transform them back to the probability scale:
fits <- dat |>
select(tv_shows) |>
distinct() |>
add_predict(fit, se.fit = TRUE)
fits$p <- plogis(fits$fit)
fits$lower <- plogis(with(fits, fit - 2*se.fit))
fits$upper <- plogis(with(fits, fit + 2*se.fit))
fits
#> # A tibble: 2 × 7
#> tv_shows fit se.fit residual.scale p lower upper
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.575 0.295 1 0.640 0.497 0.762
#> 2 0 -1.15 0.331 1 0.240 0.140 0.380
With multiple predictors, especially with categorical variables with more than 2 levels, we can compute an ANOVA-like analysis identifying the effect of each predictor. In R we can do this using the car::Anova() function. Let's simulate a model with a 2x2 interaction:
#> # A tibble: 6 × 4
#> id x1 x2 y
#> <int> <fct> <fct> <int>
#> 1 1 a c 1
#> 2 2 a d 0
#> 3 3 b c 1
#> 4 4 b d 1
#> 5 5 a c 1
#> 6 6 a d 0
We can fit the most complex model containing the two main effects and the interaction. I set the contrasts for the two factors as contr.sum()/2, which is recommended for a proper analysis of factorial designs (see the sketch below):
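A minimal sketch of how such contrasts could be set, assuming x1 and x2 are factors in dat:

contrasts(dat$x1) <- contr.sum(2) / 2 # levels coded as 0.5 / -0.5
contrasts(dat$x2) <- contr.sum(2) / 2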
fit_max <- glm(y ~ x1 + x2 + x1:x2, data = dat, family = binomial(link = "logit")) # same as x1 * x2
summary(fit_max)
#>
#> Call:
#> glm(formula = y ~ x1 + x2 + x1:x2, family = binomial(link = "logit"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.1213 -0.9005 -0.5350 1.2346 2.0074
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.8864 0.2138 -4.147 3.37e-05 ***
#> x11 0.7921 0.4275 1.853 0.0639 .
#> x21 0.9462 0.4275 2.213 0.0269 *
#> x11:x21 -0.4649 0.8551 -0.544 0.5867
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 148.26 on 119 degrees of freedom
#> Residual deviance: 139.86 on 116 degrees of freedom
#> AIC: 147.86
#>
#> Number of Fisher Scoring iterations: 4
Then we use car::Anova(). For each effect (the rows below correspond to x1, x2 and the x1:x2 interaction) we have the \(\chi^2\) statistic and the associated p-value. The null hypothesis is that the specific term does not contribute to reducing the residual deviance.
car::Anova(fit_max)
#> # A tibble: 3 × 3
#> `LR Chisq` Df `Pr(>Chisq)`
#> <dbl> <dbl> <dbl>
#> 1 3.32 1 0.0683
#> 2 4.92 1 0.0266
#> 3 0.299 1 0.584
The table obtained with car::Anova() is essentially a set of model comparisons using the likelihood ratio test. This can also be done directly with the anova() function.
\[\begin{align*} D = 2 (log(\mathcal{L}_{full}) - log(\mathcal{L}_{reduced})) \\ D \sim \chi^2_{df_{full} - df_{reduced}} \end{align*}\]
To understand this better, the x2 effect reported in the car::Anova() table is a model comparison between a model with y ~ x1 + x2 and a model without x2. The difference between these two models is the unique contribution of x2 after controlling for x1:
fit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))
fit0 <- glm(y ~ x1, data = dat, family = binomial(link = "logit"))
anova(fit0, fit, test = "LRT") # ~ same as car::Anova(fit_max)
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 118 145. NA NA NA
#> 2 117 140. 1 4.92 0.0266
The model comparison using anova() (i.e., likelihood ratio tests) is limited to nested models, that is, models that differ only by the inclusion of some terms. For example:
fit1 and fit2 are non-nested because they have the same number of predictors (thus the same degrees of freedom). fit3 and fit1/fit2 are nested because fit3 is more complex and by removing one term we can obtain the less complex models (a sketch of the three models is below).
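A sketch of what the three models could look like, given the description above and the degrees of freedom in the outputs below (formulas assumed):

fit1 <- glm(y ~ x1, data = dat, family = binomial(link = "logit"))
fit2 <- glm(y ~ x2, data = dat, family = binomial(link = "logit"))
fit3 <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))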
anova(fit1, fit2, test = "LRT") # does not work properly, the models are not nested
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 118 145. NA NA NA
#> 2 118 143. 0 1.59 NA
anova(fit1, fit3, test = "LRT") # same anova(fit2, fit3)
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 118 145. NA NA NA
#> 2 117 140. 1 4.92 0.0266
As for standard linear models, we can use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to compare non-nested models. The downside is not having a proper hypothesis testing setup.
data.frame(BIC(fit1, fit2, fit3)) |>
arrange(BIC)
#> # A tibble: 3 × 2
#> df BIC
#> <dbl> <dbl>
#> 1 2 153.
#> 2 3 155.
#> 3 2 155.
data.frame(AIC(fit1, fit2, fit3)) |>
arrange(AIC)
#> # A tibble: 3 × 2
#> df AIC
#> <dbl> <dbl>
#> 1 3 146.
#> 2 2 147.
#> 3 2 149.
For post-hoc contrasts you can use the emmeans package, for both numerical and especially categorical predictors. Let's simulate a 2x2 interaction:
dat <- sim_design(100, cx = list(x1 = c("a", "b"), x2 = c("c", "d")), contrasts = contr.sum2)
dat <- sim_data(dat, plogis(qlogis(0.25) + 0 * x1_c + 2 * x2_c + 2 * x1_c * x2_c))
fit <- glm(y ~ x1 * x2, data = dat, family = binomial(link = "logit"))
summary(fit)
#>
#> Call:
#> glm(formula = y ~ x1 * x2, family = binomial(link = "logit"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.5096 -0.6866 -0.2468 0.8782 2.6482
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1449 0.1755 -6.524 6.86e-11 ***
#> x11 -0.4326 0.3510 -1.232 0.218
#> x21 2.5113 0.3510 7.155 8.37e-13 ***
#> x11:x21 3.4372 0.7020 4.896 9.76e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 502.99 on 399 degrees of freedom
#> Residual deviance: 386.90 on 396 degrees of freedom
#> AIC: 394.9
#>
#> Number of Fisher Scoring iterations: 6
emmeans is a very powerful and complex package (see the documentation). First, we can estimate the marginal means of the conditions:
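One possible call producing the output below (the exact specification is an assumption):

library(emmeans)
emmeans(fit, ~ x1 * x2)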
#> x1 x2 emmean SE df asymp.LCL asymp.UCL
#> a c 0.754 0.214 Inf 0.334 1.174
#> b c -0.532 0.207 Inf -0.938 -0.126
#> a d -3.476 0.586 Inf -4.625 -2.327
#> b d -1.325 0.246 Inf -1.806 -0.844
#>
#> Results are given on the logit (not the response) scale.
#> Confidence level used: 0.95
We can also calculate the marginal means on the response scale:
#> x1 x2 prob SE df asymp.LCL asymp.UCL
#> a c 0.68 0.0466 Inf 0.58264 0.7639
#> b c 0.37 0.0483 Inf 0.28127 0.4685
#> a d 0.03 0.0171 Inf 0.00971 0.0889
#> b d 0.21 0.0407 Inf 0.14111 0.3008
#>
#> Confidence level used: 0.95
#> Intervals are back-transformed from the logit scale
Crucially we can compute the contrasts (i.e., the post-hoc tests). Notice that they are computed on the link-function scale:
emmeans(fit, pairwise ~ x1 | x2)$contrasts
#> x2 = c:
#> contrast estimate SE df z.ratio p.value
#> a - b 1.29 0.298 Inf 4.314 <.0001
#>
#> x2 = d:
#> contrast estimate SE df z.ratio p.value
#> a - b -2.15 0.636 Inf -3.385 0.0007
#>
#> Results are given on the log odds ratio (not the response) scale.
emmeans(fit, pairwise ~ x1 | x2, type = "response")$contrasts
#> x2 = c:
#> contrast odds.ratio SE df null z.ratio p.value
#> a / b 3.618 1.0786 Inf 1 4.314 <.0001
#>
#> x2 = d:
#> contrast odds.ratio SE df null z.ratio p.value
#> a / b 0.116 0.0739 Inf 1 -3.385 0.0007
#>
#> Tests are performed on the log odds ratio scale
Notice that the order of the terms changes the actual type of contrasts:
emmeans(fit, pairwise ~ x2 | x1)$contrasts
#> x1 = a:
#> contrast estimate SE df z.ratio p.value
#> c - d 4.230 0.624 Inf 6.777 <.0001
#>
#> x1 = b:
#> contrast estimate SE df z.ratio p.value
#> c - d 0.793 0.321 Inf 2.468 0.0136
#>
#> Results are given on the log odds ratio (not the response) scale.
emmeans(fit, pairwise ~ x2 | x1, type = "response")$contrasts
#> x1 = a:
#> contrast odds.ratio SE df null z.ratio p.value
#> c / d 68.71 42.89 Inf 1 6.777 <.0001
#>
#> x1 = b:
#> contrast odds.ratio SE df null z.ratio p.value
#> c / d 2.21 0.71 Inf 1 2.468 0.0136
#>
#> Tests are performed on the log odds ratio scale
When plotting a binomial GLM the most useful approach is plotting the marginal probabilities and standard errors/confidence intervals for a given combination of predictors. Let's make an example.
A general workflow could be:
- create a grid of predictor values on which to compute predictions
- use the predict() function, giving the grid of values on which to compute predictions (and standard errors) on the link scale
- back-transform predictions and intervals to the probability scale and plot them
Everything can be simplified using some packages that perform each step and return a plot:
- allEffects() from the effects package (returns a base R plot)
- ggeffect() from the ggeffects package (returns a ggplot2 object)
- plot_model() from the sjPlot package (similar to ggeffect())
In this situation we can just plot the marginal probabilities for each level of the categorical predictor. Plotting our exam dataset:
allEffects()
ggeffect()/plot_model()
Sometimes it could be useful to plot the estimated sampling distribution of a coefficient. For example, we can plot the tv_shows effect from our example. I've written the plot_param() function that directly creates a basic plot given the model and the coefficient name. The plot highlights the null value and the 95% Wald confidence interval.
Diagnostics for GLMs are similar to those for standard linear models. Some areas are more complicated, for example residual analysis and goodness of fit. We will see:
The likelihood is the joint probability of the observed data viewed as a function of the parameters.
\[ \log \mathcal{L}(\mu|x) = \sum^n_{i = 1} \log \mathcal{L}(\mu|x_i) \]
x <- rnorm(10) # data from normal distribution
# data
x
#> [1] -1.8694218 0.1218114 -1.3832446 0.9884026 -0.4416857 -0.3099413
#> [7] -0.8407464 -1.8613016 0.2581889 -0.1618937
# the model is a normal distribution with mu = 0 and sd = 1
dnorm(x, 0, 1)
#> [1] 0.06950842 0.39599347 0.15325971 0.24477683 0.36186586 0.38023327
#> [7] 0.28016802 0.07056927 0.38586439 0.39374833
# log likelihood
sum(log(dnorm(x, 0, 1)))
#> [1] -14.66699
By summing the logarithms of all the red segments (the density of each observation) we obtain the log likelihood of the model given the observed data.
In the previous slide we tried only 3 values for the mean. Let's imagine calculating the log likelihood for several different means. The value associated with the highest likelihood is called the maximum likelihood estimate. In fact, \(\mu = 0\) is associated with the highest likelihood.
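A minimal sketch of this grid search over candidate means, using the x vector simulated above:

mus <- seq(-2, 2, 0.1)
ll <- sapply(mus, function(m) sum(dnorm(x, mean = m, sd = 1, log = TRUE)))
mus[which.max(ll)] # close to mean(x), the maximum likelihood estimate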
The (residual) deviance in the context of GLM can be considered as the distance between the current model and a perfect model (often called saturated model).
\[ D_{res} = -2[\log(\mathcal{L}_{current}) - (\log\mathcal{L}_{sat})] \]
Where \(\mathcal{L}\) is the likelihood of the considered model (see the previous slides). Clearly, the lower the deviance, the closer the current model is to the perfect model, suggesting a good fit.
The saturated model is a model where each observation \(x_i\) has its own parameter.
Another important quantity is the null deviance that is expressed as the distance between the null model and the saturated model.
\[ D_{null} = -2[\log(\mathcal{L}_{null}) - (\log\mathcal{L}_{sat})] \]
The null deviance can be interpreted as the maximal deviance because it is estimated using a model without predictors. A good model will have a residual deviance lower than the null deviance.
When we perform a likelihood ratio test (see previous slides) we are essentially comparing the residual deviance (or the likelihood) of two models.
For the null model, the residual deviance is clearly the same as the null deviance because we are not using predictors to reduce the deviance.
summary(fit_null)
#>
#> Call:
#> glm(formula = y ~ 1, family = binomial(link = "logit"), data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.044 -1.044 -1.044 1.317 1.317
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.3228 0.2865 -1.126 0.26
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 68.029 on 49 degrees of freedom
#> Residual deviance: 68.029 on 49 degrees of freedom
#> AIC: 70.029
#>
#> Number of Fisher Scoring iterations: 4
If the predictor x is useful in explaining y, the residual deviance will be reduced compared to the null (overall) deviance.
summary(fit_current)
#>
#> Call:
#> glm(formula = y ~ x, family = binomial(link = "logit"), data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.9121 -0.4071 -0.2063 0.6259 1.7336
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -5.045 1.470 -3.432 0.000599 ***
#> x 9.535 2.659 3.586 0.000336 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 68.029 on 49 degrees of freedom
#> Residual deviance: 36.745 on 48 degrees of freedom
#> AIC: 40.745
#>
#> Number of Fisher Scoring iterations: 6
Comparing the two models we can understand if the deviance reduction can be considered statistically significant.
anova(fit_null, fit_current, test = "LRT")
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 49 68.0 NA NA NA
#> 2 48 36.7 1 31.3 0.0000000223
Notice that the difference between the two residual deviances is the test statistic, which is distributed as a \(\chi^2\) with \(df = 1\).
Let’s fit the three models:
We can calculate the residual deviance:
We can calculate the null deviance:
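A minimal sketch of these computations by hand for a Bernoulli GLM, using the fit_current object from above (for Bernoulli data the saturated log-likelihood is 0):

p_hat <- fitted(fit_current)
dev_res <- -2 * sum(dbinom(fit_current$y, 1, p_hat, log = TRUE))   # residual deviance
p_null <- mean(fit_current$y)
dev_null <- -2 * sum(dbinom(fit_current$y, 1, p_null, log = TRUE)) # null deviance
c(dev_res, dev_null) # should match deviance(fit_current) and fit_current$null.deviance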
Compared to the standard linear regression, there are multiple ways to calculate an \(R^2\) like measure for GLMs and there is no consensus about the most appropriate method. There are some useful resources:
Note that some measures are specific to the binomial GLM while other measures can also be applied to other GLMs (e.g., the Poisson).
We will see:
The McFadden's pseudo-\(R^2\) computes one minus the ratio between the log-likelihood of the current model and the log-likelihood of the intercept-only (i.e., null) model (McFadden, 1987):
\[ R^2 = 1 - \frac{\log(\mathcal{L_{current}})}{\log(\mathcal{L_{null}})} \]
There is also the adjusted version that takes into account the number of parameters of the model. In R it can be computed manually or using performance::r2_mcfadden():
performance::r2_mcfadden(fit)
#> # R2 for Generalized Linear Regression
#> R2: 0.034
#> adj. R2: 0.018
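A sketch of the manual computation, refitting an assumed intercept-only model with update():

fit_int <- update(fit, . ~ 1) # intercept-only (null) model
1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit_int)) # McFadden's pseudo-R2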
This measure is the easiest to interpret and calculate but can only be applied to binary (Bernoulli) models (Tjur, 2009). It is the absolute value of the difference between the average predicted probability for observations with \(y = 1\) and the average predicted probability for observations with \(y = 0\):
\[\begin{align*} \bar{\pi}_1 = \text{mean}(\hat p_i \mid y_i = 1) \\ \bar{\pi}_0 = \text{mean}(\hat p_i \mid y_i = 0) \\ R^2 = |\bar{\pi}_1 - \bar{\pi}_0| \end{align*}\]
performance::r2_tjur(fit2)
#> Tjur's R2
#> 0.5654064
The main problem of GLM is that the mean and the variance are linked. For example, the mean of a binomial distribution is \(np\) and the variance is \(np(1-p)\). In the standard linear model we have two parameters, \(\mu\) and the residual \(\sigma\) that are independent.
When using a linear model (thus assuming a constant \(\sigma\) and independence between mean and variance) the residuals pattern is very different:
As for standard linear models there are different types of residuals:
Raw residuals, also called response residuals are the simplest type of residuals. They are calculated as in standard regression as:
\[\begin{align*} r_i = y_i - \hat y_i \end{align*}\]
Where \(\hat y_i\) are the fitted values after applying the inverse of the link function (i.e., the predicted probabilities).
In R:
# equivalent to residuals(fit, type = "response")
ri <- fit$y - fitted(fit)
ri[1:5]
#> 1 2 3 4 5
#> 0.56826521 0.18039223 0.07097155 0.56848124 0.04399144
The problem is that in GLMs the mean and the variance of the distribution are not independent, creating problems in residual analysis.
We can use the function car::residualPlot()
to plot different type of residuals against fitted values:
car::residualPlot(fit, type = "response")
This plot1 shows an example with the same residual for two different \(x\) values on a Poisson GLM. Beyond the model itself, the same residual can be considered as extreme for low \(x\) values and plausible for high \(x\) values:
Gelman and colleagues (Gelman & Hill, 2006) proposed a type of residuals called binned residuals to solve the problem shown in the previous plot for Binomial GLMs:
We can use the arm::binnedplot()
function to automatically create and plot the binned residuals:
Pearson residuals are raw residuals divided by the standard deviation of each residual. Given the mean-variance relationship of GLMs, dividing by the standard deviation partially removes this relationship. The denominator can be calculated just using the appropriate variance formula for the specific GLM.
\[ r_i = \frac{y_i - \hat y_i}{\sqrt{V(\hat y_i)}} \]
We can see the difference for a Poisson model1 when using raw vs pearson residuals. The non-constant variance is controlled on the right.
For the Binomial (Bernoulli) GLM, the variance is calculated as \(\hat p(1 - \hat p)\) where \(\hat p\) is the predicted probability (the fitted value).
\[ r_i = \frac{y_i - \hat y_i}{\sqrt{\hat y_i(1 - \hat y_i)}} \]
Deviance residuals are based on the residual deviance that we defined before. In fact, the residual deviance is just the sum of the squared deviance residuals.
\[ r_i = \text{sign}(y_i - \hat y_i) \sqrt{-2\left[\log \mathcal{L}(y_i)_{current} - \log \mathcal{L}(y_i)_{saturated}\right]} \] Where \(\text{sign}\) is the sign of the raw residual \(i\). For the Bernoulli model, \(\log \mathcal{L}(y_i)_{saturated}\) is always 0.
To demonstrate how deviance residuals work, we can compute them manually in R (see the sketch below):
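A sketch of a computation consistent with the outputs below (object names are assumptions; the first printout suggests the per-observation saturated log-likelihood, which is always 0 for Bernoulli data):

yi <- fit$y         # observed 0/1 responses
yhat <- fitted(fit) # predicted probabilities
ll_sat <- log(dbinom(yi, 1, yi)) # saturated log-likelihood of each observation (all 0)
ll_sat
ri_dev <- sign(yi - yhat) * sqrt(-2 * (dbinom(yi, 1, yhat, log = TRUE) - ll_sat))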
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
#> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ri_dev[1:5]
#> 1 2 3 4 5
#> -0.2370588 0.3033530 -1.0310244 0.2331303 -0.8429485
residuals(fit, type = "deviance")[1:5]
#> 1 2 3 4 5
#> -0.2370588 0.3033530 -1.0310244 0.2331303 -0.8429485
The hat matrix \(H = X \left(X^{\top} X \right)^{-1} X^{\top}\) is an \(n \times n\) matrix where \(n\) is the number of observations. The diagonal of the \(H\) matrix contains the hat values or leverages.
The \(i^{th}\) leverage score (\(h_{ii}\)) is interpreted as the weighted distance between \(x_i\) and the mean of the \(x_i\)'s. In practical terms, it is how much the \(i^{th}\) observed value influences the \(i^{th}\) fitted value. A high leverage suggests that the observation is far from the mean of the predictors and has a high influence on the fitted values.
For a simple linear regression (\(y \sim x\)) the hat values are calculated as:
\[ h_i = \frac{1}{n} + \frac{(X_i - \bar X)^2}{\sum^n_{j = 1}(X_j - \bar X)^2} \]
In R the function hatvalues() returns the diagonal of the \(H\) matrix for glm and lm objects:
hatvalues(fit)[1:10]
#> 1 2 3 4 5 6 7
#> 0.02874015 0.03492581 0.03968557 0.02782301 0.04357230 0.04652508 0.04435804
#> 8 9 10
#> 0.04012613 0.03948651 0.04743780
Note that for GLMs and multiple regression in general the equation is different and more complicated, but the interpretation and the R implementation are the same.
The response, Pearson and deviance residuals can all be considered raw (unstandardized) residuals. We can standardize residuals by dividing them by the standard error computed with the hat values. In this way, the distribution will be approximately normal with \(\mu = 0\) and \(\sigma = 1\).
\[ r_{s_i} = \frac{r_i}{\sqrt{(1 - \hat h_{ii})}} \]
Where \(r_i\) can be response, Pearson or deviance residuals. In R they can be extracted using rstandard():
We can try to manually calculate the residuals to better understand.
yhat <- fitted(fit) # fitted
yi <- fit$y # observed
hi <- hatvalues(fit) # diagonal of the hat matrix
# pearson residuals
pi <- (yi - yhat) / sqrt(yhat * (1 - yhat))
# standardized
pis <- pi / sqrt(1 - hi)
pis[1:5]
#> 1 2 3 4 5
#> -0.1712897 0.2208858 -0.8546823 0.1683326 -0.6678440
rstandard(fit, type = "pearson")[1:5]
#> 1 2 3 4 5
#> -0.1712897 0.2208858 -0.8546823 0.1683326 -0.6678440
# deviance (for binomial GLM the loglik of the saturated model is 0)
di <- sign(yi - yhat) * sqrt(-2*log(dbinom(yi, 1, yhat)))
# standardized
dis <- di / sqrt(1 - hi)
dis[1:5]
#> 1 2 3 4 5
#> -0.2405406 0.3087934 -1.0521126 0.2364427 -0.8619359
rstandard(fit, type = "deviance")[1:5]
#> 1 2 3 4 5
#> -0.2405406 0.3087934 -1.0521126 0.2364427 -0.8619359
Quantile residuals are another proposal for residual analysis. The idea is to map the quantile of the cumulative density function (CDF) of the random component into the CDF of the normal distribution.
Quantile residuals are very useful, especially for discrete GLMs (binomial and Poisson), and, when model assumptions hold, they are exactly normally distributed, in contrast to deviance and Pearson residuals (Dunn & Smyth, 1996). They can be calculated using the statmod::qresid(fit) function. The authors suggest running the function 4 times to disentangle the randomization from the systematic component.
Residuals for GLMs are complicated. To have a more detailed and clear overview see:
The error rate (ER) is defined as the proportion of cases for which the deterministic prediction, i.e., guessing \(y_i = 1\) if \(logit^{-1}(\hat y_i) > 0.5\) and guessing \(y_i = 0\) if \(logit^{-1}(\hat y_i) < 0.5\), is wrong. Clearly, \(1 - ER\) is the classification accuracy.
I wrote the error_rate() function that simply computes the error rate of a given model:
error_rate(fit)
#> [1] 0.14
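A hypothetical sketch of what such a function could compute (not necessarily the author's implementation):

error_rate_sketch <- function(fit) {
  y <- fit$y                                 # observed 0/1 responses
  y_guess <- ifelse(fitted(fit) > 0.5, 1, 0) # deterministic classification
  mean(y != y_guess)                         # proportion of wrong guesses
}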
We could compare the error rate of a given model with the error rate of the null model or another similar model (with a model comparison approach):
fit0 <- update(fit, . ~ . - x) # removing the x predictor, now an intercept-only model
error_rate(fit0)
#> [1] 0.48
# the error rate of the null model is ... greater/less than the actual model
er_ratio = error_rate(fit0)/error_rate(fit)
The error rate of the null model is 3.429 times greater than the actual model.
For a given model, the error rate should be less than \(0.5\), otherwise setting all \(\beta\)'s to 0 (i.e., the null model) would be considered a better model (Gelman et al., 2020).
For example, if the average \(p\) in a dataset is \(\overline p = 0.3\), a null model (always guessing \(y = 0\)) would have an error rate of \(0.3\) and a classification accuracy of \(1 - \overline p = 0.7\).
By including predictors we aim to reduce the error rate (compared to the null model) because the ability to correctly classify an observation increases.
The error rate can be misleading, see Gelman et al. (2020) (pp. 255-256)
Identification of influential observations and outliers in GLMs is very similar to standard regression models. We will briefly see:
The Cook distance of an observation \(i\) measures the impact of that observation on the overall model fit. If removing observation \(i\) has a high impact, observation \(i\) is likely an influential observation. For GLMs it is defined as:
\[\begin{align*} D_i = \frac{r_i^2}{\phi p} \frac{h_{ii}}{1 - h_{ii}} \end{align*}\]
Where \(p\) is the number of model parameters, \(r_i\) are the standardized Pearson residuals (rstandard(fit, type = "pearson")) and \(h_{ii}\) are the hat values (leverages). \(\phi\) is the dispersion parameter of the GLM, which for binomial and Poisson models is fixed to 1 (see Dunn (2018, Table 5.1)). Usually an observation is considered influential if \(D_i > \frac{4}{n}\) where \(n\) is the sample size.
DFBETAs measure the impact of the observation \(i\) on the estimated parameter \(\beta_j\):
\[\begin{align*} DFBETAS_i = \frac{\beta_j - \beta_{j(i))}}{\sigma_{\beta_{j(i)}}} \end{align*}\]
Where \((i)\) denotes the parameter and standard error estimated from a model fitted without the \(i^{th}\) observation. Usually an observation is considered influential if \(|DFBETAS_{i}| > \frac{2}{\sqrt{n}}\) where \(n\) is the sample size.
In R we can use the influence.measures() function to calculate each influence measure explained before.
fit <- update(fit, data = fit$model[1:50, ])
infl <- influence.measures(fit)$infmat
head(infl)
#> dfb.1_ dfb.x dffit cov.r cook.d hat
#> 1 -0.04419414 0.04109800 -0.04442749 1.070827 0.0004340963 0.02874015
#> 2 -0.04672053 0.05681844 0.06310261 1.075710 0.0008828590 0.03492581
#> 3 -0.11865551 0.05713533 -0.23265873 1.028011 0.0150937972 0.03968557
#> 4 -0.03348170 0.03972704 0.04294688 1.069920 0.0004054759 0.02782301
#> 5 -0.14588644 0.10150547 -0.19921442 1.051194 0.0101596403 0.04357230
#> 6 -0.07575414 0.10700698 0.14303484 1.074641 0.0048052405 0.04652508
The first two columns are the DFBETAs for each parameter, the cook.d column contains the Cook distances and hat is the diagonal of the \(H\) matrix.
I wrote the cook_plot() function to easily plot the Cook distances along with the identification of influential observations:
cook_plot(fit)
I wrote the dfbeta_plot() function to easily plot the DFBETAs along with the identification of influential observations:
dfbeta_plot(fit)
There are several practical differences between binomial and binary models:
The most basic Binomial regression has a vector of binary \(y\) values and a continuous or categorical predictor. Let's see a common data structure in this case:
#> # A tibble: 9 × 3
#> id x y
#> <chr> <chr> <chr>
#> 1 1 0.85 1
#> 2 2 0.16 0
#> 3 3 0.08 0
#> 4 4 0.17 0
#> 5 ... ... ...
#> 6 27 0.52 1
#> 7 28 0.64 1
#> 8 29 0.11 0
#> 9 30 0.54 1
Or equivalently with a categorical variable:
n <- 15
x <- c("a", "b")
dat <- sim_design(n, cx = list(x = x))
b0 <- qlogis(0.4)
b1 <- log(odds_ratio(0.7, 0.4))
dat |>
sim_data(plogis(b0 + b1*x_c), "binomial") |>
dplyr::select(-x_c, -lp) |>
filor::trim_df(4)
#> # A tibble: 9 × 3
#> id x y
#> <chr> <chr> <chr>
#> 1 1 a 1
#> 2 2 b 0
#> 3 3 a 0
#> 4 4 b 0
#> 5 ... ... ...
#> 6 27 a 0
#> 7 28 b 1
#> 8 29 a 0
#> 9 30 b 0
When using a Binomial data structure we count the number of successes for each level of \(x\). nc is the number of 1 responses and nf is the number of 0 responses out of nt trials:
#> # A tibble: 9 × 4
#> x nc nf nt
#> <chr> <chr> <chr> <chr>
#> 1 0 0 10 10
#> 2 0.05 0 10 10
#> 3 0.1 1 9 10
#> 4 0.15 0 10 10
#> 5 ... ... ... ...
#> 6 0.85 9 1 10
#> 7 0.9 8 2 10
#> 8 0.95 9 1 10
#> 9 1 10 0 10
With a categorical variable we have essentially a contingency table:
Clearly, expanding or aggregating data is the way to convert a binary into a binomial data structure and vice versa:
# from binomial to binary
bin_to_binary(datc, nc, nt) |>
select(y, x) |>
filor::trim_df()
#> # A tibble: 9 × 2
#> y x
#> <chr> <chr>
#> 1 1 a
#> 2 1 a
#> 3 1 a
#> 4 1 a
#> 5 ... ...
#> 6 0 b
#> 7 0 b
#> 8 0 b
#> 9 0 b
# from binary to binomial
bin_to_binary(datc, nc, nt) |>
binary_to_bin(y, x)
#> # A tibble: 2 × 4
#> x nc nf nt
#> <chr> <dbl> <dbl> <int>
#> 1 a 5 5 10
#> 2 b 4 6 10
Clearly, in the previous examples, each row corresponds to a single observation (in the binary model) or to a condition (in the binomial model). Furthermore, each row/observation/participant is assumed to be independent.
If participants performed multiple trials for each condition, we can still use a binomial or binary model BUT we need to take into account the multilevel structure.
In practice, we need to use a mixed-effects GLM, for example with the lme4::glmer() function, specifying the appropriate random structure.
There is a single way to implement a binary model in R and two ways for the binomial version:
# binary regression
glm(y ~ x, family = binomial(link = "logit"))
# binomial with cbind syntax, nc = number of 1s, nf = number of 0s, nc + nf = nt
glm(cbind(nc, nf) ~ x, family = binomial(link = "logit"))
# binomial with proportions and weights, equivalent to the cbind approach, nt is the total trials
glm(nc/nt ~ x, weights = nt, family = binomial(link = "logit"))
A more relevant difference is about the residual analysis. The binary regression has different residuals compared to the binomial model fitted on the same dataset1.
The residual deviance is also different. In fact, there is more residual deviance on the binary compared to the binomial model. However, comparing two binary and binomial models actually leads to the same conclusion. In other terms the deviance seems to be on a different scale1:
In fact, if we compare the full model with the null model in both binary and binomial version, the LRT is the same but the deviance is on a different scale. Comparing a binary with a binomial model could be completely misleading.
anova(fit_binomial, fit0_binomial, test = "Chisq")
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 19 14.4 NA NA NA
#> 2 20 148. -1 -133. 8.27e-31
anova(fit_binary, fit0_binary, test = "Chisq")
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 208 156. NA NA NA
#> 2 209 289. -1 -133. 8.27e-31
This point is less relevant in this course but important in general. Usually, binary regression is used when the predictor is at the trial level whereas binomial regression is used when the predictor is at the participant level. When both levels are of interest one should use a mixed-effects model where both levels can be modeled.
The probability of correct responses during an exam as a function of the question difficulty
#> # A tibble: 9 × 4
#> id question difficulty y
#> <chr> <chr> <chr> <chr>
#> 1 1 1 1 1
#> 2 1 2 3 1
#> 3 1 3 3 1
#> 4 1 4 1 0
#> 5 ... ... ... ...
#> 6 1 27 1 1
#> 7 1 28 5 1
#> 8 1 29 1 0
#> 9 1 30 1 1
The probability of passing the exam as a function of the high-school background
#> # A tibble: 4 × 4
#> background nc nf nt
#> <chr> <chr> <chr> <chr>
#> 1 math 20 10 30
#> 2 chemistry 16 4 20
#> 3 art 4 6 10
#> 4 sport 15 5 20
Or the binary version:
#> # A tibble: 9 × 5
#> y background nc nf nt
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 math 20 10 30
#> 2 1 math 20 10 30
#> 3 1 math 20 10 30
#> 4 1 math 20 10 30
#> 5 ... ... ... ... ...
#> 6 0 sport 15 5 20
#> 7 0 sport 15 5 20
#> 8 0 sport 15 5 20
#> 9 0 sport 15 5 20
When using the probit link, the parameters are interpreted as differences in z-scores associated with a unit increase in the predictors. In fact, probabilities are mapped into z-scores using the inverse of the cumulative normal distribution (qnorm() in R).
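For example, a minimal sketch refitting the teddy model with a probit link (same data and variable names as above):

fit_probit <- glm(Depression_pp01 ~ Parental_stress, data = teddy,
                  family = binomial(link = "probit"))
coef(fit_probit) # the slope is the difference in z-scores for a unit increase in stress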