Filippo Gambarota
University of Padova
2023
Last Update: 2023-11-29
We want to measure the impact of watching tv-shows on the probability of passing the statistics exam.
exam
: passing the exam (1 = “passed”, 0 = “failed”)

tv_shows
: watching tv-shows regularly (1 = “yes”, 0 = “no”)

#> # A tibble: 9 × 2
#> tv_shows exam
#> <chr> <chr>
#> 1 1 0
#> 2 1 0
#> 3 1 1
#> 4 1 1
#> 5 ... ...
#> 6 0 0
#> 7 0 0
#> 8 0 1
#> 9 0 1
We can create the contingency table:
xtabs(~exam + tv_shows, data = dat) |>
addmargins()
#> tv_shows
#> exam 0 1 Sum
#> 0 35 22 57
#> 1 15 28 43
#> Sum 50 50 100
Each cell probability \(p_{ij}\) is computed as \(n_{ij}/n\), i.e., the cell count divided by the total number of observations \(n\):
(xtabs(~exam + tv_shows, data = dat)/n) |>
addmargins()
#> tv_shows
#> exam 0 1 Sum
#> 0 0.35 0.22 0.57
#> 1 0.15 0.28 0.43
#> Sum 0.50 0.50 1.00
The most common way to analyze a 2x2 contingency table is using the odds ratio (OR). First, let’s define the odds of success as:
\[\begin{align*} odds = \frac{p}{1 - p} \\ p = \frac{odds}{odds + 1} \end{align*}\]
For the exam example, the odds of passing can be computed separately for the two tv_shows groups (see the sketch below the OR formula).
The OR is a ratio of odds:
\[ OR = \frac{\frac{p_1}{1 - p_1}}{\frac{p_2}{1 - p_2}} \]
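As a minimal sketch, we can compute the two odds and their ratio directly from the counts in the contingency table above:

p1 <- 28 / 50           # p(pass | tv_shows = 1)
p0 <- 15 / 50           # p(pass | tv_shows = 0)
odds1 <- p1 / (1 - p1)  # odds of passing when watching tv-shows
odds0 <- p0 / (1 - p0)  # odds of passing when not watching tv-shows
odds1 / odds0           # odds ratio, ~2.97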
The odds have an interesting property when taking the logarithm: we can express a probability \(p\) on a scale ranging over \((-\infty, +\infty)\).
We consider a study conducted by the University of Padova (TEDDY Child Study, 2020). Within the study, researchers asked the participants (mothers of a young child) about the presence of post-partum depression and measured parental stress using the PSI (Parenting Stress Index).
ID | Parental_stress | Depression_pp |
---|---|---|
1 | 75 | No |
2 | 51 | No |
3 | 76 | No |
4 | 88 | No |
... | ... | ... |
376 | 67 | No |
377 | 71 | No |
378 | 63 | No |
379 | 70 | No |
We want to see if parental stress increases the probability of having post-partum depression:
Let’s start by fitting a linear model Depression_pp ~ Parental_stress. We consider “Yes” as 1 and “No” as 0.
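A sketch of the code that could produce the output below; the 0/1 recoding into Depression_pp01 and the object name fit_lm are assumptions:

teddy$Depression_pp01 <- ifelse(teddy$Depression_pp == "Yes", 1, 0) # assumed recoding
fit_lm <- lm(Depression_pp01 ~ Parental_stress, data = teddy)
summary(fit_lm)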
#>
#> Call:
#> lm(formula = Depression_pp01 ~ Parental_stress, data = teddy)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.42473 -0.13768 -0.10003 -0.05768 0.94702
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.172900 0.077561 -2.229 0.026389 *
#> Parental_stress 0.004706 0.001201 3.919 0.000105 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3239 on 377 degrees of freedom
#> Multiple R-squared: 0.03915, Adjusted R-squared: 0.0366
#> F-statistic: 15.36 on 1 and 377 DF, p-value: 0.0001054
Let’s add the fitted line to our plot:
… and check the residuals, pretty bad right?
As for the exam example, we could compute a sort of contingency table, even though Parental_stress is a numerical variable, by creating some discrete categories (just for exploratory analysis):
table(teddy$Depression_pp, teddy$Parental_stress_c) |>
prop.table(margin = 2) |>
round(2)
#>
#> < 40 40-60 60-80 80-100 > 100
#> No 0.92 0.87 0.79 0.60
#> Yes 0.08 0.13 0.21 0.40
Ideally, we could compute the increase in the odds of having post-partum depression as parental stress increases. In fact, as we are going to see, the Binomial GLM is able to estimate the non-linear increase in the probability.
The logit link is the most common link function when using a binomial GLM:
\[ log \left(\frac{p}{1 - p}\right) = \beta_0 + \beta_{1}X_{1} + ...\beta_pX_p \]
The inverse of the logit maps the linear predictor back into the \([0, 1]\) probability range:
\[ p = \frac{e^{\beta_0 + \beta_{1}X_{1} + ...\beta_pX_p}}{1 + e^{\beta_0 + \beta_{1}X_{1} + ...\beta_pX_p}} \]
Thus, with a single numerical predictor \(x\), the relationship between \(x\) and \(p\) is non-linear on the probability scale but linear on the logit scale.
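A minimal sketch (with made-up coefficients) showing the same linear predictor on the two scales:

b0 <- -2; b1 <- 0.5   # hypothetical coefficients
xs <- seq(-5, 10, 0.1)
eta <- b0 + b1 * xs   # linear on the logit scale
p <- plogis(eta)      # S-shaped on the probability scale
op <- par(mfrow = c(1, 2))
plot(xs, eta, type = "l", xlab = "x", ylab = "logit(p)")
plot(xs, p, type = "l", xlab = "x", ylab = "p")
par(op)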
The problem is that the effects are non-linear on the probability scale, thus it is more difficult to interpret and report model results.
We can model the contingency table presented before. We put data in binary form:
#> tv_shows
#> exam 0 1
#> 0 35 22
#> 1 15 28
#> # A tibble: 9 × 2
#> tv_shows exam
#> <chr> <chr>
#> 1 1 0
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 ... ...
#> 6 0 0
#> 7 0 1
#> 8 0 0
#> 9 0 0
Let’s start from the simplest model (often called null model) where there are no predictors:
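A sketch of the call reproducing the output below (the object name fit0 is an assumption):

fit0 <- glm(exam ~ 1, data = dat, family = binomial(link = "logit"))
summary(fit0)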
#>
#> Call:
#> glm(formula = exam ~ 1, family = binomial(link = "logit"), data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.060 -1.060 -1.060 1.299 1.299
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.2819 0.2020 -1.395 0.163
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 136.66 on 99 degrees of freedom
#> Residual deviance: 136.66 on 99 degrees of freedom
#> AIC: 138.66
#>
#> Number of Fisher Scoring iterations: 4
When fitting an intercept-only model, the parameter is the average value of the y variable:
\[\begin{align*} \log(\frac{p}{1 - p}) = \beta_0 \\ p = \frac{e^{\beta_0}}{1 + e^{\beta_0}} \end{align*}\]
In R, the \(logit(p)\) is computed using qlogis(), i.e., the q + logis combination following R's naming scheme for probability distribution functions. The \(logit^{-1}\), i.e., the inverse of the logit function, is plogis():
If you are not sure about how to transform using the link function, you can directly access the family object in R, which contains the appropriate link function and the corresponding inverse:
bin <- binomial(link = "logit")
bin$linkfun(0.7)          # the same as qlogis(0.7)
bin$linkinv(qlogis(0.7))  # the same as plogis(), back to 0.7
Now we can add the tv_shows predictor, so the model has two coefficients. Given that tv_shows is a binary variable, the intercept is the average y when tv_shows is 0, and the tv_shows coefficient is the increase in y for a unit increase in tv_shows:
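A sketch of the call reproducing the output below (the object name fit_tv is an assumption):

fit_tv <- glm(exam ~ tv_shows, data = dat, family = binomial(link = "logit"))
summary(fit_tv)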
#>
#> Call:
#> glm(formula = exam ~ tv_shows, family = binomial(link = "logit"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2814 -0.8446 -0.8446 1.0769 1.5518
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.8473 0.3086 -2.746 0.00604 **
#> tv_shows 1.0885 0.4200 2.592 0.00956 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 136.66 on 99 degrees of freedom
#> Residual deviance: 129.68 on 98 degrees of freedom
#> AIC: 133.68
#>
#> Number of Fisher Scoring iterations: 4
Thinking about our data, the (Intercept) is (on the logit scale) the probability of passing the exam without watching tv-shows. The tv_shows coefficient should be the difference in the probability of passing the exam between people who watched or did not watch tv-shows, BUT the coefficients are on the logit (log odds) scale. Taking the exponential of the tv_shows coefficient we obtain the ratio of the odds of passing the exam when watching vs not watching tv-shows. Do you remember something? The exponentiated tv_shows coefficient is exactly the odds ratio that we calculated on the contingency table:
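A minimal sketch of this check, using the fit_tv object assumed above and the counts from the contingency table:

exp(coef(fit_tv)["tv_shows"])  # exp(1.0885), ~2.97
(28 / 22) / (15 / 35)          # odds ratio computed from the table, ~2.97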
Given the non-linearity and the link function, parameter interpretation is not easy for GLMs. In the case of the Binomial GLM we will see:
Model coefficients are interpreted in the same way as in standard regression models. The big difference concerns:
Using contrast coding and centering/standardizing we can make model coefficients easier to interpret.
When we use a categorical predictor with \(p\) levels, the model will estimate \(p - 1\) parameters. The interpretation of these parameters is controlled by the contrast coding. In R the default is the treatment coding (or dummy coding). Essentially R creates \(p - 1\) dummy variables (0 and 1) where 0 is the reference level (usually the first category) and 1 is the current level. We can see the coding scheme using the model.matrix() function that returns the \(\boldsymbol{X}\) matrix:
#> # A tibble: 9 × 2
#> X.Intercept. tv_shows
#> <chr> <chr>
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 ... ...
#> 6 1 0
#> 7 1 0
#> 8 1 0
#> 9 1 0
In the simple case of the exam dataset, the intercept (\(\beta_0\)) is the value at the reference level (by default the first level, coded as 0) and \(\beta_1\) is the difference between the other level and the reference level. If we change the order of the levels we change the intercept value while \(\beta_1\) stays the same (up to its sign). As an example we could use the so-called sum-to-zero coding where instead of assigning 0 and 1 we use different values. For example, assigning -0.5 and 0.5 will make the intercept the grand mean:
With numerical predictors the idea is the same as for categorical predictors. In fact, categorical predictors are converted into numbers (i.e., 0 and 1, or -0.5 and 0.5). The only caveat is that the effects are linear only on the logit scale. Thus \(\beta_1\) is interpreted in the same way as in standard linear models only on the link-function scale. For the binomial GLM, \(\beta_1\) is the increase in the \(\log(odds(p))\) for a unit increase in \(x\). On the response scale, \(e^{\beta_1}\) is the multiplicative increase in the odds of \(y = 1\) for a unit increase in the predictor.
With numerical predictors we can mean-center and/or standardize the predictors. With centering (similarly to the categorical example) we change the interpretation of the intercept. Standardizing is helpful to have more meaningful \(\beta\) values: the \(\beta_i\) of a standardized predictor is the increase in \(y\) for an increase of one standard deviation in \(x\).
\[ x_{cen} = x_i - \bar{x} \]
\[ x_{z} = \frac{x_i - \bar{x}}{\sigma_x} \]
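A minimal sketch for the teddy data (the new column names are assumptions):

teddy$Parental_stress_cen <- teddy$Parental_stress - mean(teddy$Parental_stress) # centered
teddy$Parental_stress_z <- scale(teddy$Parental_stress)[, 1]                     # standardized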
Let’s return to our teddy child example and fit the proper model:
fit_glm <- glm(Depression_pp01 ~ Parental_stress, data = teddy, family = binomial(link = "logit"))
summary(fit_glm)
#>
#> Call:
#> glm(formula = Depression_pp01 ~ Parental_stress, family = binomial(link = "logit"),
#> data = teddy)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2852 -0.5165 -0.4509 -0.3861 2.3096
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.323906 0.690689 -6.260 3.84e-10 ***
#> Parental_stress 0.036015 0.009838 3.661 0.000251 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 284.13 on 378 degrees of freedom
#> Residual deviance: 271.23 on 377 degrees of freedom
#> AIC: 275.23
#>
#> Number of Fisher Scoring iterations: 5
The (Intercept) (\(\beta_0\)), once back-transformed with the inverse logit, is the probability of having post-partum depression for a mother with a parental stress of zero (maybe centering would be better?):
\[ p(yes|x = 0) = g^{-1}(\beta_0) \]
The Parental_stress coefficient (\(\beta_1\)) is the increase in the \(\log(odds)\) of having post-partum depression for a unit increase in the Parental Stress Index. If we take the exponential of \(\beta_1\) we obtain the multiplicative increase in the \(odds\) of having post-partum depression for a unit increase in the Parental Stress Index.
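A minimal sketch of these back-transformations for the model above:

plogis(coef(fit_glm)[1]) # p(depression) when Parental_stress = 0, ~0.013
exp(coef(fit_glm)[2])    # odds ratio for a 1-point increase in stress, ~1.037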
The problem is that, as shown before, the effects are non-linear on the probability scale while they are linear on the logit scale. On the logit scale, all differences are constant:
While on the probability scale, the differences are not the same. This is problematic when interpreting the results of a Binomial GLM with a numerical predictor.
The divide by 4 rule is a very easy way to evaluate the effect of a continuous predictor.
Given the non-linearity, the derivative of the logistic function (i.e., the slope) is maximal for probabilities around ~\(0.5\).
In fact, \(\beta_i p (1 - p)\) is maximized when \(p = 0.5\) turning into \(\beta_i 0.25\) (i.e., dividing by 4).
Dividing \(\beta_i\) by 4 we obtain the maximal slope, thus the maximal difference in probability associated with a unit increase in \(x\).
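For the teddy model, this gives an upper bound on the change in probability per unit of parental stress:

coef(fit_glm)["Parental_stress"] / 4 # ~0.009, i.e., at most ~0.9 percentage points per unit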
In a similar way, we can use the inverse logit function to find the predicted probability for specific values of \(x\). For example, the difference \(p(y = 1|x = 2.5) - p(y = 1|x = 5)\) can be calculated using the model equation:
In R we can directly use the predict() function with the argument type = "response" to return the predicted probabilities instead of the logits:
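A sketch for the teddy model at two assumed stress levels:

nd <- data.frame(Parental_stress = c(50, 100))
predict(fit_glm, newdata = nd, type = "response") # predicted probabilities of depression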
I have written the epredict() function that extends the predict() function, giving some useful messages when computing predictions. You can use it with every model and also with multiple predictors.
A marginal effect measures the association between a change in a predictor (\(x\)), and a change in the response \(y\).
The slope can be evaluated for any level of \(x\). In fact, the divide-by-4 rule is the maximal slope evaluated at a specific \(x\).
The divide-by-4 rule gives the maximal marginal effect. The average marginal effect is the average of all slopes. This can be easily computed with the marginaleffects package:
library(marginaleffects)
# all marginal effects
slopes(fit)
#> # A tibble: 100 × 14
#> rowid term estimate std.error statistic p.value s.value conf.low conf.high
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 x 0.00403 0.00354 1.14 2.55e-1 1.97 -2.91e-3 0.0110
#> 2 2 x 0.137 0.0277 4.95 7.47e-7 20.4 8.28e-2 0.191
#> 3 3 x 0.0489 0.0213 2.30 2.15e-2 5.54 7.23e-3 0.0906
#> 4 4 x 0.0383 0.0188 2.04 4.15e-2 4.59 1.48e-3 0.0750
#> 5 5 x 0.222 0.0407 5.45 4.91e-8 24.3 1.42e-1 0.302
#> 6 6 x 0.0526 0.0193 2.72 6.48e-3 7.27 1.47e-2 0.0905
#> 7 7 x 0.0811 0.0229 3.54 3.98e-4 11.3 3.62e-2 0.126
#> 8 8 x 0.199 0.0386 5.16 2.52e-7 21.9 1.23e-1 0.275
#> 9 9 x 0.00353 0.00319 1.11 2.68e-1 1.90 -2.72e-3 0.00979
#> 10 10 x 0.0270 0.0136 1.99 4.68e-2 4.42 3.82e-4 0.0537
#> # ℹ 90 more rows
#> # ℹ 5 more variables: predicted_lo <dbl>, predicted_hi <dbl>, predicted <dbl>,
#> # y <int>, x <dbl>
# marginal effects when x = 5
slopes(fit, newdata = data.frame(x = 5))
#> # A tibble: 1 × 14
#> rowid term estimate std.error statistic p.value s.value conf.low conf.high
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 x 0.177 0.0338 5.22 1.76e-7 22.4 0.110 0.243
#> # ℹ 5 more variables: predicted_lo <dbl>, predicted_hi <dbl>, predicted <dbl>,
#> # x <dbl>, y <int>
# average marginal effect
avg_slopes(fit)
#> # A tibble: 1 × 8
#> term estimate std.error statistic p.value s.value conf.low conf.high
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 x 0.0934 0.00312 29.9 5.94e-197 652. 0.0873 0.0995
With multiple variables, marginal effects are also useful to see the effect of one predictor fixing the level of others. Let’s simulate a model with two predictors:
We can see the marginal effects (unit-level, averaged, or at specific values) for \(x1\) while fixing \(x2\) at its mean:
slopes(fit, newdata = data.frame(x1 = dat$x1, x2 = mean(dat$x2)))
#> # A tibble: 200 × 15
#> rowid term estimate std.error statistic p.value s.value conf.low conf.high
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 x1 0.0991 0.0624 1.59 0.112 3.16 -0.0232 0.221
#> 2 2 x1 0.191 0.195 0.979 0.327 1.61 -0.191 0.573
#> 3 3 x1 0.0805 0.0419 1.92 0.0546 4.20 -0.00157 0.163
#> 4 4 x1 0.0555 0.0227 2.45 0.0145 6.11 0.0110 0.100
#> 5 5 x1 0.0499 0.0204 2.45 0.0145 6.11 0.00990 0.0899
#> 6 6 x1 0.0998 0.0632 1.58 0.114 3.13 -0.0241 0.224
#> 7 7 x1 0.135 0.110 1.23 0.219 2.19 -0.0803 0.350
#> 8 8 x1 0.0605 0.0254 2.38 0.0174 5.84 0.0106 0.110
#> 9 9 x1 0.220 0.241 0.913 0.361 1.47 -0.252 0.691
#> 10 10 x1 0.152 0.134 1.13 0.258 1.95 -0.111 0.414
#> # ℹ 190 more rows
#> # ℹ 6 more variables: predicted_lo <dbl>, predicted_hi <dbl>, predicted <dbl>,
#> # x1 <dbl>, x2 <dbl>, y <int>
Sometimes it is useful to do what is called inverse estimation, that is, predicting the \(x\) level associated with a certain \(y\); in this case, the \(x\) producing a certain probability \(p\) of response.
Sometimes this is called median effective dose when finding the \(x\) level producing \(50\%\) of correct responses.
We can use the MASS::dose.p() function:
In Psychophysics it is common to use inverse estimation to estimate the detection or performance threshold, for example estimating the \(x\) level associated with a certain probability of success, e.g., 50%.
Let’s simulate an ideal observer that responds to stimuli with different contrast levels:
th <- 0.7 # -(b0/b1)
slope <- 1/8
b1 <- 1/slope
b0 <- -th*b1
x <- rep(seq(0, 1, 0.1), each = 20)
p <- plogis(b0 + b1 * x)
y <- rbinom(length(x), 1, p)
yp <- tapply(y, x, mean)
xc <- unique(x)
plot(xc, yp, type = "b", ylim = c(0, 1), xlab = "Contrast", ylab = "Probability of Detection")
curve(plogis(x, th, slope), add = TRUE, col = "firebrick", lwd = 2)
#>
#> Call:
#> glm(formula = y ~ x, family = binomial(link = "logit"))
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.0471 -0.6306 -0.2320 0.5124 2.6938
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.9927 0.6644 -7.515 5.69e-14 ***
#> x 6.9569 0.9410 7.393 1.44e-13 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 273.67 on 219 degrees of freedom
#> Residual deviance: 163.71 on 218 degrees of freedom
#> AIC: 167.71
#>
#> Number of Fisher Scoring iterations: 6
We can estimate the threshold and slope of the psychometric function using the equations from Knoblauch & Maloney (2012).
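A minimal sketch, assuming the model from the output above is stored in an object called fit_pf:

fit_pf <- glm(y ~ x, family = binomial(link = "logit"))
# threshold: the contrast giving p = 0.5, i.e., -(b0/b1); slope: 1/b1
-coef(fit_pf)[1] / coef(fit_pf)[2] # ~0.72, close to the true value of 0.7
1 / coef(fit_pf)[2]                # ~0.14, close to the true value of 1/8
MASS::dose.p(fit_pf, p = 0.5)      # the same threshold, with a standard error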
The Odds Ratio can be considered an effect size measure. We can transform the OR into other effect size measures (Borenstein et al., 2009; Sánchez-Meca et al., 2003).
\[ d = \log(OR) \frac{\sqrt{3}}{\pi} \]
# in R with the effectsize package
or <- 1.5
effectsize::logoddsratio_to_d(log(or))
#> [1] 0.2235446
# or
effectsize::oddsratio_to_d(or)
#> [1] 0.2235446
The basic approach when doing inference with GLMs is interpreting the Wald test of each model coefficient. The Wald test is calculated as follows:
\[ z = \frac{\beta_i - \beta_0}{\sigma_{\beta_i}} \]
Where \(\beta_i\) is the regression coefficient, \(\beta_0\) is the value under the null hypothesis (generally 0) and \(\sigma_{\beta_i}\) is the standard error. P-values and confidence intervals can be calculated based on a standard normal distribution.
Wald-type confidence interval (directly from the model summary), where \(\Phi^{-1}\) is the quantile function of the standard normal distribution (qnorm() in R):
\[ 95\%\ CI = \hat \beta \pm \Phi^{-1}(1 - \alpha/2) \; SE_{\beta} \]
(summ <- data.frame(summary(fit)$coefficients))
#> # A tibble: 2 × 4
#> Estimate Std..Error z.value Pr...z..
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.575 0.295 -1.95 0.0508
#> 2 1.42 0.427 3.33 0.000855
#> [1] -1.152824 2.124464
Profile likelihood confidence intervals (the default of confint() for GLMs) are computed a little differently and they are not always symmetric:
# profile likelihood, different from wald type
confint(fit)
#> 2.5 % 97.5 %
#> (Intercept) -1.1726621 -0.00947434
#> tv_shows 0.6034638 2.28247024
You can use the plot_param() function (quite overkill):
plot_param(fit, "tv_shows")
When calculating confidence intervals it is important to consider the link function. In the same way as we apply the inverse logit to the parameter value, we can also back-transform the confidence interval limits. IMPORTANT: do not apply the inverse logit to the standard error and then compute the confidence interval.
fits <- broom::tidy(fit) # extract parameters as dataframe
fits
#> # A tibble: 2 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -1.15 0.331 -3.48 0.000500
#> 2 tv_shows 1.73 0.443 3.90 0.0000967
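A minimal sketch: compute the Wald interval for the tv_shows coefficient on the logit scale and then exponentiate it to obtain a confidence interval for the odds ratio:

b1 <- fits$estimate[2]
se1 <- fits$std.error[2]
ci_logit <- b1 + qnorm(c(0.025, 0.975)) * se1
exp(ci_logit) # confidence interval for the odds ratio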
The same principle holds for predicted probabilities: first compute the intervals on the logit scale and then transform them back to the probability scale:
fits <- dat |>
select(tv_shows) |>
distinct() |>
add_predict(fit, se.fit = TRUE)
fits$p <- plogis(fits$fit)
fits$lower <- plogis(with(fits, fit - 2*se.fit))
fits$upper <- plogis(with(fits, fit + 2*se.fit))
fits
#> # A tibble: 2 × 7
#> tv_shows fit se.fit residual.scale p lower upper
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.575 0.295 1 0.640 0.497 0.762
#> 2 0 -1.15 0.331 1 0.240 0.140 0.380
With multiple predictors, especially with categorical variables with more than 2 levels, we can compute an ANOVA-like analysis identifying the effect of each predictor. In R we can do this using the car::Anova() function. Let's simulate a model with a 2x2 interaction:
#> # A tibble: 6 × 4
#> id x1 x2 y
#> <int> <fct> <fct> <int>
#> 1 1 a c 1
#> 2 2 a d 0
#> 3 3 b c 1
#> 4 4 b d 1
#> 5 5 a c 1
#> 6 6 a d 0
We can fit the most complex model containing the two main effects and the interaction. I set the contrasts for the two factors as contr.sum()/2, which is recommended for a proper analysis of factorial designs (see the sketch below):
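A minimal sketch of how such contrasts could be set, assuming x1 and x2 are factors in dat:

contrasts(dat$x1) <- contr.sum(2) / 2 # levels coded as 0.5 / -0.5
contrasts(dat$x2) <- contr.sum(2) / 2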
fit_max <- glm(y ~ x1 + x2 + x1:x2, data = dat, family = binomial(link = "logit")) # same as x1 * x2
summary(fit_max)
#>
#> Call:
#> glm(formula = y ~ x1 + x2 + x1:x2, family = binomial(link = "logit"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.1213 -0.9005 -0.5350 1.2346 2.0074
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.8864 0.2138 -4.147 3.37e-05 ***
#> x11 0.7921 0.4275 1.853 0.0639 .
#> x21 0.9462 0.4275 2.213 0.0269 *
#> x11:x21 -0.4649 0.8551 -0.544 0.5867
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 148.26 on 119 degrees of freedom
#> Residual deviance: 139.86 on 116 degrees of freedom
#> AIC: 147.86
#>
#> Number of Fisher Scoring iterations: 4
Then we use car::Anova(). For each effect (the rows below correspond to x1, x2 and the x1:x2 interaction) we have the \(\chi^2\) statistic and the associated p-value. The null hypothesis is that the specific term does not contribute to reducing the residual deviance.
car::Anova(fit_max)
#> # A tibble: 3 × 3
#> `LR Chisq` Df `Pr(>Chisq)`
#> <dbl> <dbl> <dbl>
#> 1 3.32 1 0.0683
#> 2 4.92 1 0.0266
#> 3 0.299 1 0.584
The table obtained with car::Anova() is essentially a set of model comparisons using the likelihood ratio test. This can also be done directly with the anova() function.
\[\begin{align*} D = 2 (log(\mathcal{L}_{full}) - log(\mathcal{L}_{reduced})) \\ D \sim \chi^2_{df_{full} - df_{reduced}} \end{align*}\]
To understand this better, the x2 effect reported in the car::Anova() table is a model comparison between a model with y ~ x1 + x2 and a model without x2. The difference between these two models is the unique contribution of x2 after controlling for x1:
fit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))
fit0 <- glm(y ~ x1, data = dat, family = binomial(link = "logit"))
anova(fit0, fit, test = "LRT") # ~ same as car::Anova(fit_max)
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 118 145. NA NA NA
#> 2 117 140. 1 4.92 0.0266
The model comparison using anova() (i.e., likelihood ratio tests) is limited to nested models, that is, models that differ only by the inclusion of some terms. For example:
fit1 and fit2 are non-nested because they have the same number of predictors (thus the same degrees of freedom). fit3 and fit1/fit2 are nested because fit3 is more complex and by removing one term we can obtain the less complex models (a sketch of the three models is below).
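A sketch of what the three models could look like, given the description above and the degrees of freedom in the outputs below (formulas assumed):

fit1 <- glm(y ~ x1, data = dat, family = binomial(link = "logit"))
fit2 <- glm(y ~ x2, data = dat, family = binomial(link = "logit"))
fit3 <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))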
anova(fit1, fit2, test = "LRT") # does not work properly, the models are not nested
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 118 145. NA NA NA
#> 2 118 143. 0 1.59 NA
anova(fit1, fit3, test = "LRT") # same anova(fit2, fit3)
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 118 145. NA NA NA
#> 2 117 140. 1 4.92 0.0266
As for standard linear models, we can use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to compare non-nested models. The downside is not having a proper hypothesis testing setup.
data.frame(BIC(fit1, fit2, fit3)) |>
arrange(BIC)
#> # A tibble: 3 × 2
#> df BIC
#> <dbl> <dbl>
#> 1 2 153.
#> 2 3 155.
#> 3 2 155.
data.frame(AIC(fit1, fit2, fit3)) |>
arrange(AIC)
#> # A tibble: 3 × 2
#> df AIC
#> <dbl> <dbl>
#> 1 3 146.
#> 2 2 147.
#> 3 2 149.
For post-hoc contrasts you can use the emmeans package, for both numerical and especially categorical predictors. Let's simulate a 2x2 interaction:
dat <- sim_design(100, cx = list(x1 = c("a", "b"), x2 = c("c", "d")), contrasts = contr.sum2)
dat <- sim_data(dat, plogis(qlogis(0.25) + 0 * x1_c + 2 * x2_c + 2 * x1_c * x2_c))
fit <- glm(y ~ x1 * x2, data = dat, family = binomial(link = "logit"))
summary(fit)
#>
#> Call:
#> glm(formula = y ~ x1 * x2, family = binomial(link = "logit"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.5096 -0.6866 -0.2468 0.8782 2.6482
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1449 0.1755 -6.524 6.86e-11 ***
#> x11 -0.4326 0.3510 -1.232 0.218
#> x21 2.5113 0.3510 7.155 8.37e-13 ***
#> x11:x21 3.4372 0.7020 4.896 9.76e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 502.99 on 399 degrees of freedom
#> Residual deviance: 386.90 on 396 degrees of freedom
#> AIC: 394.9
#>
#> Number of Fisher Scoring iterations: 6
emmeans is a very powerful and complex package (see the documentation). First, we can estimate the marginal means of the conditions:
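One possible call producing the output below (the exact specification is an assumption):

library(emmeans)
emmeans(fit, ~ x1 * x2)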
#> x1 x2 emmean SE df asymp.LCL asymp.UCL
#> a c 0.754 0.214 Inf 0.334 1.174
#> b c -0.532 0.207 Inf -0.938 -0.126
#> a d -3.476 0.586 Inf -4.625 -2.327
#> b d -1.325 0.246 Inf -1.806 -0.844
#>
#> Results are given on the logit (not the response) scale.
#> Confidence level used: 0.95
We can also calculate the marginal means on the response scale:
#> x1 x2 prob SE df asymp.LCL asymp.UCL
#> a c 0.68 0.0466 Inf 0.58264 0.7639
#> b c 0.37 0.0483 Inf 0.28127 0.4685
#> a d 0.03 0.0171 Inf 0.00971 0.0889
#> b d 0.21 0.0407 Inf 0.14111 0.3008
#>
#> Confidence level used: 0.95
#> Intervals are back-transformed from the logit scale
Crucially we can compute the contrasts (i.e., the post-hoc tests). Notice that they are computed on the link-function scale:
emmeans(fit, pairwise ~ x1 | x2)$contrasts
#> x2 = c:
#> contrast estimate SE df z.ratio p.value
#> a - b 1.29 0.298 Inf 4.314 <.0001
#>
#> x2 = d:
#> contrast estimate SE df z.ratio p.value
#> a - b -2.15 0.636 Inf -3.385 0.0007
#>
#> Results are given on the log odds ratio (not the response) scale.
emmeans(fit, pairwise ~ x1 | x2, type = "response")$contrasts
#> x2 = c:
#> contrast odds.ratio SE df null z.ratio p.value
#> a / b 3.618 1.0786 Inf 1 4.314 <.0001
#>
#> x2 = d:
#> contrast odds.ratio SE df null z.ratio p.value
#> a / b 0.116 0.0739 Inf 1 -3.385 0.0007
#>
#> Tests are performed on the log odds ratio scale
Notice that the order of the terms changes the actual type of contrasts:
emmeans(fit, pairwise ~ x2 | x1)$contrasts
#> x1 = a:
#> contrast estimate SE df z.ratio p.value
#> c - d 4.230 0.624 Inf 6.777 <.0001
#>
#> x1 = b:
#> contrast estimate SE df z.ratio p.value
#> c - d 0.793 0.321 Inf 2.468 0.0136
#>
#> Results are given on the log odds ratio (not the response) scale.
emmeans(fit, pairwise ~ x2 | x1, type = "response")$contrasts
#> x1 = a:
#> contrast odds.ratio SE df null z.ratio p.value
#> c / d 68.71 42.89 Inf 1 6.777 <.0001
#>
#> x1 = b:
#> contrast odds.ratio SE df null z.ratio p.value
#> c / d 2.21 0.71 Inf 1 2.468 0.0136
#>
#> Tests are performed on the log odds ratio scale
When plotting a binomial GLM the most useful approach is plotting the marginal probabilities and standard errors/confidence intervals for a given combination of predictors. Let's make an example.
A general workflow could be:
- create a grid of predictor values on which to compute predictions
- use the predict() function, giving the grid of values on which to compute predictions (and standard errors) on the link scale
- back-transform predictions and intervals to the probability scale and plot them
Everything can be simplified using some packages that perform each step and return a plot:
- allEffects() from the effects package (returns a base R plot)
- ggeffect() from the ggeffects package (returns a ggplot2 object)
- plot_model() from the sjPlot package (similar to ggeffect())
In this situation we can just plot the marginal probabilities for each level of the categorical predictor. Plotting our exam dataset:
allEffects()
ggeffect()/plot_model()
Sometimes it could be useful to plot the estimated sampling distribution of a coefficient. For example, we can plot the tv_shows effect from our example. I've written the plot_param() function that directly creates a basic plot given the model and the coefficient name. The plot highlights the null value and the 95% Wald confidence interval.
Diagnostics for GLMs are similar to those for standard linear models. Some areas are more complicated, for example residual analysis and goodness of fit. We will see:
The likelihood is the joint probability of the observed data viewed as a function of the parameters.
\[ \log \mathcal{L}(\mu|x) = \sum^n_{i = 1} \log \mathcal{L}(\mu|x_i) \]
x <- rnorm(10) # data from normal distribution
# data
x
#> [1] -1.8694218 0.1218114 -1.3832446 0.9884026 -0.4416857 -0.3099413
#> [7] -0.8407464 -1.8613016 0.2581889 -0.1618937
# the model is a normal distribution with mu = 0 and sd = 1
dnorm(x, 0, 1)
#> [1] 0.06950842 0.39599347 0.15325971 0.24477683 0.36186586 0.38023327
#> [7] 0.28016802 0.07056927 0.38586439 0.39374833
# log likelihood
sum(log(dnorm(x, 0, 1)))
#> [1] -14.66699
By summing the logarithms of all the red segments (the density of each observation) we obtain the log likelihood of the model given the observed data.
In the previous slide we tried only 3 values for the mean. Let's imagine calculating the log likelihood for several different means. The value associated with the highest likelihood is called the maximum likelihood estimate. In fact, \(\mu = 0\) is associated with the highest likelihood.
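A minimal sketch of this grid search over candidate means, using the x vector simulated above:

mus <- seq(-2, 2, 0.1)
ll <- sapply(mus, function(m) sum(dnorm(x, mean = m, sd = 1, log = TRUE)))
mus[which.max(ll)] # close to mean(x), the maximum likelihood estimate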
The (residual) deviance in the context of GLM can be considered as the distance between the current model and a perfect model (often called saturated model).
\[ D_{res} = -2[\log(\mathcal{L}_{current}) - (\log\mathcal{L}_{sat})] \]
Where \(\mathcal{L}\) is the likelihood of the considered model (see the previous slides). Clearly, the lower the deviance, the closer the current model is to the perfect model, suggesting a good fit.
The saturated model is a model where each observation \(x_i\) has its own parameter.
Another important quantity is the null deviance that is expressed as the distance between the null model and the saturated model.
\[ D_{null} = -2[\log(\mathcal{L}_{null}) - (\log\mathcal{L}_{sat})] \]
The null deviance can be interpreted as the maximal deviance because it is estimated using a model without predictors. A good model will have a residual deviance lower than the null deviance.
When we perform a likelihood ratio test (see previous slides) we are essentially comparing the residual deviance (or the likelihood) of two models.
For the null model, the residual deviance is clearly the same as the null deviance because we are not using predictors to reduce the deviance.
summary(fit_null)
#>
#> Call:
#> glm(formula = y ~ 1, family = binomial(link = "logit"), data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.044 -1.044 -1.044 1.317 1.317
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.3228 0.2865 -1.126 0.26
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 68.029 on 49 degrees of freedom
#> Residual deviance: 68.029 on 49 degrees of freedom
#> AIC: 70.029
#>
#> Number of Fisher Scoring iterations: 4
If the predictor x is useful in explaining y, the residual deviance will be reduced compared to the null (overall) deviance.
summary(fit_current)
#>
#> Call:
#> glm(formula = y ~ x, family = binomial(link = "logit"), data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.9121 -0.4071 -0.2063 0.6259 1.7336
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -5.045 1.470 -3.432 0.000599 ***
#> x 9.535 2.659 3.586 0.000336 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 68.029 on 49 degrees of freedom
#> Residual deviance: 36.745 on 48 degrees of freedom
#> AIC: 40.745
#>
#> Number of Fisher Scoring iterations: 6
Comparing the two models we can understand if the deviance reduction can be considered statistically significant.
anova(fit_null, fit_current, test = "LRT")
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 49 68.0 NA NA NA
#> 2 48 36.7 1 31.3 0.0000000223
Notice that the difference between the two residual deviances is the test statistic, which is distributed as a \(\chi^2\) with \(df = 1\).
Let’s fit the three models:
We can calculate the residual deviance:
We can calculate the null deviance:
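A minimal sketch of these computations by hand for a Bernoulli GLM, using the fit_current object from above (for Bernoulli data the saturated log-likelihood is 0):

p_hat <- fitted(fit_current)
dev_res <- -2 * sum(dbinom(fit_current$y, 1, p_hat, log = TRUE))   # residual deviance
p_null <- mean(fit_current$y)
dev_null <- -2 * sum(dbinom(fit_current$y, 1, p_null, log = TRUE)) # null deviance
c(dev_res, dev_null) # should match deviance(fit_current) and fit_current$null.deviance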
Compared to the standard linear regression, there are multiple ways to calculate an \(R^2\) like measure for GLMs and there is no consensus about the most appropriate method. There are some useful resources:
Note that some measures are specific to the binomial GLM while other measures can also be applied to other GLMs (e.g., the Poisson).
We will see:
The McFadden's pseudo-\(R^2\) computes one minus the ratio between the log-likelihood of the current model and the log-likelihood of the intercept-only (i.e., null) model (McFadden, 1987):
\[ R^2 = 1 - \frac{\log(\mathcal{L_{current}})}{\log(\mathcal{L_{null}})} \]
There is also the adjusted version that takes into account the number of parameters of the model. In R it can be computed manually or using performance::r2_mcfadden():
performance::r2_mcfadden(fit)
#> # R2 for Generalized Linear Regression
#> R2: 0.034
#> adj. R2: 0.018
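A sketch of the manual computation, refitting an assumed intercept-only model with update():

fit_int <- update(fit, . ~ 1) # intercept-only (null) model
1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit_int)) # McFadden's pseudo-R2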
This measure is the easiest to interpret and calculate but can only be applied to binary (Bernoulli) models (Tjur, 2009). It is the absolute value of the difference between the average predicted probability for observations with \(y = 1\) and the average predicted probability for observations with \(y = 0\):
\[\begin{align*} \bar{\pi}_1 = \text{mean}(\hat p_i \mid y_i = 1) \\ \bar{\pi}_0 = \text{mean}(\hat p_i \mid y_i = 0) \\ R^2 = |\bar{\pi}_1 - \bar{\pi}_0| \end{align*}\]
performance::r2_tjur(fit2)
#> Tjur's R2
#> 0.5654064
The main problem of GLM is that the mean and the variance are linked. For example, the mean of a binomial distribution is \(np\) and the variance is \(np(1-p)\). In the standard linear model we have two parameters, \(\mu\) and the residual \(\sigma\) that are independent.
When using a linear model (thus assuming a constant \(\sigma\) and independence between mean and variance) the residuals pattern is very different:
As for standard linear models there are different types of residuals:
Raw residuals, also called response residuals are the simplest type of residuals. They are calculated as in standard regression as:
\[\begin{align*} r_i = y_i - \hat y_i \end{align*}\]
Where \(\hat y_i\) are the fitted values after applying the inverse of the link function (i.e., the predicted probabilities).
In R:
# equivalent to residuals(fit, type = "response")
ri <- fit$y - fitted(fit)
ri[1:5]
#> 1 2 3 4 5
#> 0.56826521 0.18039223 0.07097155 0.56848124 0.04399144
The problem is that in GLMs the mean and the variance of the distribution are not independent, creating problems in residual analysis.
We can use the function car::residualPlot()
to plot different type of residuals against fitted values:
car::residualPlot(fit, type = "response")
This plot1 shows an example with the same residual for two different \(x\) values on a Poisson GLM. Beyond the model itself, the same residual can be considered as extreme for low \(x\) values and plausible for high \(x\) values:
Gelman and colleagues (Gelman & Hill, 2006) proposed a type of residuals called binned residuals to solve the problem shown in the previous plot for Binomial GLMs:
We can use the arm::binnedplot()
function to automatically create and plot the binned residuals:
Pearson residuals are raw residuals divided by the standard deviation of each residual. Given the mean-variance relationship of GLMs, dividing by the standard deviation partially removes this relationship. The denominator can be calculated just using the appropriate variance formula for the specific GLM.
\[ r_i = \frac{y_i - \hat y_i}{\sqrt{V(\hat y_i)}} \]
We can see the difference for a Poisson model1 when using raw vs pearson residuals. The non-constant variance is controlled on the right.
For the Binomial (Bernoulli) GLM, the variance is calculated as \(\hat p(1 - \hat p)\) where \(\hat p\) is the predicted probability (the fitted value).
\[ r_i = \frac{y_i - \hat y_i}{\sqrt{\hat y_i(1 - \hat y_i)}} \]
Deviance residuals are based on the residual deviance that we defined before. In fact, the residual deviance is just the sum of the squared deviance residuals.
\[ r_i = \text{sign}(y_i - \hat y_i) \sqrt{-2\left[\log \mathcal{L}(y_i)_{current} - \log \mathcal{L}(y_i)_{saturated}\right]} \] Where \(\text{sign}\) is the sign of the raw residual \(i\). For the Bernoulli model, \(\log \mathcal{L}(y_i)_{saturated}\) is always 0.
To demonstrate how deviance residuals work, we can compute them manually in R (see the sketch below):
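A sketch of a computation consistent with the outputs below (object names are assumptions; the first printout suggests the per-observation saturated log-likelihood, which is always 0 for Bernoulli data):

yi <- fit$y         # observed 0/1 responses
yhat <- fitted(fit) # predicted probabilities
ll_sat <- log(dbinom(yi, 1, yi)) # saturated log-likelihood of each observation (all 0)
ll_sat
ri_dev <- sign(yi - yhat) * sqrt(-2 * (dbinom(yi, 1, yhat, log = TRUE) - ll_sat))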
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
#> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ri_dev[1:5]
#> 1 2 3 4 5
#> -0.2370588 0.3033530 -1.0310244 0.2331303 -0.8429485
residuals(fit, type = "deviance")[1:5]
#> 1 2 3 4 5
#> -0.2370588 0.3033530 -1.0310244 0.2331303 -0.8429485
The hat matrix \(H = X \left(X^{\top} X \right)^{-1} X^{\top}\) is an \(n \times n\) matrix where \(n\) is the number of observations. The diagonal of the \(H\) matrix contains the hat values or leverages.
The \(i^{th}\) leverage score (\(h_{ii}\)) is interpreted as the weighted distance between \(x_i\) and the mean of the \(x_i\)'s. In practical terms, it is how much the \(i^{th}\) observed value influences the \(i^{th}\) fitted value. A high leverage suggests that the observation is far from the mean of the predictors and has a high influence on the fitted values.
For a simple linear regression (\(y \sim x\)) the hat values are calculated as:
\[ h_i = \frac{1}{n} + \frac{(X_i - \bar X)^2}{\sum^n_{j = 1}(X_j - \bar X)^2} \]
In R the function hatvalues() returns the diagonal of the \(H\) matrix for glm and lm objects:
hatvalues(fit)[1:10]
#> 1 2 3 4 5 6 7
#> 0.02874015 0.03492581 0.03968557 0.02782301 0.04357230 0.04652508 0.04435804
#> 8 9 10
#> 0.04012613 0.03948651 0.04743780
Note that for GLMs and multiple regression in general the equation is different and more complicated, but the interpretation and the R implementation are the same.
The response, Pearson and deviance residuals can all be considered raw (unstandardized) residuals. We can standardize residuals by dividing them by the standard error computed with the hat values. In this way, the distribution will be approximately normal with \(\mu = 0\) and \(\sigma = 1\).
\[ r_{s_i} = \frac{r_i}{\sqrt{(1 - \hat h_{ii})}} \]
Where \(r_i\) can be response, Pearson or deviance residuals. In R they can be extracted using rstandard():
We can try to manually calculate the residuals to better understand.
yhat <- fitted(fit) # fitted
yi <- fit$y # observed
hi <- hatvalues(fit) # diagonal of the hat matrix
# pearson residuals
pi <- (yi - yhat) / sqrt(yhat * (1 - yhat))
# standardized
pis <- pi / sqrt(1 - hi)
pis[1:5]
#> 1 2 3 4 5
#> -0.1712897 0.2208858 -0.8546823 0.1683326 -0.6678440
rstandard(fit, type = "pearson")[1:5]
#> 1 2 3 4 5
#> -0.1712897 0.2208858 -0.8546823 0.1683326 -0.6678440
# deviance (for binomial GLM the loglik of the saturated model is 0)
di <- sign(yi - yhat) * sqrt(-2*log(dbinom(yi, 1, yhat)))
# standardized
dis <- di / sqrt(1 - hi)
dis[1:5]
#> 1 2 3 4 5
#> -0.2405406 0.3087934 -1.0521126 0.2364427 -0.8619359
rstandard(fit, type = "deviance")[1:5]
#> 1 2 3 4 5
#> -0.2405406 0.3087934 -1.0521126 0.2364427 -0.8619359
Quantile residuals are another proposal for residual analysis. The idea is to map the quantile of the cumulative density function (CDF) of the random component into the CDF of the normal distribution.
Quantile residuals are very useful, especially for discrete GLMs (binomial and Poisson), and, when model assumptions hold, they are exactly normally distributed, in contrast to deviance and Pearson residuals (Dunn & Smyth, 1996). They can be calculated using the statmod::qresid(fit) function. The authors suggest running the function 4 times to disentangle the randomization from the systematic component.
Residuals for GLMs are complicated. To have a more detailed and clear overview see:
The error rate (ER) is defined as the proportion of cases for which the deterministic prediction, i.e., guessing \(y_i = 1\) if \(logit^{-1}(\hat y_i) > 0.5\) and guessing \(y_i = 0\) if \(logit^{-1}(\hat y_i) < 0.5\), is wrong. Clearly, \(1 - ER\) is the classification accuracy.
I wrote the error_rate() function that simply computes the error rate of a given model:
error_rate(fit)
#> [1] 0.14
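A hypothetical sketch of what such a function could compute (not necessarily the author's implementation):

error_rate_sketch <- function(fit) {
  y <- fit$y                                 # observed 0/1 responses
  y_guess <- ifelse(fitted(fit) > 0.5, 1, 0) # deterministic classification
  mean(y != y_guess)                         # proportion of wrong guesses
}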
We could compare the error rate of a given model with the error rate of the null model or another similar model (with a model comparison approach):
fit0 <- update(fit, . ~ . - x) # removing the x predictor, now an intercept-only model
error_rate(fit0)
#> [1] 0.48
# the error rate of the null model is ... greater/less than the actual model
er_ratio = error_rate(fit0)/error_rate(fit)
The error rate of the null model is 3.429 times greater than the actual model.
For a given model, the error rate should be less than \(0.5\), otherwise setting all \(\beta\)'s to 0 (i.e., the null model) would be considered a better model (Gelman et al., 2020).
For example, if the average \(p\) in a dataset is \(\overline p = 0.3\), a null model (always guessing \(y = 0\)) would have an error rate of \(0.3\) and a classification accuracy of \(1 - \overline p = 0.7\).
By including predictors we aim to reduce the error rate (compared to the null model) because the ability to correctly classify an observation increases.
The error rate can be misleading, see Gelman et al. (2020) (pp. 255-256)
Identification of influential observations and outliers in GLMs is very similar to standard regression models. We will briefly see:
The Cook distance of an observation \(i\) measures the impact of that observation on the overall model fit. If removing observation \(i\) has a high impact, observation \(i\) is likely an influential observation. For GLMs it is defined as:
\[\begin{align*} D_i = \frac{r_i^2}{\phi p} \frac{h_{ii}}{1 - h_{ii}} \end{align*}\]
Where \(p\) is the number of model parameters, \(r_i\) are the standardized Pearson residuals (rstandard(fit, type = "pearson")) and \(h_{ii}\) are the hat values (leverages). \(\phi\) is the dispersion parameter of the GLM, which for binomial and Poisson models is fixed to 1 (see Dunn (2018, Table 5.1)). Usually an observation is considered influential if \(D_i > \frac{4}{n}\) where \(n\) is the sample size.
DFBETAs measure the impact of the observation \(i\) on the estimated parameter \(\beta_j\):
\[\begin{align*} DFBETAS_i = \frac{\beta_j - \beta_{j(i))}}{\sigma_{\beta_{j(i)}}} \end{align*}\]
Where \((i)\) denotes the parameter and standard error estimated from a model fitted without the \(i^{th}\) observation. Usually an observation is considered influential if \(|DFBETAS_{i}| > \frac{2}{\sqrt{n}}\) where \(n\) is the sample size.
In R we can use the influence.measures() function to calculate each influence measure explained before.
fit <- update(fit, data = fit$model[1:50, ])
infl <- influence.measures(fit)$infmat
head(infl)
#> dfb.1_ dfb.x dffit cov.r cook.d hat
#> 1 -0.04419414 0.04109800 -0.04442749 1.070827 0.0004340963 0.02874015
#> 2 -0.04672053 0.05681844 0.06310261 1.075710 0.0008828590 0.03492581
#> 3 -0.11865551 0.05713533 -0.23265873 1.028011 0.0150937972 0.03968557
#> 4 -0.03348170 0.03972704 0.04294688 1.069920 0.0004054759 0.02782301
#> 5 -0.14588644 0.10150547 -0.19921442 1.051194 0.0101596403 0.04357230
#> 6 -0.07575414 0.10700698 0.14303484 1.074641 0.0048052405 0.04652508
The first two columns are the DFBETAs for each parameter, the cook.d column contains the Cook distances and hat is the diagonal of the \(H\) matrix.
I wrote the cook_plot() function to easily plot the Cook distances along with the identification of influential observations:
cook_plot(fit)
I wrote the dfbeta_plot() function to easily plot the DFBETAs along with the identification of influential observations:
dfbeta_plot(fit)
There are several practical differences between binomial and binary models:
The most basic Binomial regression has a vector of binary \(y\) values and a continuous or categorical predictor. Let's see a common data structure in this case:
#> # A tibble: 9 × 3
#> id x y
#> <chr> <chr> <chr>
#> 1 1 0.85 1
#> 2 2 0.16 0
#> 3 3 0.08 0
#> 4 4 0.17 0
#> 5 ... ... ...
#> 6 27 0.52 1
#> 7 28 0.64 1
#> 8 29 0.11 0
#> 9 30 0.54 1
Or equivalently with a categorical variable:
n <- 15
x <- c("a", "b")
dat <- sim_design(n, cx = list(x = x))
b0 <- qlogis(0.4)
b1 <- log(odds_ratio(0.7, 0.4))
dat |>
sim_data(plogis(b0 + b1*x_c), "binomial") |>
dplyr::select(-x_c, -lp) |>
filor::trim_df(4)
#> # A tibble: 9 × 3
#> id x y
#> <chr> <chr> <chr>
#> 1 1 a 1
#> 2 2 b 0
#> 3 3 a 0
#> 4 4 b 0
#> 5 ... ... ...
#> 6 27 a 0
#> 7 28 b 1
#> 8 29 a 0
#> 9 30 b 0
When using a Binomial data structure we count the number of successes for each level of \(x\). nc is the number of 1 responses and nf is the number of 0 responses out of nt trials:
#> # A tibble: 9 × 4
#> x nc nf nt
#> <chr> <chr> <chr> <chr>
#> 1 0 0 10 10
#> 2 0.05 0 10 10
#> 3 0.1 1 9 10
#> 4 0.15 0 10 10
#> 5 ... ... ... ...
#> 6 0.85 9 1 10
#> 7 0.9 8 2 10
#> 8 0.95 9 1 10
#> 9 1 10 0 10
With a categorical variable we have essentially a contingency table:
Clearly, expanding or aggregating data is the way to convert a binary into a binomial data structure and vice versa:
# from binomial to binary
bin_to_binary(datc, nc, nt) |>
select(y, x) |>
filor::trim_df()
#> # A tibble: 9 × 2
#> y x
#> <chr> <chr>
#> 1 1 a
#> 2 1 a
#> 3 1 a
#> 4 1 a
#> 5 ... ...
#> 6 0 b
#> 7 0 b
#> 8 0 b
#> 9 0 b
# from binary to binomial
bin_to_binary(datc, nc, nt) |>
binary_to_bin(y, x)
#> # A tibble: 2 × 4
#> x nc nf nt
#> <chr> <dbl> <dbl> <int>
#> 1 a 5 5 10
#> 2 b 4 6 10
Clearly, in the previous examples, each row corresponds to a single observation (in the binary model) or to a condition (in the binomial model). Furthermore, each row/observation/participant is assumed to be independent.
If participants performed multiple trials for each condition, we can still use a binomial or binary model BUT we need to take into account the multilevel structure.
In practice, we need to use a mixed-effects GLM, for example with the lme4::glmer() function, specifying the appropriate random structure.
There is a single way to implement a binary model in R and two ways for the binomial version:
# binary regression
glm(y ~ x, family = binomial(link = "logit"))
# binomial with cbind syntax, nc = number of 1s, nf = number of 0s, nc + nf = nt
glm(cbind(nc, nf) ~ x, family = binomial(link = "logit"))
# binomial with proportions and weights, equivalent to the cbind approach, nt is the total trials
glm(nc/nt ~ x, weights = nt, family = binomial(link = "logit"))
A more relevant difference is about the residual analysis. The binary regression has different residuals compared to the binomial model fitted on the same dataset1.
The residual deviance is also different. In fact, there is more residual deviance on the binary compared to the binomial model. However, comparing two binary and binomial models actually leads to the same conclusion. In other terms the deviance seems to be on a different scale1:
In fact, if we compare the full model with the null model in both binary and binomial version, the LRT is the same but the deviance is on a different scale. Comparing a binary with a binomial model could be completely misleading.
anova(fit_binomial, fit0_binomial, test = "Chisq")
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 19 14.4 NA NA NA
#> 2 20 148. -1 -133. 8.27e-31
anova(fit_binary, fit0_binary, test = "Chisq")
#> # A tibble: 2 × 5
#> `Resid. Df` `Resid. Dev` Df Deviance `Pr(>Chi)`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 208 156. NA NA NA
#> 2 209 289. -1 -133. 8.27e-31
This point is less relevant in this course but important in general. Usually, binary regression is used when the predictor is at the trial level whereas binomial regression is used when the predictor is at the participant level. When both levels are of interest one should use a mixed-effects model where both levels can be modeled.
The probability of correct responses during an exam as a function of the question difficulty
#> # A tibble: 9 × 4
#> id question difficulty y
#> <chr> <chr> <chr> <chr>
#> 1 1 1 1 1
#> 2 1 2 3 1
#> 3 1 3 3 1
#> 4 1 4 1 0
#> 5 ... ... ... ...
#> 6 1 27 1 1
#> 7 1 28 5 1
#> 8 1 29 1 0
#> 9 1 30 1 1
The probability of passing the exam as a function of the high-school background
#> # A tibble: 4 × 4
#> background nc nf nt
#> <chr> <chr> <chr> <chr>
#> 1 math 20 10 30
#> 2 chemistry 16 4 20
#> 3 art 4 6 10
#> 4 sport 15 5 20
Or the binary version:
#> # A tibble: 9 × 5
#> y background nc nf nt
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 math 20 10 30
#> 2 1 math 20 10 30
#> 3 1 math 20 10 30
#> 4 1 math 20 10 30
#> 5 ... ... ... ... ...
#> 6 0 sport 15 5 20
#> 7 0 sport 15 5 20
#> 8 0 sport 15 5 20
#> 9 0 sport 15 5 20
When using the probit link, the parameters are interpreted as differences in z-scores associated with a unit increase in the predictors. In fact, probabilities are mapped into z-scores using the inverse of the cumulative normal distribution (qnorm() in R).
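For example, a minimal sketch refitting the teddy model with a probit link (same data and variable names as above):

fit_probit <- glm(Depression_pp01 ~ Parental_stress, data = teddy,
                  family = binomial(link = "probit"))
coef(fit_probit) # the slope is the difference in z-scores for a unit increase in stress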