Psychological Sciences
Filippo Gambarota
University of Padova
Last modified: 08-12-2025
Diagnostics for GLMs are similar to those for standard linear models, but some areas are more complicated, for example residual analysis and goodness of fit. We will see:
Compared to standard linear regression, there are multiple ways to calculate an \(R^2\)-like measure for GLMs and there is no consensus about the most appropriate method. There are some useful resources:
Note that some measures are specific to the binomial GLM, while other measures can also be applied to other GLMs (e.g., the Poisson).
We will see:
McFadden’s pseudo-\(R^2\) is based on the ratio between the log-likelihood of the current model and that of the intercept-only (i.e., null) model (McFadden, 1987):
\[ R^2 = 1 - \frac{\log(\mathcal{L_{current}})}{\log(\mathcal{L_{null}})} \]
There is also an adjusted version that takes into account the number of parameters of the model. In R it can be computed manually or using performance::r2_mcfadden():
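For the manual version, a minimal sketch (assuming a fitted GLM named fit) is:
# Manual McFadden's pseudo-R2 (a sketch; `fit` is assumed to be a fitted GLM)
fit_null <- update(fit, . ~ 1)                        # intercept-only (null) model
1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit_null))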
performance::r2_mcfadden(fit)
Tjur’s \(R^2\) is the easiest measure to interpret and calculate, but it can only be applied to binomial models with a binary response (Tjur, 2009). It is the absolute value of the difference between the mean predicted probability for observations with \(y = 1\) and that for observations with \(y = 0\):
\[ \begin{align*} \pi_0 = \frac{1}{n_0} \sum_{i = 1}^{n_0} \hat p_i \\ \pi_1 = \frac{1}{n_1} \sum_{i = 1}^{n_1} \hat p_i \\ R^2 = |\pi_1 - \pi_0| \end{align*} \]
performance::r2_tjur(fit)
Residuals in regression models represent the deviation of each observed value \(y_i\) from the fitted value \(\mu_i\) (remember that \(\mu_i = g^{-1}(\eta_i)\)).
This means that a large residual (depending on the \(y\) scale and the expected error variance) indicates a problem with the model and/or with that specific observation.
We can identify three types of residuals:
This is the most basic type of residual, very common in Gaussian linear models:
\[ r_i = y_i - \mu_i \]
With \(\mu_i = g^{-1}(\eta_i)\). That is, the observed value on the response scale minus the value predicted by the model on the response scale (i.e., after inverting the link function).
The problem with raw residuals is that, given the mean-variance relationship, the same distance from the fitted value is interpreted differently depending on the fitted value itself.
In standard linear models, \(\mu\) and \(\sigma^2\) are independent and \(\sigma^2\) is constant. This means that for each \(\mu_i\) the expected variance is always \(\sigma^2\).
In non-Gaussian GLMs, the variance increases with the mean. For example, in Poisson models, \(\mbox{var}(\mu_i) = \mu_i\).
This plot shows an example with the same residual for two different \(x\) values in a Poisson GLM. Apart from the model itself, the same residual can be considered extreme for low \(x\) values and plausible for high \(x\) values:
To take into account the mean-variance relationship we can divide each raw residual by the expected variance at that specific level:
\[ r_{P_i} = \frac{r_i}{\sqrt{\mbox{var}(\mu_i)}} \]
This, in fact, shrinks the residuals when the variance is large, stabilizing the mean-variance relationship.
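As an illustration, here is a minimal sketch with simulated Poisson data (the data and the object name fit_pois are hypothetical, not part of the original example):
# Pearson residuals computed manually and compared with residuals(..., type = "pearson")
set.seed(123)
x <- runif(200)
y <- rpois(200, lambda = exp(1 + 2 * x))
fit_pois <- glm(y ~ x, family = poisson(link = "log"))
mu <- fitted(fit_pois)
r_raw <- y - mu                      # raw residuals
r_pearson <- r_raw / sqrt(mu)        # divide by sqrt(var(mu)) = sqrt(mu) for the Poisson
all.equal(r_pearson, residuals(fit_pois, type = "pearson"))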
[Plot: residuals before and after stabilizing the mean-variance relationship]
Deviance residuals are similar to the Pearson residuals:
\[ r_{D_i} = \mbox{sign}(y_i - \mu_i) \sqrt{d(y_i, \mu_i)} \]
Deviance residuals are the default in R when using the residuals() function on a GLM. When using lm() (i.e., the Gaussian model), raw residuals are computed by default.
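Continuing the hypothetical fit_pois sketch from above, the different types can be requested explicitly:
# Residual types in R (sketch, reusing the hypothetical `fit_pois` from above)
head(residuals(fit_pois))                      # deviance residuals (default for glm objects)
head(residuals(fit_pois, type = "pearson"))    # Pearson residuals
head(residuals(fit_pois, type = "response"))   # raw residuals, y - mu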
This is a less common but very promising type of residual. Dunn & Smyth (1996) is the first paper proposing the idea, expanded with examples in Dunn & Smyth (2018).
The idea is the following:
If everything is well specified (random component, link function, systematic component, etc.), the residuals are normally distributed.
The quantile residuals are particularly useful with discrete responses (e.g., Poisson or binomial), where other residual patterns could be confounded by the discrete nature of the variable.
The process of calculating quantile residuals for discrete random variables is a little bit more complex but clearly described in Dunn & Smyth (2018, pp. 301–304).
\[ r_{Q_i} = \Phi^{-1}\{F(y_i; \mu_i, \phi)\} \]
Where \(F\) is the CDF of the assumed response distribution and \(\Phi^{-1}\) is the quantile function of the standard normal distribution (qnorm in R).
Let’s use a Gamma distribution as an example:
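The data-generating code is not shown here; the following is a hypothetical simulation whose parameters roughly match the model fitted below:
# Hypothetical simulation of the example data (parameters chosen to roughly
# match the fit reported below; the original simulation code is not shown)
set.seed(10)
n <- 500
x <- runif(n, 0, 1)                      # predictor between 0 and 1
mu <- exp(3.9 + 2 * x)                   # log link: eta = 3.9 + 2 * x
shape <- 10                              # dispersion ~ 1/shape ~ 0.1
y <- rgamma(n, shape = shape, rate = shape / mu)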
We want to see if y is related to x, a variable between 0 and 1:
scatter.smooth(x, y, pch = 19, col = scales::alpha("black", 0.2), main = "y ~ x")
Probably yes!
Assume that we know the response follows a Gamma distribution, thus we use a Gamma GLM:
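The model call matches the output below (the object name fit is consistent with its later use):
fit <- glm(y ~ x, family = Gamma(link = "log"))
summary(fit)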
#>
#> Call:
#> glm(formula = y ~ x, family = Gamma(link = "log"))
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.90330 0.02837 137.57 <2e-16 ***
#> x 2.00596 0.04876 41.14 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for Gamma family taken to be 0.09556536)
#>
#> Null deviance: 202.939 on 499 degrees of freedom
#> Residual deviance: 48.715 on 498 degrees of freedom
#> AIC: 5142.7
#>
#> Number of Fisher Scoring iterations: 4
Firstly, let’s use the Gamma CDF (pgamma in R) to compute the cumulative probability of each \(y_i\):
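A sketch of the manual calculation (using the estimated dispersion to get the Gamma shape parameter; the name z matches its use below):
# Manual quantile residuals (sketch): Gamma CDF of each y_i given its fitted
# mean, then mapped to the normal scale with qnorm()
phi <- summary(fit)$dispersion              # estimated dispersion, shape = 1/phi
mu_hat <- fitted(fit)
u <- pgamma(y, shape = 1 / phi, rate = (1 / phi) / mu_hat)  # cumulative probabilities
z <- qnorm(u)                               # quantile residuals on the normal scale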
The residuals can be directly calculated using the statmod::qresid() function (the package from the Dunn & Smyth (2018) book) and they match our calculation:
#> 1 2 3 4 5 6 7
#> 0.1641196 -0.4459411 -1.4659977 -0.6320657 0.6347470 1.1372092 -1.0610529
#> 8 9 10
#> 0.5660843 0.6160270 1.0011647
z[1:10] # our manual calculation
#> [1] 0.1641196 -0.4459411 -1.4659977 -0.6320657 0.6347470 1.1372092
#> [7] -1.0610529 0.5660843 0.6160270 1.0011647
DHARMa package
The DHARMa package is a modern R package based on the idea of quantile residuals. The package supports several models, including (generalized) linear mixed-effects models.
In fact, the package uses a simulation-based version of quantile residuals, but the idea is very similar. See the documentation for more details.
library(DHARMa)
plot(simulateResiduals(fit))
All the previous types of residuals can be considered raw. We can compute standardized versions of them by using the so-called hatvalues. The matrix algebra is beyond my expertise, so I will just give you the intuition about the hat matrix.
The hat matrix \(\mathbf{H}\) (of dimension \(n \times n\)) is calculated as \(\mathbf{H} = \mathbf{X} \left(\mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top}\), where \(\mathbf{X}\) is the \(n \times p'\) model matrix.
The hatvalues \(h_{ii}\) (\(i = 1, \dots, n\)) are the diagonal elements of \(\mathbf{H}\).
The hatvalue \(h_{ii}\) represents the influence of the \(i\)-th observation on its fitted value \(\mu_i\). A large hatvalue means that this specific observation has predictor values with a large influence on the fitted value. These observations are also called leverage points because they can pull the regression line.
The equation for the hatvalues in simple linear regression (\(\mu_i = \beta_0 + \beta_1x_{1i}\)) reduces to:
\[ h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum^n_{j = 1}(x_j - \bar x)^2} \]
We use the simple linear regression as a toy example to have an intuition about hatvalues.
The first term is the contribution of the intercept:
\[ h_i = {\color{red}{\frac{1}{n}}} + \frac{(x_i - \bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2} \]
If no predictors are added to the model, there are no leverage points and each datapoint contributes equally. We have \(p' = 1\), so each observation has influence \(p'/n = 1/n\).
The second term is the contribution of the slope, the effect of \(x\):
\[ h_i = \frac{1}{n} + {\color{red}{\frac{(x_i - \bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2}}} \]
The second term is just the squared deviation of \(x_i\) from the mean, rescaled by the total sum of squares: how far \(x_i\) is from the mean.
This is the intercept-only model. There is no leverage because we are omitting \(x\): the values of \(x\) have no influence on the slope, which is 0 by definition.
When including the predictor \(x\), each observation \(x_i\) tries to pull the regression line toward itself. Some observations have a greater leverage: those far from the mean of \(x\).
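The simulation behind this example is not shown; the following is a hypothetical sketch that creates data, fits fit1 (the object used in the plotting code below), and verifies the simple-regression formula against hatvalues():
# Hypothetical data for the leverage illustration (the original values are not shown)
library(ggplot2)                 # needed for the plotting code below
set.seed(4)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)
fit1 <- lm(y ~ x)
# check the simple-regression formula against hatvalues()
h_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(fit1)), h_manual)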
dat <- data.frame(
x, y, h = hatvalues(fit1)
)
ggplot(dat, aes(x = x, y = y)) +
geom_point(size = 3, aes(color = h)) +
geom_vline(xintercept = mean(x), lty = "dotted") +
geom_smooth(method = "lm", se = FALSE) +
labs(
color = latex2exp::TeX("$h_{ii}$")
)
What is the point for residuals? If observations with a large \(h_{ii}\) pull the regression line toward themselves, this means that on average the residuals are smaller for values with high leverage.
Same model as before but with more observations to show the pattern:
If we divide the residuals by \(1 - h_{ii}\) (or better, \(\sqrt{1 - h_{ii}}\)) we can stabilize this pattern and make residuals comparable, taking the leverage into account.
Dividing each residual by its own leverage term makes the residual variance homogeneous across observations, regardless of the actual leverage.
In R there is the hatvalues() function to extract the diagonal of the \(\mathbf{H}\) matrix and rstandard() to compute the standardized residuals.
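As a sketch (reusing the hypothetical fit1 defined above), the manual standardization for a Gaussian model matches rstandard():
# Standardized residuals by hand for a Gaussian lm (sketch; reuses `fit1` from above)
h <- hatvalues(fit1)
r_std <- residuals(fit1) / (sigma(fit1) * sqrt(1 - h))
all.equal(r_std, rstandard(fit1))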
This is a simulated example where the true systematic component is \(\beta_0 + \beta_1 x_1 + \beta_2 x^2_1\) and we fit the model both with and without the quadratic term:
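The simulation code is not shown; a possible version (the Poisson family and the coefficient values are assumptions, not the original ones) could be:
# Hypothetical version of the simulation (family and coefficients are assumptions)
set.seed(1)
n <- 500
x <- runif(n, -1, 1)
eta <- 1 + 0.5 * x + 1 * x^2                    # beta0 + beta1 * x + beta2 * x^2
y <- rpois(n, lambda = exp(eta))                # Poisson response, log link
fit_x  <- glm(y ~ x, family = poisson(link = "log"))          # quadratic term omitted
fit_x2 <- glm(y ~ x + I(x^2), family = poisson(link = "log")) # correct specification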
First, you can plot the residuals against the fitted values, preferably using standardized Pearson, deviance, or quantile residuals. car::residualPlots(), for example, uses raw Pearson residuals by default.
car::residualPlots(fit_x)
#> Test stat Pr(>|Test stat|)
#> x 9.3063 < 2.2e-16 ***
#> Tukey test 9.3063 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Then we can plot each predictor against the residuals. A pattern here could be caused by a misspecification of the predictor (e.g., linear instead of quadratic).
par(mfrow = c(1, 2))
scatter.smooth(x,
residuals(fit_x),
lpars = list(col = "red", lwd = 3),
xlab = "x",
ylab = "Residuals",
main = "y ~ x")
scatter.smooth(x,
residuals(fit_x2),
lpars = list(col = "red", lwd = 3),
xlab = "x",
ylab = "Residuals",
main = latex2exp::TeX("$y \\sim x + x^2$", bold = TRUE))
Or a complete check with performance::check_model():
performance::check_model(fit_x)
performance::check_model(fit_x2)
As a general principle, after using the appropriate residuals (i.e., correcting for the mean-variance relationship and possibly standardizing), the plots of fitted values against residuals and of predictors against residuals should appear as a flat relationship with approximately constant variance.
We introduced the concept of leverage in the previous slides as an observation with an extreme \(x_i\) value.
An outlier is an observation with a large residual, thus the extremeness is about \(y_i\) (or better, the distance between \(y_i\) and the corresponding fitted value \(\mu_i\)).
Methods for assessing outliers have in common the following principle:
How does the model change in terms of goodness of fit when a certain observation is removed/included?
Outliers are observations that are inconsistent with the fitted model, producing large residuals.
An influential observation is an outlier with high leverage. A value with high leverage is a value that could, in principle, pull the regression line.
In addition, a value that is also inconsistent in \(y\) has greater power to pull the regression line.
When we remove an influential observation from the model, we can have a visible impact on the model fit.
Let’s visually illustrate the difference between the three concepts. Again let’s simulate a simple linear regression without any outlier or influential point:
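The baseline simulation is not shown; a hypothetical sketch consistent with the code below (which expects a data frame dat with columns x, y, and problem) is:
# Hypothetical baseline data: simple linear regression with no outliers or
# influential points; the generating expression mirrors the one used for the
# leverage point below
set.seed(2)
n <- 30
x <- rnorm(n)
y <- 0.3 * 0.1 * x + rnorm(n)
dat <- data.frame(x, y, problem = "none")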
Let’s add a point with very high leverage but still consistent with the model. Remember that a point with high leverage is only extreme on \(x\). Despite the high leverage the impact on the model fit, slope and so on is very limited:
leverage <- data.frame(
y = NA,
x = with(dat, mean(x) + 5 * sd(x)), # 5 standard deviation away
problem = "leverage"
)
leverage$y <- 0.3 * 0.1 * leverage$x + rnorm(1) # simulate from the same model
dat <- rbind(dat, leverage)
with(dat, plot(x, y, pch = 19, cex = 1.5, col = ifelse(problem == "none", 1, 2), main = "without high leverage (red), full dataset (blue)"))
with(dat, abline(lm(y ~ x), col = "dodgerblue", lwd = 2))
with(dat[dat$problem == "none", ], abline(lm(y ~ x), col = "red", lwd = 2))
Now we simulate an observation that is a clear outlier but has small leverage. Still, the impact on the fitted regression line is minimal.
out <- data.frame(
y = with(dat, mean(y) + sd(y) * 4),
x = mean(dat[dat$problem == "none", "x"]),
problem = "outlier"
)
dat <- rbind(dat, out)
dat_out <- dat[dat$problem != "leverage", ]
with(dat_out, plot(x, y, pch = 19, cex = 1.5, main = "without high outlier (red), full dataset (blue)"))
points(dat_out$x[dat_out$problem == "outlier"], dat_out$y[dat_out$problem == "outlier"], col = "red", pch = 19, cex = 1.8)
with(dat_out, abline(lm(y ~ x), col = "dodgerblue", lwd = 2))
with(dat_out[dat_out$problem == "none", ], abline(lm(y ~ x), col = "red", lwd = 2))
This is the deadly combination. We can combine the two previous simulations to show the actual impact. Clearly, this is an extreme example:
influential <- leverage
influential$y <- max(dat$y) + sd(dat$y) * 4 # extreme on y and x
influential$problem <- "influential"
dat <- rbind(dat, influential)
dat_inf <- dat[dat$problem %in% c("none", "influential"), ]
with(dat_inf, plot(x, y, pch = 19, cex = 1.5, col = ifelse(problem == "none", 1, 2), main = "without influential (red), full dataset (blue)"))
with(dat_inf, abline(lm(y ~ x), col = "dodgerblue", lwd = 2))
with(dat_inf[dat_inf$problem == "none", ], abline(lm(y ~ x), col = "red", lwd = 2))
teddy dataset
#>
#> Call:
#> glm(formula = Depression_pp01 ~ Parental_stress, family = binomial(link = "logit"),
#> data = teddy)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.323906 0.690689 -6.260 3.84e-10 ***
#> Parental_stress 0.036015 0.009838 3.661 0.000251 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 284.13 on 378 degrees of freedom
#> Residual deviance: 271.23 on 377 degrees of freedom
#> AIC: 275.23
#>
#> Number of Fisher Scoring iterations: 5
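For completeness, the summary above corresponds to a call like the following (the object name fit_teddy matches its later use and assumes the teddy data frame is available):
fit_teddy <- glm(Depression_pp01 ~ Parental_stress,
                 family = binomial(link = "logit"),
                 data = teddy)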
teddy dataset, \(R^2\)
performance::r2_mcfadden(fit_teddy)
#> # R2 for Generalized Linear Regression
#> R2: 0.045
#> adj. R2: 0.038
performance::r2_tjur(fit_teddy)
#> Tjur's R2
#> 0.03925653
teddy dataset, Tjur’s \(R^2\)
We can do the same manually to check the result:
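A sketch of the manual check, using the fitted probabilities from fit_teddy:
# Manual Tjur's R2: difference between the mean predicted probability of the
# y = 1 group and that of the y = 0 group
p <- fitted(fit_teddy)        # predicted probabilities
y01 <- fit_teddy$y            # observed 0/1 response
abs(mean(p[y01 == 1]) - mean(p[y01 == 0]))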
teddy dataset, residuals
Residuals for binomial models in their binary form are really hard to interpret visually:
car::residualPlots(fit_teddy)
#> Test stat Pr(>|Test stat|)
#> Parental_stress 0.8433 0.3584
teddy dataset, quantile residuals
qr <- statmod::qresiduals(fit_teddy)
qqnorm(qr)
abline(0, 1)
teddy dataset, quantile residuals
plot(DHARMa::simulateResiduals(fit_teddy))
teddy dataset, binned residuals
Gelman and colleagues (Gelman & Hill, 2006) proposed a type of residuals called binned residuals to solve the problem of the previous plot for binomial GLMs:
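A minimal sketch of the idea (the number of bins and the quantile-based binning are arbitrary choices here, not the original implementation):
# Manual binned residuals (sketch): bin observations by fitted probability and
# average the raw residuals within each bin
p <- fitted(fit_teddy)
res <- fit_teddy$y - p
breaks <- unique(quantile(p, probs = seq(0, 1, length.out = 11)))
bins <- cut(p, breaks = breaks, include.lowest = TRUE)
binned_p   <- tapply(p, bins, mean)     # average fitted probability per bin
binned_res <- tapply(res, bins, mean)   # average residual per bin
plot(binned_p, binned_res, pch = 19,
     xlab = "Average fitted probability", ylab = "Average residual")
abline(h = 0, lty = 2)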
teddy dataset, binned residuals
We can use the arm::binnedplot() function (or performance::binned_residuals(), as below) to automatically create and plot the binned residuals:
plot(performance::binned_residuals(fit_teddy))
plot(cooks.distance(fit_teddy)) # influence: Cook's distance for each observation
car::dfbetaPlots(fit_teddy) # no intercept