Filippo Gambarota
University of Padova
2023
Last Update: 2023-11-29
Everything that we discussed for the binomial GLM is also relevant for the Poisson GLM. Here we will focus on the specific features of the Poisson model.
The Poisson distribution is defined as:
\[ p(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!} \] where both the mean and the variance are equal to \(\lambda\).
As the mean increases, the variance also increases and the distribution becomes approximately normal:
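A quick simulation sketch of these two properties (the \(\lambda\) values here are arbitrary):

set.seed(1)
y_small <- rpois(1e5, lambda = 2)
y_large <- rpois(1e5, lambda = 100)
c(mean = mean(y_small), var = var(y_small)) # both close to 2
c(mean = mean(y_large), var = var(y_large)) # both close to 100
hist(y_large, breaks = 50, main = "Poisson(100) is approximately normal")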
The common (and default in R) link function (\(g(\lambda)\)) for the Poisson distribution is the log, and the inverse of the link function is the exponential.
\[\begin{align*} log(\lambda) = \beta_0 + \beta_{1}X_{1} + ...\beta_pX_p \\ \lambda = e^{\beta_0 + \beta_{1}X_{1} + ...\beta_pX_p} \end{align*}\]
Let’s start with a simple example, trying to explain the number of solved math exercises by students (N = 100) as a function of the number of hours of study.
id | studyh | solved |
---|---|---|
1 | 32 | 19 |
2 | 38 | 20 |
3 | 67 | 41 |
4 | 10 | 12 |
... | ... | ... |
97 | 81 | 57 |
98 | 87 | 66 |
99 | 25 | 8 |
100 | 22 | 16 |
We can fit the model using the `glm` function in R, setting the appropriate random component (`family`) and the link function (`link`):
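A sketch of the fitting code, matching the Call reported in the output below (the data frame is assumed to be named `dat` and the model object `fit`, as used later):

fit <- glm(solved ~ studyh, family = poisson(link = "log"), data = dat)
summary(fit)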
#>
#> Call:
#> glm(formula = solved ~ studyh, family = poisson(link = "log"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.2596 -0.7741 -0.1367 0.7117 2.6079
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.2830811 0.0505563 45.16 <2e-16 ***
#> studyh 0.0200652 0.0007048 28.47 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 1011.993 on 99 degrees of freedom
#> Residual deviance: 94.958 on 98 degrees of freedom
#> AIC: 616.12
#>
#> Number of Fisher Scoring iterations: 4
- `(Intercept)`: \(2.283\) is the log of the expected number of solved exercises for a student with 0 hours of studying. Taking the exponential we obtain the estimate on the response scale, \(9.807\).
- `studyh`: \(0.020\) is the increase in the expected (log) number of solved exercises for a one-unit increase in hours of studying. Taking the exponential we obtain the multiplicative increase (ratio) in the number of solved exercises, \(1.020\).
Again, as in the binomial model, the effects are linear on the log scale but non-linear on the response scale.
The non-linearity can be easily seen using the `predict()` function:
linear <- predict(fit, newdata = data.frame(studyh = c(10, 11)))
diff(linear) # same as the studyh coefficient (beta1)
#> 2
#> 0.02006518

# predictions on the response scale
nonlinear <- predict(fit, newdata = data.frame(studyh = c(10, 11)), type = "response")
diff(nonlinear)
#> 2
#> 0.2429288

# ratio of increase when using the response scale
nonlinear[2]/nonlinear[1]
#> 2
#> 1.020268

# same as exponentiating the studyh coefficient
exp(coef(fit)["studyh"])
#> studyh
#> 1.020268
Let’s make a similar example with the number of solved exercises, comparing students who attended online classes with students attending in person. The `class_c` variable is the dummy-coded version of `class` (0 = online, 1 = inperson).
id | class | class_c | solved |
---|---|---|---|
1 | online | 0 | 8 |
2 | inperson | 1 | 18 |
3 | online | 0 | 12 |
4 | inperson | 1 | 12 |
... | ... | ... | ... |
97 | online | 0 | 10 |
98 | inperson | 1 | 21 |
99 | online | 0 | 8 |
100 | inperson | 1 | 17 |
By default, R codes categorical variables using dummy coding. In this case the reference category is `online`.
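As before, a sketch of the fitting code matching the Call in the output below:

fit <- glm(solved ~ class, family = poisson(link = "log"), data = dat)
summary(fit)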
#>
#> Call:
#> glm(formula = solved ~ class, family = poisson(link = "log"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.38818 -0.77486 -0.07274 0.74539 2.81362
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.22138 0.04657 47.695 <2e-16 ***
#> classinperson 0.48801 0.05917 8.248 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 173.39 on 99 degrees of freedom
#> Residual deviance: 103.33 on 98 degrees of freedom
#> AIC: 534.41
#>
#> Number of Fisher Scoring iterations: 4
- `(Intercept)`: the expected (log) number of solved exercises when `class` is 0, i.e. the expected number of solved exercises for online students.
- `classinperson`: the difference in (log) solved exercises between in-person and online classes. On the response scale it is the ratio of the expected number of solved exercises: students attending in person solve 162.91% of the exercises solved by students attending online (see the check below).
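The 162.91% figure is simply the exponential of the `classinperson` coefficient; a quick check (assuming the model object is named `fit`):

exp(coef(fit)["classinperson"]) # exp(0.48801) ~ 1.6291, i.e. 162.91% of the online rate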
Overdispersion concerns observing a greater variance than would be expected under the model.
The overdispersion \(\phi\) can be estimated using Pearson Residuals:
\[\begin{align*} \hat \phi = \frac{\sum_{i = 1}^n \frac{(y_i - \hat y_i)^2}{\hat y_i}}{n - p - 1} \end{align*}\]
Where the numerator is the sum of squared Pearson residuals, \(n\) is the number of observations and \(p\) the number of predictors. For standard binomial and Poisson models \(\phi = 1\).
If the model is correctly specified, for binomial and Poisson models this ratio is equal to 1; if the ratio is \(> 1\) there is evidence for overdispersion. In practical terms, if the residual deviance is higher than the residual degrees of freedom, there is evidence of overdispersion.
P <- sum(residuals(fit, type = "pearson")^2)
P / df.residual(fit) # df.residual(fit) is nrow(dat) - length(coef(fit))
#> [1] 1.036323
To formally test for overdispersion, i.e. to test whether the ratio is significantly different from 1, we can use the `performance::check_overdispersion()` function.
performance::check_overdispersion(fit)
#> # Overdispersion test
#>
#> dispersion ratio = 1.036
#> Pearson's Chi-Squared = 101.560
#> p-value = 0.383
Pearson residuals are defined as:
\[\begin{align*} r_p = \frac{y_i - \hat y_i}{\sqrt{V(\hat y_i)}}, \quad V(\hat y_i) = \hat y_i \end{align*}\]
Remember that the mean and the variance are the same in Poisson models. If the model is correct, the standardized residuals should be normally distributed with mean 0 and variance 1.
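As a sketch (assuming the fitted Poisson model is `fit` and the response column is `dat$solved`), the built-in Pearson residuals can be reproduced manually:

r_manual <- (dat$solved - fitted(fit)) / sqrt(fitted(fit))
r_builtin <- residuals(fit, type = "pearson")
all.equal(unname(r_manual), unname(r_builtin)) # should be TRUE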
Overdispersion can also be expressed in terms of the variance-to-mean ratio: when the ratio is 1, there is no evidence of overdispersion.
There could be multiple causes of overdispersion; two common ones, illustrated below, are the presence of outliers and the omission of relevant predictors.
This (simulated) dataset contains \(n = 30\) observations coming from a Poisson model of the form \(y = 1 + 2x\) and \(n = 7\) observations coming from a model \(y = 1 + 10x\). Clearly the sum of squared Pearson residuals is inflated by these values, producing more variance than expected.
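The mean and the variance of the response can be compared directly; the output below was presumably produced by something like this (assuming the simulated data frame is `dat` with response `y`):

c(mean = mean(dat$y), var = var(dat$y))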
#> mean var
#> 2.756757 6.689189
performance::check_overdispersion(fit)
#> # Overdispersion test
#>
#> dispersion ratio = 1.515
#> Pearson's Chi-Squared = 53.019
#> p-value = 0.026
Let’s imagine analyzing again the dataset with the number of solved exercises. We have the effect of the `studyh` variable and, in addition, the effect of the `class` variable, without interaction.
id | class | studyh | class_c | lp | solved |
---|---|---|---|---|---|
1 | online | 82 | 0 | 22.613 | 24 |
2 | inperson | 65 | 1 | 47.734 | 50 |
3 | online | 12 | 0 | 11.268 | 15 |
4 | inperson | 54 | 1 | 42.785 | 50 |
... | ... | ... | ... | ... | ... |
37 | online | 85 | 0 | 23.298 | 35 |
38 | inperson | 55 | 1 | 43.213 | 43 |
39 | online | 80 | 0 | 22.167 | 26 |
40 | inperson | 77 | 1 | 53.788 | 54 |
We can also have a look at the data:
Now let’s fit the model considering only `studyh` and ignoring the group:
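A sketch of the corresponding code, matching the Call in the output below (the object is named `fit`, consistent with its later use in `check_overdispersion()`):

fit <- glm(solved ~ studyh, family = poisson(link = "log"), data = dat)
summary(fit)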
#>
#> Call:
#> glm(formula = solved ~ studyh, family = poisson(link = "log"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -5.047 -2.247 -0.107 2.214 3.246
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.692791 0.068831 39.12 <2e-16 ***
#> studyh 0.013624 0.001063 12.82 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 417.99 on 39 degrees of freedom
#> Residual deviance: 243.99 on 38 degrees of freedom
#> AIC: 451.39
#>
#> Number of Fisher Scoring iterations: 4
Essentially, we are fitting an average relationship across groups, but the model ignores that the two groups differ; the observed variance is therefore higher because two separate means would be needed to capture the `class` effect.
By fitting the appropriate model, the overdispersion is no longer a problem:
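A sketch of the corresponding code (the object is named `fit2`, consistent with its later use in `check_overdispersion()`):

fit2 <- glm(solved ~ studyh + class, family = poisson(link = "log"), data = dat)
summary(fit2)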
#>
#> Call:
#> glm(formula = solved ~ studyh + class, family = poisson(link = "log"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.93006 -0.48422 0.00417 0.43834 1.97814
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.275159 0.078239 29.08 <2e-16 ***
#> studyh 0.010895 0.001068 10.21 <2e-16 ***
#> classinperson 0.907365 0.065373 13.88 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 417.993 on 39 degrees of freedom
#> Residual deviance: 29.022 on 37 degrees of freedom
#> AIC: 238.41
#>
#> Number of Fisher Scoring iterations: 4
performance::check_overdispersion(fit)
#> # Overdispersion test
#>
#> dispersion ratio = 6.115
#> Pearson's Chi-Squared = 232.369
#> p-value = < 0.001
performance::check_overdispersion(fit2)
#> # Overdispersion test
#>
#> dispersion ratio = 0.777
#> Pearson's Chi-Squared = 28.738
#> p-value = 0.832
Also the residuals plots clearly improved after including all relevant predictors:
Before analyzing the two main strategies to deal with overdispersion, it is important to understand why it is very problematic.
Although the estimated parameters (\(\beta\)) are not affected by overdispersion, the standard errors are underestimated, and the underestimation increases as the severity of the overdispersion increases. Underestimated standard errors produce (biased) higher \(z\) scores, inflating the type-1 error (i.e., lower p values).
The estimated overdispersion of `fit` is ~ 6.4208936. The `summary()` function in R has a `dispersion` argument to check how the model parameters are affected.
phi <- fit$deviance / fit$df.residual
summary(fit, dispersion = 1)$coefficients # the default
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.69279058 0.068830558 39.12202 0.000000e+00
#> studyh 0.01362422 0.001062724 12.82009 1.265575e-37
summary(fit, dispersion = 2)$coefficients # assuming phi = 2
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.69279058 0.097341109 27.663447 1.923115e-168
#> studyh 0.01362422 0.001502919 9.065171 1.244091e-19
summary(fit, dispersion = phi)$coefficients # the appropriate
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.69279058 0.174413070 15.439156 8.926080e-54
#> studyh 0.01362422 0.002692888 5.059333 4.207258e-07
By using multiple values for \(\phi\) we can see the impact on the standard error. \(\phi = 1\) is what is assumed by the Poisson model.
If all the relevant variables are included and there are no outliers, the phenomenon itself may simply contain more variability than what is predicted by the Poisson model. There are two main approaches to deal with this situation: the quasi-Poisson model and the negative binomial model.
The quasi-Poisson model is essentially a Poisson model that estimates the \(\phi\) parameter and adjusts the standard errors accordingly. Again, assuming we fit the `studyh`-only model (with overdispersion):
fit <- glm(solved ~ studyh, data = dat, family = quasipoisson(link = "log"))
summary(fit)
#>
#> Call:
#> glm(formula = solved ~ studyh, family = quasipoisson(link = "log"),
#> data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -5.047 -2.247 -0.107 2.214 3.246
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.692791 0.170209 15.821 < 2e-16 ***
#> studyh 0.013624 0.002628 5.184 7.46e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for quasipoisson family taken to be 6.115055)
#>
#> Null deviance: 417.99 on 39 degrees of freedom
#> Residual deviance: 243.99 on 38 degrees of freedom
#> AIC: NA
#>
#> Number of Fisher Scoring iterations: 4
The quasi-Poisson model estimates the same parameters (\(\beta\)) and adjusts the standard errors. Notice that using `summary(dispersion = )` is the same as fitting a quasi-Poisson model and using the estimated \(\phi\).
The quasi-Poisson model is useful because it is a very simple way to deal with overdispersion. The variance (\(V(\mu)\)) is no longer \(\mu\) but \(V(\mu) = \mu\phi\). When \(\phi\) is close to 1, the quasi-Poisson model reduces to a standard Poisson model.
Here we compare the expected variance (\(\pm 2\sqrt{V(y_i)}\))1 of the Quasi-Poisson and Poisson models:
The main problem of quasi-* models is that they do not correspond to a specific distribution family and there is no likelihood function. For this reason, we cannot perform model comparison with the standard AIC/BIC. See (Ver Hoef & Boveng, 2007) for an overview.
A negative binomial model is a different probability distribution with two parameters: the mean, as in the standard Poisson model, and a dispersion parameter. Similarly to the quasi-Poisson model, it estimates a dispersion parameter.
Practically the negative-binomial model can be considered as a hierarchical model:
\[\begin{align*} y_i \sim Poisson(\lambda_i) \\ \lambda_i \sim Gamma(\mu, \frac{1}{\theta}) \end{align*}\]
In this way, the Gamma distribution regulates the variability of \(\lambda\), which otherwise is fixed to a single value in the Poisson model.
The Poisson-Gamma mixture can be expressed as:
\[ p(y; \mu, \theta) = \frac{\Gamma(y + \theta)}{\Gamma(\theta)\Gamma(y + 1)}\left(\frac{\mu}{\mu + \theta}\right)^{y} \left(\frac{\theta}{\mu + \theta}\right)^\theta \]
Where \(\theta\) is the overdispersion parameter and \(\Gamma()\) is the gamma function. The mean is \(\mu\) and the variance is \(\mu + \frac{\mu^2}{\theta}\). \(\theta\) is the inverse of the overdispersion, in the sense that as \(1/\theta\) increases the data are more overdispersed. As \(1/\theta\) approaches 0, the negative binomial reduces to a Poisson model.
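A small simulation sketch (with arbitrary \(\mu\) and \(\theta\)) showing that the Poisson-Gamma mixture reproduces the negative binomial mean and variance:

set.seed(1)
n <- 1e5; mu <- 10; theta <- 5
lambda <- rgamma(n, shape = theta, rate = theta / mu) # mean mu, variance mu^2 / theta
y <- rpois(n, lambda)
c(mean = mean(y), var = var(y), expected_var = mu + mu^2 / theta)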
Compared to the Poisson model, the negative binomial allows for overdispersion by estimating the parameter \(\theta\); compared to the quasi-Poisson model, the variance is not a linear function of the mean (\(V(\mu) = \phi\mu\)) but has a quadratic relationship, \(V(\mu) = \mu + \mu^2/\theta\).
We can use the `MASS` package to work with the negative binomial distribution, for example through `MASS::rnegbin()`:
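For example (values chosen arbitrarily here), a larger \(\theta\) means a variance closer to the mean:

set.seed(1)
y1 <- MASS::rnegbin(1e5, mu = 10, theta = 1)
y10 <- MASS::rnegbin(1e5, mu = 10, theta = 10)
c(var_theta1 = var(y1), var_theta10 = var(y10)) # expected: 10 + 100/1 = 110 and 10 + 100/10 = 20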
The \(\theta\) parameter is the dispersion parameter. Note that it is not the same as \(\phi\) in the quasi-Poisson model. As \(\theta\) increases, the overdispersion is reduced and the model becomes similar to a standard Poisson model.
For fitting a negative binomial model we cannot use the `glm` function; we need the `MASS::glm.nb()` function instead. The syntax is almost the same, but we do not need to specify the family because this function only fits negative binomial models. Let’s simulate some data coming from a negative binomial distribution with \(\theta = 10\):
id | x | y |
---|---|---|
1 | 0.427 | 11 |
2 | 0.365 | 13 |
... | ... | ... |
99 | 0.966 | 41 |
100 | 0.154 | 12 |
Then we can fit the model:
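A sketch of the fitting code, matching the Call in the output below (the object is named `fit_nb`, consistent with its later use in `car::compareCoefs()`):

fit_nb <- MASS::glm.nb(y ~ x, data = dat)
summary(fit_nb)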
#>
#> Call:
#> MASS::glm.nb(formula = y ~ x, data = dat, init.theta = 11.32327809,
#> link = log)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.7104 -0.7507 -0.1075 0.6324 2.4510
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.17693 0.08282 26.28 <2e-16 ***
#> x 1.82119 0.13811 13.19 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for Negative Binomial(11.3233) family taken to be 1)
#>
#> Null deviance: 273.823 on 99 degrees of freedom
#> Residual deviance: 98.035 on 98 degrees of freedom
#> AIC: 696.14
#>
#> Number of Fisher Scoring iterations: 1
#>
#>
#> Theta: 11.32
#> Std. Err.: 2.38
#>
#> 2 x log-likelihood: -690.143
Here we compare the expected variance (\(\pm 2\sqrt{V(y_i)}\)) of the Negative binomial and Poisson models:
Here we compare the expected variance (\(\pm 2\sqrt{V(y_i)}\)) of the three models:
We can compare the coefficients fitting the Poisson, the Quasi-Poisson and the Negative-binomial model:
fit_p <- glm(y ~ x, data = dat, family = poisson(link = "log"))
fit_qp <- glm(y ~ x, data = dat, family = quasipoisson(link = "log"))
car::compareCoefs(fit_nb, fit_p, fit_qp)
#> Calls:
#> 1: MASS::glm.nb(formula = y ~ x, data = dat, init.theta = 11.32327809, link
#> = log)
#> 2: glm(formula = y ~ x, family = poisson(link = "log"), data = dat)
#> 3: glm(formula = y ~ x, family = quasipoisson(link = "log"), data = dat)
#>
#> Model 1 Model 2 Model 3
#> (Intercept) 2.1769 2.2043 2.2043
#> SE 0.0828 0.0534 0.0970
#>
#> x 1.8212 1.7723 1.7723
#> SE 0.1381 0.0802 0.1456
#>
1. If you want to try to simulate NB data you can use the `MASS::rnegbin(n, mu, theta)` function or the `rnb(n, mu, vmr)` custom function, which could be more intuitive because it requires \(\mu\) (i.e., the mean) and `vmr`, the desired variance-mean ratio. Using `message = TRUE` it will print the corresponding \(\theta\) value.
You can also use the `theta` parameter within the `rnb` function. As \(\theta\) increases, the overdispersion is reduced; thus you can think of \(1/\theta\) as an index of overdispersion:
y <- rnb(1e5, mu = 10, theta = 10, message = TRUE)
y <- rnb(1e5, mu = 10, theta = 5, message = TRUE)
y <- rnb(1e5, mu = 10, theta = 1, message = TRUE)
The Deviance based pseudo-\(R^2\) is computed from the ratio between the residual deviance and the null deviance1:
\[ R^2 = 1 - \frac{D_{current}}{D_{null}} \]
1 - fit_p$deviance/fit_p$null.deviance
#> [1] 0.6229535
McFadden’s pseudo-\(R^2\) is computed from the ratio between the log-likelihood of the current model and that of the intercept-only (i.e., null) model (McFadden, 1987):
\[ R^2 = 1 - \frac{\log(\mathcal{L_{current}})}{\log(\mathcal{L_{null}})} \]
There is also an adjusted version that takes into account the number of parameters of the model. In R it can be computed manually or using `performance::r2_mcfadden()`:
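A manual computation might look like this (a sketch, assuming `fit_p` is the Poisson model fitted above):

fit_null <- update(fit_p, . ~ 1) # intercept-only (null) model
1 - as.numeric(logLik(fit_p)) / as.numeric(logLik(fit_null))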
performance::r2_mcfadden(fit_p)
#> # R2 for Generalized Linear Regression
#> R2: 0.396
#> adj. R2: 0.394
Cox and Snell’s \(R^2\) (Cox & Snell, 1989) is defined as:
\[ R^2 = 1 - \left(\frac{\mathcal{L_{null}}}{\mathcal{L_{current}}}\right)^{\frac{2}{n}} \]
Nagelkerke (Nagelkerke, 1991) provides a correction to set the range of values between 0 and 1.
performance::r2_nagelkerke(fit_p)
#> Nagelkerke's R2
#> 0.9950421
Let’s simulate a GLM where there is a main effect but no interaction:
dat <- sim_design(200, nx = list(x = rnorm(200)), cx = list(g = c("a", "b")))
dat <- sim_data(dat, exp(log(10) + log(1.5) * x + log(2) * g_c + 0 * g_c * x), model = "poisson")
dat |>
ggplot(aes(x = x, y = y, color = g)) +
geom_point() +
stat_smooth(method = "glm",
method.args = list(family = poisson()))
Let’s fit a linear model and the Poisson GLM with the interaction term:
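A sketch of the two fits, matching the Calls in the outputs below (the object names `fit_lm` and `fit_glm` are consistent with their later use in the plotting code):

fit_lm <- lm(y ~ x * g, data = dat)
fit_glm <- glm(y ~ x * g, family = poisson(link = "log"), data = dat)
summary(fit_lm)
summary(fit_glm)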
#>
#> Call:
#> lm(formula = y ~ x * g, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -10.9594 -2.7974 -0.3394 2.4466 14.8487
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 11.0610 0.2941 37.603 <2e-16 ***
#> x 3.9829 0.2979 13.371 <2e-16 ***
#> g2 9.9577 0.4163 23.921 <2e-16 ***
#> x:g2 4.0287 0.4347 9.269 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.116 on 396 degrees of freedom
#> Multiple R-squared: 0.7717, Adjusted R-squared: 0.77
#> F-statistic: 446.3 on 3 and 396 DF, p-value: < 2.2e-16
#>
#> Call:
#> glm(formula = y ~ x * g, family = poisson(link = "log"), data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.61194 -0.71125 -0.05532 0.55622 2.75160
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.33573 0.02238 104.350 <2e-16 ***
#> x 0.38286 0.02254 16.986 <2e-16 ***
#> g2 0.64219 0.02765 23.226 <2e-16 ***
#> x:g2 0.02549 0.02850 0.894 0.371
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 1816.48 on 399 degrees of freedom
#> Residual deviance: 385.92 on 396 degrees of freedom
#> AIC: 2157.8
#>
#> Number of Fisher Scoring iterations: 4
The linear model is estimating a completely misleading interaction due to the mean-variance relationship of the Poisson distribution.
lp_plot <- dat |>
select(id, g, x) |>
add_predict(fit_glm, se.fit = TRUE) |>
ggplot(aes(x = x,
y = fit,
fill = g,
ymin = fit - se.fit * 2,
ymax = fit + se.fit * 2)) +
geom_line() +
geom_ribbon(alpha = 0.5) +
ylab("y") +
ggtitle("Linear Predictor")
lp_plot | plot(ggeffect(fit_glm, terms = c("x", "g"))) | plot(ggeffect(fit_lm, terms = c("x", "g")))