Generalized Linear Models

Introduction

Filippo Gambarota

University of Padova

2023

Last Update: 2023-11-29

Beyond the Gaussian distribution

Quick recap about Gaussian distribution

  • The Gaussian distribution is part of the Exponential family
  • It is defined by its mean (\(\mu\)) and standard deviation (\(\sigma\))
  • It is symmetric (mean, mode and median are the same)
  • The support is \((- \infty, + \infty)\)

The Probability Density Function (PDF) is:

\[ f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2} \]
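As a quick check, the PDF above is exactly what R's dnorm() computes:

```r
# compare the PDF formula with R's built-in dnorm()
x <- 1.5; mu <- 0; sigma <- 2
manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
all.equal(manual, dnorm(x, mean = mu, sd = sigma))
#> [1] TRUE
```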


But…

In Psychology, variables do not always satisfy the properties of the Gaussian distribution. For example:

  • Reaction times
  • Percentages or proportions (e.g., task accuracy)
  • Counts
  • Likert scales

Reaction times

Non-negative and typically right-skewed data:

Binary outcomes

A series of binary (i.e., Bernoulli) trials with two possible outcomes:

Counts

Number of symptoms for a group of patients during the last month:

Should we use a linear model for these variables?


Example: probability of passing the exam as a function of hours of study:

id  studyh  passed
 1      27       0
 2      28       1
 3      76       1
 4       2       0
...    ...     ...
47      68       1
48      69       1
49      91       1
50      10       0
dat |> 
  summarise(n = n(),
            npass = sum(passed),
            nfail = n - npass,
            ppass = npass / n)
#>    n npass nfail ppass
#> 1 50    21    29  0.42

Should we use a linear model for these variables?

Let’s plot the data:

Should we use a linear model for these variables?

Let’s fit a linear model passed ~ studyh using lm:

Do you see something strange?

Should we use a linear model for these variables?

A little spoiler: the relationship should probably look like this:

Should we use a linear model for these variables?

Another example, the number of solved exercises in a semester as a function of the number of attended lectures (\(N = 100\)):

 id  nattended  nsolved
  1         11        2
  2          5        1
  3         45        8
  4          5        0
...        ...      ...
 97         62       20
 98         56       14
 99         43       11
100         11        2

Should we use a linear model for these variables?


Again, fitting the linear model seems partially appropriate but there are some problems:


Should we use a linear model for these variables?

Also the residuals are quite problematic:

Should we use a linear model for these variables?

Another little spoiler, the model should consider both the support of the y variable and the non-linear pattern. Probably something like this:
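As a sketch (with simulated data, since the original dataset is not reproduced here), a Poisson GLM with a log link respects both constraints: predictions are always non-negative and the mean grows exponentially with the predictor:

```r
# hypothetical simulation mimicking the attended-lectures example
set.seed(123)
nattended <- sample(0:70, 100, replace = TRUE)
nsolved <- rpois(100, lambda = exp(0.5 + 0.04 * nattended))
fit <- glm(nsolved ~ nattended, family = poisson(link = "log"))
# unlike lm(), predictions on the response scale are never negative
all(predict(fit, type = "response") >= 0)
#> [1] TRUE
```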

So what?

Both linear models somehow capture the expected relationship but there are serious fitting problems:

  • impossible predictions
  • poor fitting for non-linear patterns
  • linear regression assumptions violated

As a general rule in statistics:

All models are wrong, some are useful

We need a new class of models…

  • Taking into account the specific features of our response variable
  • Working on a linear scale when fitting the model and for inference
  • Being closer to the true data generation process

Generalized Linear Models

Main references

For a detailed introduction to GLMs

  • Chapters: 1 (intro), 4 (GLM fitting), 5 (GLM for binary data)

Main references

For a basic and well-written introduction to GLMs, especially the Binomial GLM

  • Chapters: 3 (intro GLMs), 4-5 (Binomial Logistic Regression)

Main references

Great resource for interpreting Binomial GLM parameters:

  • Chapters: 13-14 (Binomial Logistic GLM), 15 (Poisson and other GLMs)

Main references

A detailed GLM book, very useful especially for the diagnostics part:

  • Chapters: 8 (intro), 9 (Binomial GLM), 10 (Poisson GLM and overdispersion)

Main references

The holy book :)

  • Chapters: 14 and 15

Main references

Another good reference…

  • Chapter: 8

General idea

  • using distributions beyond the Gaussian
  • modeling non-linear functions on the response scale
  • taking into account mean-variance relationships

Recipe for a GLM

  • Random Component
  • Systematic Component
  • Link Function

Random Component

The random component of a GLM identifies the response variable \(y\) as coming from a certain probability distribution.

Systematic Component

The systematic component or linear predictor (\(\eta\)) of a GLM is \(\beta_0 + \beta_1x_{1i} + ... + \beta_px_{pi}\).

\[\begin{align*} \eta = \beta_0 + \beta_1x_1 + ... + \beta_px_p \end{align*}\]

This part is invariant to the type of model and is the combination of explanatory variables to predict the expected value \(\mu\) (i.e. the mean) of the distribution.
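The third ingredient of the recipe, the link function \(g\), connects the two components by mapping the expected value onto the scale of the linear predictor:

\[ g(\mu) = \eta = \beta_0 + \beta_1x_1 + ... + \beta_px_p \qquad \mu = g^{-1}(\eta) \]

For the Gaussian case \(g\) is the identity function, which is why the standard linear model is a special case of the GLM.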

Relevant distributions

Bernoulli distribution

A single Bernoulli trial is defined as:

\[ f(x, p) = p^x (1 - p)^{1 - x} \]

Where \(p\) is the probability of success and \(x\) denotes the two possible outcomes \(0\) and \(1\). The mean is \(p\) and the variance is \(p(1 - p)\)
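Since a Bernoulli trial is just a Binomial with a single trial, we can check the PMF above directly with dbinom():

```r
# the Bernoulli PMF is a Binomial with size = 1
p <- 0.7
x <- c(0, 1) # the two possible outcomes
p^x * (1 - p)^(1 - x)
#> [1] 0.3 0.7
dbinom(x, size = 1, prob = p)
#> [1] 0.3 0.7
```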

Binomial distribution

The probability of having \(k\) success (e.g., 0, 1, 2, etc.), out of \(n\) trials with a probability of success \(p\) (and failing \(q = 1 - p\)) is:

\[ f(n, k, p)= \binom{n}{k} p^k(1 - p)^{n - k} \]

The mean of the binomial distribution is \(np\) and the variance is \(npq = np(1-p)\). The binomial distribution is just the repetition of \(n\) independent Bernoulli trials.

Bernoulli and Binomial

A classical example for a Bernoulli trial is the coin flip. In R:

n <- 1
p <- 0.7
rbinom(1, n, p) # a single Bernoulli trial
#> [1] 0
n <- 10
rbinom(10, 1, p) # n Bernoulli trials
#>  [1] 1 1 1 1 0 0 1 1 0 0
rbinom(1, n, p) # binomial version
#> [1] 6

Binomial

Code
n <- 30
p <- 0.7
dat <- data.frame(k = 0:n)
dat$y <- dbinom(dat$k, n, p)

dat |> 
    ggplot(aes(x = k, y = y)) +
    geom_point() +
    geom_segment(aes(x = k, xend = k, y = 0, yend = y)) +
    mytheme() +
    ylab("dbinom(x, n, p)") +
    ggtitle(latex2exp::TeX("$n = 30$, $p = 0.7$"))

Binomial in R

# generate k (success)
n <- 50 # number of trials
p <- 0.7 # probability of success
rbinom(1, n, p)
#> [1] 38

# let's do several experiments (e.g., participants)
rbinom(10, n, p)
#>  [1] 40 37 32 27 32 39 36 41 31 33

# calculate the probability of exactly k successes
n <- 50
k <- 25
p <- 0.5
dbinom(k, n, p)
#> [1] 0.1122752

# calculate the probability of observing at most k successes
n <- 50
k <- 25
p <- 0.5
pbinom(k, n, p)
#> [1] 0.5561376

Binomial GLM

The Bernoulli or the Binomial distribution is used as the random component when we have a binary dependent variable or the number of successes over the total number of trials:

  • Accuracy on a cognitive task
  • Patients recovered or not after a treatment
  • People passing or not the exam


When fitting a GLM with the Binomial distribution, the linear predictor models the expected value \(\mu\), i.e., the probability of success.
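As a sketch (with simulated data mirroring the exam example, since the original dataset is not included here), fitting a Binomial GLM in R uses glm() with family = binomial:

```r
# hypothetical simulation of the hours-of-study example
set.seed(2023)
studyh <- runif(50, 0, 100)
passed <- rbinom(50, 1, plogis(-3 + 0.06 * studyh))
fit <- glm(passed ~ studyh, family = binomial(link = "logit"))
# predicted probabilities are bounded in [0, 1], unlike lm()
pr <- predict(fit, type = "response")
all(pr >= 0 & pr <= 1)
#> [1] TRUE
```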

Binomial GLM

Most GLMs deal with a mean-variance relationship:

Code
p <- seq(0, 1, 0.01)
par(mfrow = c(1,2))
curve(plogis(x), 
      -4, 4, 
      ylab = "Probability", 
      xlab = "x", 
      lwd = 2,
      main = "Logistic Distribution")
plot(p, 
     p * (1 - p), 
     xlim = c(0, 1),
     xlab = latex2exp::TeX("$\\mu = p$"), 
     ylab = latex2exp::TeX("$V = p(1 - p)$"),
     type = "l",
     lwd = 2,
     main = "Mean-Variance Relationship")

Poisson distribution

The number of events \(k\) during a fixed time interval (e.g., number of new users on a website in 1 week) is:

\[\begin{align*} f(k,\lambda) = Pr(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \end{align*}\]

Where \(k\) is the number of occurrences (\(k = 0, 1, 2, ...\)), \(e\) is Euler’s number (\(e = 2.71828...\)) and \(!\) is the factorial function. The mean and the variance of the Poisson distribution are both \(\lambda\).
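The PMF above is exactly what dpois() computes:

```r
# compare the Poisson PMF formula with R's built-in dpois()
k <- 3
lambda <- 2
manual <- lambda^k * exp(-lambda) / factorial(k)
all.equal(manual, dpois(k, lambda))
#> [1] TRUE
```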

Poisson distribution

As \(\lambda\) increases, the distribution is well approximated by a Gaussian distribution, but the Poisson is discrete.

Poisson distribution in R

n <- 1000
x <- rpois(n, lambda = 30)

mean(x)
#> [1] 30.372
var(x)
#> [1] 31.43305

We can also use all the other functions such as dpois(), ppois(), and qpois().

Poisson distribution in R

The mean-variance relationship can be easily seen with a continuous predictor:

Code
x <- rnorm(1000)
y <- rpois(1000, exp(log(10) + log(1.8)*x))

plot(x, y, pch = 19)

Gamma distribution

The Gamma distribution has several parametrizations. One of the most common is the shape-scale parametrization:

\[ f(x;k,\theta )={\frac {x^{k-1}e^{-x/\theta }}{\theta ^{k}\Gamma (k)}} \]

Where \(\theta\) is the scale parameter and \(k\) is the shape parameter.
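The shape-scale PDF above is what dgamma() computes:

```r
# compare the shape-scale PDF formula with R's built-in dgamma()
x <- 5
k <- 2      # shape
theta <- 3  # scale
manual <- x^(k - 1) * exp(-x / theta) / (theta^k * gamma(k))
all.equal(manual, dgamma(x, shape = k, scale = theta))
#> [1] TRUE
```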

Gamma distribution

Code
ggamma(mean = c(10, 20, 30), sd = c(10, 10, 10), show = "ss")

Gamma \(\mu\) and \(\sigma^2\)

The mean and variance are defined as:

  • \(\mu = k \theta\) and \(\sigma^2 = k \theta^2\) with the shape-scale parametrization
  • \(\mu = \frac{\alpha}{\beta}\) and \(\sigma^2 = \frac{\alpha}{\beta^2}\) with the shape-rate parametrization

Another important quantity is the coefficient of variation defined as \(\frac{\sigma}{\mu}\) or \(\frac{1}{\sqrt{k}}\) (or \(\frac{1}{\sqrt{\alpha}}\)).

Gamma distribution

Again, we can see the mean-variance relationship:

Code
ggamma(shape = c(5, 5), scale = c(10, 20), show = "ss")

Gamma parametrization

To convert between different parametrizations, you can use the gamma_params() function:

gamma_params <- function(shape = NULL, scale = 1/rate, rate = 1,
                         mean = NULL, sd = NULL,
                         eqs = FALSE){
  if(eqs){
    # eqs() is an external helper (not defined here) that prints the formulas
    cat(rep("=", 25), "\n")
    cat(eqs()$gamma, "\n")
    cat(rep("=", 25), "\n")
  }else{
      if(is.null(shape)){
      var <- sd^2
      shape <- mean^2 / var
      scale <- mean / shape
      rate <- 1/scale
    } else if(is.null(mean) & is.null(sd)){
      rate <- 1/scale # scale defaults to 1/rate, keeping the two consistent
      mean <- shape * scale
      var <- shape * scale^2
      sd <- sqrt(var)
    }else{
      stop("when shape and scale are provided, mean and sd need to be NULL (and viceversa)")
    }
    out <- list(shape = shape, scale = scale, rate = rate, mean = mean, var = var, sd = sd)
    # coefficient of variation
    out$cv <- 1/sqrt(shape)
    return(out)
  }
}

Gamma parametrization

gm <- gamma_params(mean = 30, sd = 10)
unlist(gm)
#>       shape       scale        rate        mean         var          sd 
#>   9.0000000   3.3333333   0.3000000  30.0000000 100.0000000  10.0000000 
#>          cv 
#>   0.3333333
y <- rgamma(1000, shape = gm$shape, scale = gm$scale)

GLM in R

It is not always easy to work with link functions. In R we can use the family functions, written as distribution(link = ), to access several useful components. For example:

fam <- binomial(link = "logit")
fam$linkfun # the link function, from probability to eta
fam$linkinv # the inverse link function, from eta to probability

We are going to see the specific arguments later, but this trick works for any family and link, even if you do not remember the specific function or formula.
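A quick check with the logistic link (these components are ordinary functions, equivalent to qlogis() and plogis()):

```r
fam <- binomial(link = "logit")
fam$linkfun(0.5) # from probability to eta (same as qlogis)
#> [1] 0
fam$linkinv(0)   # from eta back to probability (same as plogis)
#> [1] 0.5
```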

References

Agresti, A. (2015). Foundations of linear and generalized linear models. John Wiley & Sons. https://play.google.com/store/books/details?id=jlIqBgAAQBAJ
Agresti, A. (2018). An introduction to categorical data analysis. John Wiley & Sons. https://play.google.com/store/books/details?id=pHZyDwAAQBAJ
Dunn, P. K., & Smyth, G. K. (2018). Generalized linear models with examples in r. Springer. https://play.google.com/store/books/details?id=tBh5DwAAQBAJ
Faraway, J. J. (2016). Extending the linear model with r: Generalized linear, mixed effects and nonparametric regression models, second edition. Chapman; Hall/CRC. https://doi.org/10.1201/9781315382722
Fox, J. (2015). Applied regression analysis and generalized linear models. SAGE Publications. https://play.google.com/store/books/details?id=BlsKogEACAAJ
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press. https://doi.org/10.1017/9781139161879