Introduction to Generalized Linear Models

Psychological Sciences

Filippo Gambarota

University of Padova

Last modified: 08-12-2025

Beyond the Gaussian distribution

Quick recap about Gaussian distribution

  • The Gaussian distribution is part of the Exponential family
  • It is defined by the mean (\(\mu\)) and the standard deviation (\(\sigma\))
  • It is symmetric (mean, mode and median are the same)
  • The support is \((- \infty, + \infty)\)

The Probability Density Function (PDF) is:

\[ f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2} \]

Quick recap about the (Gaussian) linear model

(Gaussian) Linear Model

I have some notes about basic stuff and notation for Linear Models.

(Generalized) Linear Model, notation

A quick summary about the notation used in the notes:

  • \(n\) number of observations in the dataset indexed with \(i\) (\(i = 1, 2, \dots, n\))
  • \(y_i\) is the observed response for the \(i\)-th observation while \(\mu_i\) is the predicted value from the model (i.e., no errors/residuals)
  • \(p\) is the number of model coefficients (e.g., \(\beta_1, \dots, \beta_p\)) while \(p' = p + 1\), with the 1 being the intercept \(\beta_0\) (if included). \(\underset{p' \times 1}{\boldsymbol{\beta}}\) is the vector of all coefficients.
  • \(p'\) is also the number of columns of the model matrix, in other terms the number of predictors (considering that a categorical predictor with \(k\) levels creates \(k - 1\) columns). \(\underset{n \times p'}{\mathbf{X}}\) is the model matrix with all predictors.
  • \(x_{ip}\) is the value of the predictor \(x_p\) for the \(i\)-th observation
  • \(\mathbf{y}\) is a vector with all the observations. Each element is \(y_i\) with \(i = 1, 2, \dots, n\)

Understanding distributions

A probability distribution of a random variable \(X\) is the set of probabilities assigned to each possible value \(x\).

We can have:

  • discrete random variables: counts, binary, etc.
  • continuous random variables: time, weight, etc.

Variables and constants

As any other function, the probability function has constants and variables:

\[ f(x \mid \boldsymbol{\theta}) \]

Where \(x\) is a specific value, \(\boldsymbol{\theta}\) is a vector of variables (the parameters) defining a specific function, and \(f(\cdot)\) is a certain function.

Variables and constants, an example

For example, the Gaussian distribution is defined as:

\[ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(x - \mu)^2}{2\sigma^2} \right) \]

Here the variables are \(\mu\) (the mean) and \(\sigma^2\) (the variance) while \(\pi\) is a constant.

Probability mass and density function

An important distinction is about discrete and continuous random variables:

  • For discrete random variables \(f(x)\) becomes \(P(X = x \mid \boldsymbol{\theta})\) because we are assigning a certain probability to each discrete value. The function is called probability mass function (PMF)
  • For continuous random variables we stick with \(f(x \mid \boldsymbol{\theta})\) and the function is called probability density function (PDF)

The PMF returns the probability associated with a certain value of \(x\), while the PDF returns the probability density of a certain value of \(x\).
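
A quick R example of the difference (the specific values are arbitrary):

dbinom(0:2, size = 2, prob = 0.5) # PMF: probabilities, they sum to 1
#> [1] 0.25 0.50 0.25
dnorm(0, mean = 0, sd = 0.1) # PDF: a density, it can even exceed 1
#> [1] 3.989423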

Variables and constants, an example (discrete)

In the case of a Poisson random variable:

\[ p(x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!} \]

Where the only variable is \(\lambda\).
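
As a quick check, the formula matches what dpois() computes (values chosen arbitrarily):

lambda <- 2
x <- 3
lambda^x * exp(-lambda) / factorial(x)
#> [1] 0.180447
dpois(x, lambda)
#> [1] 0.180447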

Moments of the distributions

When we use the distributions, we are usually interested in some properties describing the distribution. These are called moments, defined as quantitative measures related to the shape of the function’s graph. Common moments are:

  • First moment: the expected value (i.e., the mean) of the distribution
  • Second moment: the variance of the distribution
  • Third moment: the skewness
  • Fourth moment: the kurtosis

Moments of the distributions

  • Moments are always defined in the same way, but how we actually compute them from the parameters changes according to the distribution. For example, in the case of the Gaussian distribution the mean and the variance are exactly the parameters defining the distribution.
  • In the Gamma distribution, instead, the parameters are the shape \(\alpha\) and the scale \(\theta\) (or the rate \(\lambda\)). With the shape-rate parametrization the mean is \(\frac{\alpha}{\lambda}\) and the variance is \(\frac{\alpha}{\lambda^2}\) (see the quick check below)
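
For example, we can verify the Gamma formulas by simulation (shape-rate parametrization, arbitrary values):

set.seed(2025)
alpha <- 9    # shape
lambda <- 0.3 # rate
x <- rgamma(1e5, shape = alpha, rate = lambda)
c(mean(x), alpha / lambda)  # both close to 30
c(var(x), alpha / lambda^2) # both close to 100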

Suggestions for distributions

The Wikipedia pages of the probability distributions are very clear and complete; see for example the page for the Gamma distribution.

Why is it important?

In linear and generalized linear models we generally include predictors on the mean (first moment) of the distribution. Knowing how the mean (and the variance) is computed from the parameters is crucial to understand the model.

In addition we can note that:

  • The Gaussian distribution is a very special and ideal case where \(\mu\) and \(\sigma^2\) are directly available as parameters and they are independent
  • Other distributions (e.g., the Gamma) are more complicated because the same parameters contribute to both the mean and the variance (i.e., there is a mean-variance relationship)

Other useful information

In addition to the PDF/PMF we have also:

  • Cumulative Distribution Function (CDF), usually defined as \(F(x) = P(X \leq x)\). Basically it provides the probability of being less than or equal to a certain \(x\) value. In R, this is the p* (e.g., pnorm()) function.
  • Quantile Function (QF), the inverse of the CDF, usually defined as \(Q(p) = F^{-1}(p)\) where \(p\) is the cumulative probability of interest. Basically it returns the \(x\) value (i.e., the quantile) associated with a certain cumulative probability. In R, this is the q* (e.g., qnorm()) function. A quick numeric example follows this list.
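
For example, with the standard Gaussian the two functions are one the inverse of the other:

pnorm(1.96)      # CDF: P(X <= 1.96)
#> [1] 0.9750021
qnorm(0.9750021) # QF: back to the quantile
#> [1] 1.96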

Examples in R, Gaussian PDF

With the Gaussian distribution the PDF is computed using dnorm(x, mean, sd):

library(ggplot2)

x <- seq(-4, 4, 0.01)
d <- dnorm(x, mean = 0, sd = 1) # density of each x value

data.frame(x, d) |> 
  ggplot(aes(x = x, y = d)) +
  geom_line(lwd = 1, col = "dodgerblue") +
  xlab("x") +
  ylab("Density")

Examples in R, Gaussian CDF and QF

library(patchwork) # to combine the two ggplots with +

cp <- pnorm(x, mean = 0, sd = 1) # CDF: cumulative probability for each x (x from the previous chunk)
qf <- qnorm(cp, mean = 0, sd = 1) # QF: back from cumulative probabilities to quantiles

cpp <- data.frame(
  x, cp
) |> 
  ggplot(aes(x = x, y = cp)) +
  geom_line()

qfp <- data.frame(
  x, cp
) |> 
  ggplot(aes(x = cp, y = x)) +
  geom_line()

cpp + qfp

But…

In Psychology, variables do not always satisfy the properties of the Gaussian distribution. For example:

  • Reaction times
  • Percentages or proportions (e.g., task accuracy)
  • Counts
  • Likert scales

Reaction times

Non-negative and probably skewed data:

Binary outcomes

A series of binary (i.e., Bernoulli) trials with two possible outcomes:

Counts

Number of symptoms for a group of patients during the last month:

Should we use a linear model for these variables?

Example 1: passing the exam

Probability of passing the exam as a function of how many hours people study:

#> # A tibble: 10 × 3
#>       id studyh passed
#>    <int>  <dbl>  <int>
#>  1     1     58      0
#>  2     2      8      0
#>  3     3     33      0
#>  4     4     98      0
#>  5     5     40      1
#>  6     6      4      0
#>  7     7     67      1
#>  8     8     37      0
#>  9     9     67      1
#> 10    10     72      1

Example 1: passing the exam

nrow(dat) # number of participants
#> [1] 50
sum(dat$passed) # number of people that have passed the exam
#> [1] 21
nrow(dat) - sum(dat$passed) # number of people that have not passed the exam
#> [1] 29
mean(dat$passed) # proportion of people that have passed the exam
#> [1] 0.42

# study hours and passing the exam

tapply(dat$studyh, dat$passed, mean)
#>        0        1 
#> 31.48276 71.33333
table(dat$passed, cut(dat$studyh, breaks = 4))
#>    
#>     (0.903,25.2] (25.2,49.5] (49.5,73.8] (73.8,98.1]
#>   0           13           9           6           1
#>   1            0           3           8          10
tapply(dat$passed, cut(dat$studyh, breaks = 4), mean)
#> (0.903,25.2]  (25.2,49.5]  (49.5,73.8]  (73.8,98.1] 
#>    0.0000000    0.2500000    0.5714286    0.9090909

Example 1: passing the exam

Let’s plot the data:

Example 1: passing the exam, fitting a linear model

Let’s fit a linear model passed ~ studyh using lm():
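
A minimal sketch of this model, assuming the data shown above are stored in dat:

fit_lm <- lm(passed ~ studyh, data = dat)
# with a linear model the predicted probabilities can fall outside [0, 1]
predict(fit_lm, newdata = data.frame(studyh = c(0, 100)))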

Do you see something strange?

Example 1: passing the exam

A little spoiler: the relationship should probably look like this:

Example 2: number of correct exercises

Another example, the number of solved exercises in a semester as a function of the number of attended lectures (\(N = 100\)):

#> # A tibble: 10 × 3
#>       id nattended nsolved
#>    <int>     <dbl>   <int>
#>  1     1         5       1
#>  2     2        54      14
#>  3     3        51      16
#>  4     4        58      22
#>  5     5        38       5
#>  6     6        38       5
#>  7     7        15       1
#>  8     8        31       6
#>  9     9        47      13
#> 10    10        38       8

Example 2: let’s see the distribution

Example 2: fitting a linear model

Again, fitting the linear model seems partially appropriate but there are some problems:

Should we use a linear model for these variables?

Also the residuals are quite strange:

Should we use a linear model for these variables?

Another little spoiler: the model should consider both the support of the \(y\) variable and the non-linear pattern. Probably something like this:
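
For example, a sketch of the kind of model we will see later (a Poisson GLM), assuming the data shown above are stored in dat:

fit_pois <- glm(nsolved ~ nattended, data = dat, family = poisson(link = "log"))
# predictions are non-negative and increase non-linearly with the predictor
predict(fit_pois, newdata = data.frame(nattended = c(0, 30, 60)), type = "response")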

So what?

Both linear models somehow capture the expected relationship but there are serious fitting problems:

  • impossible predictions
  • poor fitting for non-linear patterns
  • linear regression assumptions not respected

As a general rule in statistics (and in life):

All models are wrong, some are useful

We need a new class of models…

  • taking into account the specific features of our response variable
  • using the concepts of linear models as much as possible
  • better approximating the (hypothetical) true data generation process

Generalized Linear Models

Main references

The first book ever:

Main references

For a detailed introduction about GLMs

Chapters: 1 (intro), 4 (GLM fitting), 5 (GLM for binary data)

Main references

For a basic and well written introduction about GLM, especially the Binomial GLM

Chapters: 3 (intro GLMs), 4-5 (Binomial Logistic Regression)

Main references

Great resource for interpreting Binomial GLM parameters:

Chapters: 13-14 (Binomial Logistic GLM), 15 (Poisson and other GLMs)

Main references

Detailed GLMs book. Very useful especially for the diagnostic part:

Chapters: 8 (intro), 9 (Binomial GLM), 10 (Poisson GLM and overdispersion)

Main references

The holy book :)

Chapters: 14 and 15

Main references

Another good reference…

Chapters: 8

General idea

  • using distributions beyond the Gaussian
  • modeling non-linear functions on the response scale
  • taking into account mean-variance relationships

Recipe for a GLM

  • Random Component
  • Systematic Component
  • Link Function

Random Component

The random component of a GLM identifies the response variable \(y\) as coming from a certain probability distribution.

Systematic Component

The systematic component or linear predictor (\(\eta\)) of a GLM is:

\[ \eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \]

Or in matrix notation:

\[ \boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} \]

Basically it describes how the expected value (i.e., the mean, the first moment) \(\mu\) of the chosen distribution (the random component) varies according to the predictors.
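
In R, the model matrix \(\mathbf{X}\) and the linear predictor \(\boldsymbol{\eta}\) can be computed directly. A small sketch with made-up predictors and coefficients:

x1 <- c(1, 2, 3)
x2 <- c(0, 1, 0)
X <- model.matrix(~ x1 + x2) # n x p' model matrix (intercept included)
beta <- c(0.5, 2, -1)        # beta_0, beta_1, beta_2
eta <- X %*% beta            # linear predictor, one value per observation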

Relevant distributions

Bernoulli distribution

A single Bernoulli trial is defined as:

\[ f(x, p) = p^x (1 - p)^{1 - x} \]

Where \(p\) is the probability of success and \(x\) takes the two possible values \(0\) (failure) and \(1\) (success). The mean is \(p\) and the variance is \(p(1 - p)\).

Binomial distribution

The probability of having \(k\) successes (e.g., 0, 1, 2, etc.) out of \(n\) trials, each with probability of success \(p\) (and of failure \(q = 1 - p\)), is:

\[ f(n, k, p)= \binom{n}{k} p^k(1 - p)^{n - k} \]

The mean of the binomial distribution is \(np\) and the variance is \(np(1 - p)\). The binomial distribution is just the repetition of \(n\) independent Bernoulli trials.

Bernoulli and Binomial

A classical example for a Bernoulli trial is the coin flip. In R:

n <- 1
p <- 0.7
rbinom(1, n, p) # a single Bernoulli trial
#> [1] 1
n <- 10
rbinom(10, 1, p) # n Bernoulli trials
#>  [1] 1 1 1 1 1 1 1 0 1 1
rbinom(1, n, p) # binomial version: number of successes out of n trials
#> [1] 8

Binomial

Binomial in R

# generate k (success)
n <- 50 # number of trials
p <- 0.7 # probability of success
rbinom(1, n, p)
#> [1] 31

# let's do several experiments (e.g., participants)
rbinom(10, n, p)
#>  [1] 41 36 33 31 31 36 32 39 34 33

# calculate the probability of exactly k successes (PMF)
n <- 50
k <- 25
p <- 0.5
dbinom(k, n, p)
#> [1] 0.1122752

# calculate the probability of doing 0, 1, 2, up to k successes
n <- 50
k <- 25
p <- 0.5
pbinom(k, n, p)
#> [1] 0.5561376

Binomial GLM

The Bernoulli distribution is used as the random component when we have a binary dependent variable:

  • Accuracy on a cognitive task
  • Patients recovered or not after a treatment
  • People passing or not the exam

More generally, the Bernoulli or the Binomial distribution can be used as the random component both when we have a binary dependent variable and when we model the number of successes out of the total number of trials.

When fitting a GLM with the binomial distribution we are including linear predictors on the expected value \(\mu\), i.e., the probability of success.
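
For example, a sketch of a Binomial GLM for the exam example from before, assuming the data are stored in dat:

fit_glm <- glm(passed ~ studyh, data = dat, family = binomial(link = "logit"))
# predictions on the response (probability) scale are always between 0 and 1
predict(fit_glm, newdata = data.frame(studyh = c(0, 50, 100)), type = "response")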

Binomial GLM

Most of the GLM models deal with a mean-variance relationship:

Poisson distribution

The probability of observing \(k\) events during a fixed time interval (e.g., the number of new users on a website in 1 week) is:

\[\begin{align*} f(k,\lambda) = Pr(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \end{align*}\]

Where \(k\) is the number of occurrences (\(k = 0, 1, 2, ...\)), \(e\) is Euler’s number (\(e = 2.71828...\)) and \(!\) is the factorial function. The mean and the variance of the Poisson distribution are both \(\lambda\).

Poisson distribution

As \(\lambda\) increases, the distribution is well approximated by a Gaussian distribution, but the Poisson is discrete.

Poisson distribution in R

n <- 1000
x <- rpois(n, lambda = 30)

mean(x)
#> [1] 30.234
var(x)
#> [1] 31.11236

We can also use all the other functions such as dpois(), qpois() and ppois().

Poisson distribution in R

The mean-variance relationship can be easily seen with a continuous predictor:
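
A sketch of the kind of simulation behind this type of plot (made-up coefficients):

set.seed(123)
x <- runif(200, 0, 10)
lambda <- exp(0.5 + 0.3 * x) # with a log link the mean increases with x
y <- rpois(200, lambda)
plot(x, y) # the spread of y increases together with its mean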

Gamma distribution

The Gamma distribution has several parametrizations. One of the most common is the shape-scale parametrization:

\[ f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\theta^{k}\Gamma(k)} \]

Where \(\theta\) is the scale parameter and \(k\) is the shape parameter.

Gamma distribution

ggamma(mean = c(10, 20, 30), sd = c(10, 10, 10), show = "ss")

Gamma \(\mu\) and \(\sigma^2\)

The mean and variance are defined as:

  • \(\mu = k \theta\) and \(\sigma^2 = k \theta^2\) with the shape-scale parametrization
  • \(\mu = \frac{\alpha}{\beta}\) and \(\sigma^2 = \frac{\alpha}{\beta^2}\) with the shape-rate parametrization

Another important quantity is the coefficient of variation defined as \(\frac{\sigma}{\mu}\) or \(\frac{1}{\sqrt{k}}\) (or \(\frac{1}{\sqrt{\alpha}}\)).

The Gamma distribution is quite complex; I have some notes with more insights about the parametrization.

Gamma distribution

Again, we can see the mean-variance relationship:

ggamma(shape = c(5, 5), scale = c(10, 20), show = "ss")

Gamma parametrization

To convert between different parametrizations, you can use the gamma_params() function:

Gamma parametrization

gm <- gamma_params(mean = 30, sd = 10)
unlist(gm)
#>       shape       scale        rate        mean         var          sd 
#>   9.0000000   3.3333333   0.3000000  30.0000000 100.0000000  10.0000000 
#>          cv 
#>   0.3333333
y <- rgamma(1000, shape = gm$shape, scale = gm$scale)
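
The conversion relies on the standard Gamma identities (\(\mu = k\theta\), \(\sigma^2 = k\theta^2\)); a quick sketch assuming the mean and the standard deviation as input:

mu <- 30
sigma <- 10
shape <- mu^2 / sigma^2 # 9
scale <- sigma^2 / mu   # 3.33
rate <- 1 / scale       # 0.3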

GLM in R

It is not always easy to work with link functions. In R we can use the family function, written as distribution(link = ), to obtain several useful pieces of information. For example:

fam <- binomial(link = "logit")
fam$linkfun # link function, from probability to eta
fam$linkinv # inverse of the link function, from eta to probability

We are going to see the specific arguments later, but this trick works for any family and link, even if you do not remember the specific function or formula.
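
For example, with the logit link:

fam <- binomial(link = "logit")
fam$linkfun(0.7) # from probability to eta (log-odds)
#> [1] 0.8472979
fam$linkinv(0.8472979) # from eta back to probability
#> [1] 0.7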

Intuitive way of writing models

Dunn & Smyth (2018) use a nice way of presenting models that could be useful: they first specify the distribution of the random component and then the parameter(s) on which the systematic component is included.

For example, a Generalized Linear Model with the Gaussian family and identity link could be written as:

\[ \begin{aligned} y_i &\sim \mathrm{N}(\mu_i, \sigma^2) \quad && \text{Random component} \\ \mu_i &= g^{-1}(\eta_i) \quad && \text{Inverse of the link function} \\ \eta_i &= \beta_0 + \beta_1 x_{1i} + \beta_2x_{2i} \quad && \text{Systematic component} \\ \end{aligned} \]

The \(i\) subscript indicates that the element varies for each observation \(i\) according to the linear predictor \(\eta_i\), after taking the inverse of the link function.
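
This notation maps directly onto a simulation. A minimal sketch with made-up values and the identity link:

set.seed(2025)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
beta0 <- 1; beta1 <- 2; beta2 <- -1; sigma <- 0.5
eta <- beta0 + beta1 * x1 + beta2 * x2 # systematic component
mu <- eta                              # identity link: g^-1 is the identity
y <- rnorm(n, mean = mu, sd = sigma)   # random component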

Assumptions

  • Lack of outliers: All responses were generated from the same process, so that the same model is appropriate for all the observations.
  • Link function: The correct link function \(g(\cdot)\) is used.
  • Linearity: All important explanatory variables are included, and each explanatory variable is included in the linear predictor on the correct scale.
  • Variance function: The correct variance function \(\mathrm{var}(\mu)\) is used. This is the mean-variance relationship.
  • Dispersion parameter: The dispersion parameter is constant. This is a subtle but important distinction: the variance is not constant in GLMs but the dispersion parameter is. In Gaussian linear models the dispersion is the same as the variance.
  • Independence: The responses \(y_i\) are independent of each other. This is something related to the data collection and data structure
  • Distribution: The responses \(y_i\) come from the specified distribution.

References

Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. John Wiley & Sons.
Agresti, A. (2018). An Introduction to Categorical Data Analysis. John Wiley & Sons.
Dunn, P. K., & Smyth, G. K. (2018). Generalized Linear Models With Examples in R. Springer.
Faraway, J. J. (2016). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models (2nd ed.). Chapman & Hall/CRC. https://doi.org/10.1201/9781315382722
Fox, J. (2015). Applied Regression Analysis and Generalized Linear Models. SAGE Publications.
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge University Press. https://doi.org/10.1017/9781139161879
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman & Hall/CRC. https://doi.org/10.1007/978-1-4899-3242-6