Lab 5

Author
Affiliation

Filippo Gambarota

University of Padova

Published

12, 8, 2025

devtools::load_all() # if using the rproject dowloaded from the slides
# source("utils-glm.R") # if using a standard setup
library(here)
library(tidyr) # for data manipulation
library(dplyr) # for data manipulation
library(ggplot2) # plotting
library(car) # general utilities
library(effects) # for extracting and plotting effects 
library(emmeans) # for marginal means
data("nwords")
dat <- nwords

Overview

This dataset nwords represent a developmental psychology study investigating the factors that influence children language development. The dataset includes information about the number of words in a a language task, and some potential predictors: caregiver behavior, socioeconomic status, amount of time spent with the child during a week and the presence of a baby-sitter or not.

  • child: numeric variable representing the child ID number.
  • timebonding: numeric variable representing the average hours per week of child-parent activities
  • nwords: numeric variable representing the number of words produced by the child in a language task.
  • caregiving: numeric variable representing the caregiver’s behavior in a parent-child interaction task, measured on a scale from 0 to 10
  • ses: categorical variable representing the family socioeconomic status, with three levels: “Low”, “Middle”, and “High”
  1. Importing data and overall check
    • think about factors levels, scale of the numerical variables, etc.
  2. Exploratory data analysis of predictors and the relationships between predictors and the number of words
  3. Model fitting with glm() and poisson family starting from additive models and then adding the interactions.
  4. Model diagnostic of the chosen model
    • overdispersion
    • residuals
    • outliers and influential points
  5. Interpreting the effects and plotting
  6. Fit a quasi-poisson and negative binomial version of the chosen model, and check how parameters are affected

1. Importing data and overall check

load(here("data/nwords.rda"))
str(dat)
'data.frame':   150 obs. of  6 variables:
 $ id         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ timebonding: num  19 16 32 15 33 10 27 29 26 14 ...
 $ caregiving : num  8 8 12 7 9 7 8 11 9 6 ...
 $ ses        : Factor w/ 3 levels "low","middle",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "contrasts")= num [1:3, 1:2] 0 1 0 0 0 1
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:3] "low" "middle" "high"
  .. .. ..$ : chr [1:2] "2" "3"
 $ babysitter : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "contrasts")= num [1:2, 1] 0 1
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2] "no" "yes"
  .. .. ..$ : chr "2"
 $ nwords     : int  18 28 46 24 27 13 16 14 27 20 ...

Check for NA values:

sapply(dat, function(x) sum(is.na(x)))
         id timebonding  caregiving         ses  babysitter      nwords 
          0           0           0           0           0           0 

Everything seems good, we do not have NA values.

Let’s convert categorical variables into factor setting the appropriate order:

  • ses: low, middle, high
  • babysitter: no, yes
dat$ses <- factor(dat$ses, levels = c("low", "middle", "high"))
dat$babysitter <- factor(dat$babysitter, levels = c("no", "yes"))

levels(dat$ses)
[1] "low"    "middle" "high"  
levels(dat$babysitter)
[1] "no"  "yes"

timebonding and caregiving are the two numerical predictors. Given that we are going to fit main effects and interactions and we want to interpret the intercept and test the interaction on a meaningful point, we create two centered versions of these variables:

dat$timebonding0 <- dat$timebonding - mean(dat$timebonding)
dat$caregiving0 <- dat$caregiving - mean(dat$caregiving)

Then we can see what happen to the variables:

cols <- grepl("timebonding|caregiving", names(dat))
lapply(dat[, cols], function(x) round(c(mean = mean(x), sd = sd(x)), 3))
$timebonding
  mean     sd 
14.980  8.755 

$caregiving
 mean    sd 
5.587 2.560 

$timebonding0
 mean    sd 
0.000 8.755 

$caregiving0
mean   sd 
0.00 2.56 

2. Exploratory data analysis

summary(dat)
       id          timebonding      caregiving         ses     babysitter
 Min.   :  1.00   Min.   : 0.00   Min.   : 0.000   low   :45   no :103   
 1st Qu.: 38.25   1st Qu.: 8.00   1st Qu.: 4.000   middle:75   yes: 47   
 Median : 75.50   Median :15.50   Median : 5.000   high  :30             
 Mean   : 75.50   Mean   :14.98   Mean   : 5.587                         
 3rd Qu.:112.75   3rd Qu.:21.75   3rd Qu.: 7.000                         
 Max.   :150.00   Max.   :36.00   Max.   :13.000                         
     nwords       timebonding0     caregiving0     
 Min.   : 0.00   Min.   :-14.98   Min.   :-5.5867  
 1st Qu.: 9.00   1st Qu.: -6.98   1st Qu.:-1.5867  
 Median :16.00   Median :  0.52   Median :-0.5867  
 Mean   :18.75   Mean   :  0.00   Mean   : 0.0000  
 3rd Qu.:25.00   3rd Qu.:  6.77   3rd Qu.: 1.4133  
 Max.   :58.00   Max.   : 21.02   Max.   : 7.4133  

Let’s do some plotting of predictors:

par(mfrow = c(2,2))
hist(dat$timebonding)
hist(dat$caregiving)
barplot(table(dat$babysitter))
barplot(table(dat$ses))

Comments?

Also the response variable:

hist(dat$nwords, probability = TRUE)
lines(density(dat$nwords), col = "salmon", lwd = 2)

Let’s plot the theoretical distributions, Poisson and Gaussian:

m <- mean(dat$nwords)
s <- sd(dat$nwords)
xs <- seq(0, 50, 1)

hist(dat$nwords, probability = TRUE, xlim = c(-10, 50), ylim = c(0, 0.1))
curve(dnorm(x, m, s), add = TRUE, col = "green", lwd = 2)
lines(xs, dpois(xs, m), col = "red", lwd = 2)
lines(xs, dpois(xs, m), col = "red", lwd = 2)

Comments?

Let’s plot some bivariate distributions:

# caregiving ~ timebonding
r <- cor(dat$timebonding, dat$caregiving)
plot(dat$timebonding, dat$caregiving, pch = 19)
abline(lm(dat$caregiving ~ dat$timebonding), col = "red", lwd = 2)
text(30, 2, label = paste("r =", round(r, 2)))

par(mfrow = c(1,2))
boxplot(caregiving ~ ses, data = dat)
boxplot(timebonding ~ ses, data = dat)

mosaicplot(table(dat$ses, dat$babysitter))

# or
barplot(table(dat$babysitter, dat$ses), col = c("red", "green"), beside = TRUE)
legend(7,45, legend = c("no", "yes"), fill = c("red", "green"))

par(mfrow = c(1,2))
boxplot(caregiving ~ babysitter, data = dat)
boxplot(timebonding ~ babysitter, data = dat)

Then the relationships with the nwords variable:

par(mfrow = c(2,2))
plot(dat$timebonding, dat$nwords, pch = 19)
plot(dat$caregiving, dat$nwords, pch = 19)
boxplot(dat$nwords ~ dat$ses)
boxplot(dat$nwords ~ dat$babysitter)

Comments?

Finally some interactions plot (with lm):

par(mfrow = c(2,1))
colors <- c(low = "red", middle = "blue", high = "green")
plot(dat$timebonding, dat$nwords, col = colors[dat$ses], pch = 19)
lms <- lapply(split(dat, dat$ses), function(x) lm(nwords ~ timebonding, data = x))
lapply(1:length(lms), function(i) abline(lms[[i]], col = colors[i], lwd = 2))
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL
plot(dat$caregiving, dat$nwords, col = colors[dat$ses], pch = 19)
lms <- lapply(split(dat, dat$ses), function(x) lm(nwords ~ caregiving, data = x))
lapply(1:length(lms), function(i) abline(lms[[i]], col = colors[i], lwd = 2))

[[1]]
NULL

[[2]]
NULL

[[3]]
NULL
par(mfrow = c(2,1))
colors <- c(no = "orange", yes = "purple")

plot(dat$timebonding, dat$nwords, col = colors[dat$babysitter], pch = 19)
lms <- lapply(split(dat, dat$babysitter), function(x) lm(nwords ~ timebonding, data = x))
lapply(1:length(lms), function(i) abline(lms[[i]], col = colors[i], lwd = 2))
[[1]]
NULL

[[2]]
NULL
plot(dat$caregiving, dat$nwords, col = colors[dat$babysitter], pch = 19)
lms <- lapply(split(dat, dat$babysitter), function(x) lm(nwords ~ caregiving, data = x))
lapply(1:length(lms), function(i) abline(lms[[i]], col = colors[i], lwd = 2))

[[1]]
NULL

[[2]]
NULL

3. Model fitting with glm() and poisson

Let’s start by using an additive model:

fit <- glm(nwords ~ timebonding + caregiving + babysitter + ses, family = poisson(link = "log"), data = dat)
summary(fit)

Call:
glm(formula = nwords ~ timebonding + caregiving + babysitter + 
    ses, family = poisson(link = "log"), data = dat)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.143168   0.086143  24.879  < 2e-16 ***
timebonding    0.033816   0.003239  10.439  < 2e-16 ***
caregiving     0.042120   0.012418   3.392 0.000694 ***
babysitteryes  0.022314   0.048813   0.457 0.647577    
sesmiddle     -0.007068   0.049779  -0.142 0.887094    
seshigh       -0.221124   0.079552  -2.780 0.005442 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1085.21  on 149  degrees of freedom
Residual deviance:  612.78  on 144  degrees of freedom
AIC: 1307.4

Number of Fisher Scoring iterations: 5

And always plotting before anything else:

plot(allEffects(fit))

Comments? How could you describe the results? Something different from the descriptive statistics?

car::residualPlots(fit)

            Test stat Pr(>|Test stat|)  
timebonding    5.5386          0.01860 *
caregiving     4.9330          0.02635 *
babysitter                              
ses                                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comments? Are we missing something?

dat |> 
    ggplot(aes(x = timebonding, y = nwords, color = ses)) +
    geom_point() +
    stat_smooth(method = "glm", method.args = list(family = poisson()), se = FALSE)

dat |> 
    ggplot(aes(x = caregiving, y = nwords, color = ses)) +
    geom_point() +
    stat_smooth(method = "glm", method.args = list(family = poisson()), se = FALSE)

dat |> 
    ggplot(aes(x = timebonding, y = nwords, color = babysitter)) +
    geom_point() +
    stat_smooth(method = "glm", method.args = list(family = poisson()), se = FALSE)

dat |> 
    ggplot(aes(x = caregiving, y = nwords, color = babysitter)) +
    geom_point() +
    stat_smooth(method = "glm", method.args = list(family = poisson()), se = FALSE)

Let’s add some interactions:

fit2 <- glm(nwords ~ timebonding*ses + caregiving*ses + timebonding*babysitter + caregiving*babysitter, family = poisson(link = "log"), data = dat)
summary(fit2)

Call:
glm(formula = nwords ~ timebonding * ses + caregiving * ses + 
    timebonding * babysitter + caregiving * babysitter, family = poisson(link = "log"), 
    data = dat)

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                2.525633   0.133700  18.890  < 2e-16 ***
timebonding                0.019826   0.005177   3.830 0.000128 ***
sesmiddle                 -0.613019   0.159444  -3.845 0.000121 ***
seshigh                   -0.343371   0.211727  -1.622 0.104853    
caregiving                 0.032736   0.019744   1.658 0.097307 .  
babysitteryes             -0.056880   0.149660  -0.380 0.703898    
timebonding:sesmiddle      0.021714   0.007058   3.077 0.002093 ** 
timebonding:seshigh        0.020462   0.012628   1.620 0.105147    
sesmiddle:caregiving       0.025867   0.027244   0.949 0.342389    
seshigh:caregiving        -0.066959   0.055576  -1.205 0.228270    
timebonding:babysitteryes  0.003147   0.008862   0.355 0.722495    
caregiving:babysitteryes  -0.000731   0.038348  -0.019 0.984792    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1085.21  on 149  degrees of freedom
Residual deviance:  582.35  on 138  degrees of freedom
AIC: 1289

Number of Fisher Scoring iterations: 5
plot(allEffects(fit2))

4. Model fitting with MASS::glm.nb()

There is still evidence for overdispersion, even after including all predictors and a series of interactions. Let’s assume that this is our most complex model, we need to take into account the overdispersion:

performance::check_overdispersion(fit2)
# Overdispersion test

       dispersion ratio =   4.352
  Pearson's Chi-Squared = 600.567
                p-value = < 0.001
fit3 <- MASS::glm.nb(nwords ~ timebonding*ses + caregiving*ses + timebonding*babysitter + caregiving*babysitter, data = dat)

summary(fit3)

Call:
MASS::glm.nb(formula = nwords ~ timebonding * ses + caregiving * 
    ses + timebonding * babysitter + caregiving * babysitter, 
    data = dat, init.theta = 6.186322679, link = log)

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                2.471985   0.292300   8.457   <2e-16 ***
timebonding                0.023927   0.011268   2.123   0.0337 *  
sesmiddle                 -0.659979   0.338304  -1.951   0.0511 .  
seshigh                   -0.432890   0.414592  -1.044   0.2964    
caregiving                 0.028743   0.045294   0.635   0.5257    
babysitteryes              0.073044   0.275110   0.266   0.7906    
timebonding:sesmiddle      0.020069   0.015103   1.329   0.1839    
timebonding:seshigh        0.023178   0.023580   0.983   0.3256    
sesmiddle:caregiving       0.040990   0.059430   0.690   0.4904    
seshigh:caregiving        -0.054526   0.104723  -0.521   0.6026    
timebonding:babysitteryes -0.007602   0.017341  -0.438   0.6611    
caregiving:babysitteryes   0.009204   0.075962   0.121   0.9036    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(6.1863) family taken to be 1)

    Null deviance: 287.77  on 149  degrees of freedom
Residual deviance: 157.42  on 138  degrees of freedom
AIC: 1061.6

Number of Fisher Scoring iterations: 1

              Theta:  6.186 
          Std. Err.:  0.987 

 2 x log-likelihood:  -1035.604 

Now overdispersion is taken into account and standard errors are larger:

car::compareCoefs(fit2, fit3)
Calls:
1: glm(formula = nwords ~ timebonding * ses + caregiving * ses + 
  timebonding * babysitter + caregiving * babysitter, family = poisson(link = 
  "log"), data = dat)
2: MASS::glm.nb(formula = nwords ~ timebonding * ses + caregiving * ses + 
  timebonding * babysitter + caregiving * babysitter, data = dat, init.theta =
   6.186322679, link = log)

                            Model 1   Model 2
(Intercept)                   2.526     2.472
SE                            0.134     0.292
                                             
timebonding                 0.01983   0.02393
SE                          0.00518   0.01127
                                             
sesmiddle                    -0.613    -0.660
SE                            0.159     0.338
                                             
seshigh                      -0.343    -0.433
SE                            0.212     0.415
                                             
caregiving                   0.0327    0.0287
SE                           0.0197    0.0453
                                             
babysitteryes               -0.0569    0.0730
SE                           0.1497    0.2751
                                             
timebonding:sesmiddle       0.02171   0.02007
SE                          0.00706   0.01510
                                             
timebonding:seshigh          0.0205    0.0232
SE                           0.0126    0.0236
                                             
sesmiddle:caregiving         0.0259    0.0410
SE                           0.0272    0.0594
                                             
seshigh:caregiving          -0.0670   -0.0545
SE                           0.0556    0.1047
                                             
timebonding:babysitteryes   0.00315  -0.00760
SE                          0.00886   0.01734
                                             
caregiving:babysitteryes  -0.000731  0.009204
SE                         0.038348  0.075962
                                             
# test statistics for the poisson model

data.frame(
    poisson = fit2$coefficients/sqrt(diag(vcov(fit2))),
    negative_binomial = fit3$coefficients/sqrt(diag(vcov(fit3)))
)
                              poisson negative_binomial
(Intercept)               18.89022175         8.4570188
timebonding                3.82970558         2.1234343
sesmiddle                 -3.84471431        -1.9508488
seshigh                   -1.62176628        -1.0441345
caregiving                 1.65805200         0.6345780
babysitteryes             -0.38006414         0.2655100
timebonding:sesmiddle      3.07673073         1.3287914
timebonding:seshigh        1.62039901         0.9829331
sesmiddle:caregiving       0.94945528         0.6897160
seshigh:caregiving        -1.20482778        -0.5206646
timebonding:babysitteryes  0.35512658        -0.4383909
caregiving:babysitteryes  -0.01906131         0.1211672