devtools::load_all() # if using the R project downloaded from the slides
# source("utils-glm.R") # if using a standard setup
library(here)
library(tidyr) # for data manipulation
library(dplyr) # for data manipulation
library(ggplot2) # plotting
library(car) # general utilities
library(effects) # for extracting and plotting effects
library(emmeans) # for marginal means
Lab 5
data("nwords")
dat <- nwords
Overview
The nwords dataset comes from a developmental psychology study investigating the factors that influence children's language development. It includes the number of words produced in a language task and some potential predictors: caregiver behavior, socioeconomic status, the amount of time spent with the child during a week, and whether or not a babysitter is present.
- id: numeric variable representing the child ID number
- timebonding: numeric variable representing the average hours per week of child-parent activities
- nwords: numeric variable representing the number of words produced by the child in a language task
- caregiving: numeric variable representing the caregiver's behavior in a parent-child interaction task, measured on a scale from 0 to 10
- ses: categorical variable representing the family socioeconomic status, with three levels: "low", "middle", and "high"
- babysitter: categorical variable indicating whether a babysitter is present, with two levels: "no" and "yes"
- Importing data and overall check
- think about factors levels, scale of the numerical variables, etc.
- Exploratory data analysis of predictors and the relationships between predictors and the number of words
- Model fitting with glm() and the poisson family, starting from additive models and then adding interactions
- Model diagnostics of the chosen model
- overdispersion
- residuals
- outliers and influential points
- Interpreting the effects and plotting
- Fit a quasi-poisson and negative binomial version of the chosen model, and check how parameters are affected
1. Importing data and overall check
load(here("data/nwords.rda"))
str(dat)
'data.frame': 150 obs. of 6 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ timebonding: num 19 16 32 15 33 10 27 29 26 14 ...
$ caregiving : num 8 8 12 7 9 7 8 11 9 6 ...
$ ses : Factor w/ 3 levels "low","middle",..: 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "contrasts")= num [1:3, 1:2] 0 1 0 0 0 1
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:3] "low" "middle" "high"
.. .. ..$ : chr [1:2] "2" "3"
$ babysitter : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "contrasts")= num [1:2, 1] 0 1
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:2] "no" "yes"
.. .. ..$ : chr "2"
$ nwords : int 18 28 46 24 27 13 16 14 27 20 ...
Check for NA values:
sapply(dat, function(x) sum(is.na(x)))
 id timebonding caregiving ses babysitter nwords
0 0 0 0 0 0
Everything looks good: there are no NA values.
Let’s convert the categorical variables into factors, setting the appropriate level order:
- ses: low, middle, high
- babysitter: no, yes
dat$ses <- factor(dat$ses, levels = c("low", "middle", "high"))
dat$babysitter <- factor(dat$babysitter, levels = c("no", "yes"))
levels(dat$ses)
[1] "low" "middle" "high"
levels(dat$babysitter)
[1] "no" "yes"
timebonding and caregiving are the two numerical predictors. Since we are going to fit main effects and interactions, and we want to interpret the intercept and test the interactions at a meaningful point, we create centered versions of both variables:
dat$timebonding0 <- dat$timebonding - mean(dat$timebonding)
dat$caregiving0 <- dat$caregiving - mean(dat$caregiving)
Then we can see what happens to the variables:
cols <- grepl("timebonding|caregiving", names(dat))
lapply(dat[, cols], function(x) round(c(mean = mean(x), sd = sd(x)), 3))
$timebonding
mean sd
14.980 8.755
$caregiving
mean sd
5.587 2.560
$timebonding0
mean sd
0.000 8.755
$caregiving0
mean sd
0.00 2.56
2. Exploratory data analysis
summary(dat)
 id timebonding caregiving ses babysitter
Min. : 1.00 Min. : 0.00 Min. : 0.000 low :45 no :103
1st Qu.: 38.25 1st Qu.: 8.00 1st Qu.: 4.000 middle:75 yes: 47
Median : 75.50 Median :15.50 Median : 5.000 high :30
Mean : 75.50 Mean :14.98 Mean : 5.587
3rd Qu.:112.75 3rd Qu.:21.75 3rd Qu.: 7.000
Max. :150.00 Max. :36.00 Max. :13.000
nwords timebonding0 caregiving0
Min. : 0.00 Min. :-14.98 Min. :-5.5867
1st Qu.: 9.00 1st Qu.: -6.98 1st Qu.:-1.5867
Median :16.00 Median : 0.52 Median :-0.5867
Mean :18.75 Mean : 0.00 Mean : 0.0000
3rd Qu.:25.00 3rd Qu.: 6.77 3rd Qu.: 1.4133
Max. :58.00 Max. : 21.02 Max. : 7.4133
Let’s do some plotting of predictors:
par(mfrow = c(2,2))
hist(dat$timebonding)
hist(dat$caregiving)
barplot(table(dat$babysitter))
barplot(table(dat$ses))
Comments?
Also the response variable:
hist(dat$nwords, probability = TRUE)
lines(density(dat$nwords), col = "salmon", lwd = 2)
Let's plot the theoretical distributions, Poisson and Gaussian:
m <- mean(dat$nwords)
s <- sd(dat$nwords)
xs <- seq(0, 50, 1)
hist(dat$nwords, probability = TRUE, xlim = c(-10, 50), ylim = c(0, 0.1))
curve(dnorm(x, m, s), add = TRUE, col = "green", lwd = 2)
lines(xs, dpois(xs, m), col = "red", lwd = 2)
Comments?
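A Poisson distribution implies that the variance equals the mean. As a quick extra check (a small addition, not part of the original lab code), we can compare the two for nwords; a ratio well above 1 would already hint at overdispersion:
# quick sanity check: for a Poisson variable, variance ~ mean
mean(dat$nwords)
var(dat$nwords)
var(dat$nwords) / mean(dat$nwords) # ratios well above 1 hint at overdispersion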
Let’s plot some bivariate distributions:
# caregiving ~ timebonding
r <- cor(dat$timebonding, dat$caregiving)
plot(dat$timebonding, dat$caregiving, pch = 19)
abline(lm(dat$caregiving ~ dat$timebonding), col = "red", lwd = 2)
text(30, 2, label = paste("r =", round(r, 2)))
par(mfrow = c(1,2))
boxplot(caregiving ~ ses, data = dat)
boxplot(timebonding ~ ses, data = dat)
mosaicplot(table(dat$ses, dat$babysitter))
# or
barplot(table(dat$babysitter, dat$ses), col = c("red", "green"), beside = TRUE)
legend(7, 45, legend = c("no", "yes"), fill = c("red", "green"))
par(mfrow = c(1,2))
boxplot(caregiving ~ babysitter, data = dat)
boxplot(timebonding ~ babysitter, data = dat)
Then the relationships with the nwords variable:
par(mfrow = c(2,2))
plot(dat$timebonding, dat$nwords, pch = 19)
plot(dat$caregiving, dat$nwords, pch = 19)
boxplot(dat$nwords ~ dat$ses)
boxplot(dat$nwords ~ dat$babysitter)
Comments?
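Since we are heading towards a Poisson model, it can also help to look at the conditional means and variances of nwords within groups. A minimal dplyr sketch (an addition to the lab code, using the packages already loaded):
# group-wise mean and variance of the response
dat |>
  group_by(ses, babysitter) |>
  summarise(mean = mean(nwords), var = var(nwords), n = n(), .groups = "drop")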
Finally, some interaction plots (with lm):
par(mfrow = c(2,1))
colors <- c(low = "red", middle = "blue", high = "green")
plot(dat$timebonding, dat$nwords, col = colors[dat$ses], pch = 19)
# fit a separate lm for each ses level and add the regression lines
lms <- lapply(split(dat, dat$ses), function(x) lm(nwords ~ timebonding, data = x))
invisible(lapply(seq_along(lms), function(i) abline(lms[[i]], col = colors[i], lwd = 2)))
plot(dat$caregiving, dat$nwords, col = colors[dat$ses], pch = 19)
lms <- lapply(split(dat, dat$ses), function(x) lm(nwords ~ caregiving, data = x))
invisible(lapply(seq_along(lms), function(i) abline(lms[[i]], col = colors[i], lwd = 2)))
par(mfrow = c(2,1))
colors <- c(no = "orange", yes = "purple")
plot(dat$timebonding, dat$nwords, col = colors[dat$babysitter], pch = 19)
# fit a separate lm for each babysitter level and add the regression lines
lms <- lapply(split(dat, dat$babysitter), function(x) lm(nwords ~ timebonding, data = x))
invisible(lapply(seq_along(lms), function(i) abline(lms[[i]], col = colors[i], lwd = 2)))
plot(dat$caregiving, dat$nwords, col = colors[dat$babysitter], pch = 19)
lms <- lapply(split(dat, dat$babysitter), function(x) lm(nwords ~ caregiving, data = x))
invisible(lapply(seq_along(lms), function(i) abline(lms[[i]], col = colors[i], lwd = 2)))
3. Model fitting with glm() and poisson
Let’s start by using an additive model:
fit <- glm(nwords ~ timebonding + caregiving + babysitter + ses, family = poisson(link = "log"), data = dat)
summary(fit)
Call:
glm(formula = nwords ~ timebonding + caregiving + babysitter +
ses, family = poisson(link = "log"), data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.143168 0.086143 24.879 < 2e-16 ***
timebonding 0.033816 0.003239 10.439 < 2e-16 ***
caregiving 0.042120 0.012418 3.392 0.000694 ***
babysitteryes 0.022314 0.048813 0.457 0.647577
sesmiddle -0.007068 0.049779 -0.142 0.887094
seshigh -0.221124 0.079552 -2.780 0.005442 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1085.21 on 149 degrees of freedom
Residual deviance: 612.78 on 144 degrees of freedom
AIC: 1307.4
Number of Fisher Scoring iterations: 5
And always plotting before anything else:
plot(allEffects(fit))
Comments? How could you describe the results? Anything different from the descriptive statistics?
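We also loaded emmeans for marginal means. As a sketch of how the effects could be summarised on the response scale (an addition to the lab code, using only functions already loaded):
# multiplicative effects: exponentiated Poisson coefficients
exp(coef(fit))
# estimated marginal means for ses on the response (count) scale
emmeans(fit, ~ ses, type = "response")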
car::residualPlots(fit)
 Test stat Pr(>|Test stat|)
timebonding 5.5386 0.01860 *
caregiving 4.9330 0.02635 *
babysitter
ses
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comments? Are we missing something?
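The outline also lists outliers and influential points among the diagnostics; before moving on to possible interactions, a minimal sketch on the additive model (an addition, using car and base functions) could be:
# studentized residuals vs hat values, point size proportional to Cook's distance
car::influencePlot(fit)
# observations exceeding a common rule-of-thumb Cook's distance cutoff
which(cooks.distance(fit) > 4 / nrow(dat))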
dat |>
ggplot(aes(x = timebonding, y = nwords, color = ses)) +
geom_point() +
stat_smooth(method = "glm", method.args = list(family = poisson()), se = FALSE)
dat |>
ggplot(aes(x = caregiving, y = nwords, color = ses)) +
geom_point() +
stat_smooth(method = "glm", method.args = list(family = poisson()), se = FALSE)
dat |>
ggplot(aes(x = timebonding, y = nwords, color = babysitter)) +
geom_point() +
stat_smooth(method = "glm", method.args = list(family = poisson()), se = FALSE)
dat |>
ggplot(aes(x = caregiving, y = nwords, color = babysitter)) +
geom_point() +
stat_smooth(method = "glm", method.args = list(family = poisson()), se = FALSE)
Let's add some interactions:
fit2 <- glm(nwords ~ timebonding*ses + caregiving*ses + timebonding*babysitter + caregiving*babysitter, family = poisson(link = "log"), data = dat)
summary(fit2)
Call:
glm(formula = nwords ~ timebonding * ses + caregiving * ses +
timebonding * babysitter + caregiving * babysitter, family = poisson(link = "log"),
data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.525633 0.133700 18.890 < 2e-16 ***
timebonding 0.019826 0.005177 3.830 0.000128 ***
sesmiddle -0.613019 0.159444 -3.845 0.000121 ***
seshigh -0.343371 0.211727 -1.622 0.104853
caregiving 0.032736 0.019744 1.658 0.097307 .
babysitteryes -0.056880 0.149660 -0.380 0.703898
timebonding:sesmiddle 0.021714 0.007058 3.077 0.002093 **
timebonding:seshigh 0.020462 0.012628 1.620 0.105147
sesmiddle:caregiving 0.025867 0.027244 0.949 0.342389
seshigh:caregiving -0.066959 0.055576 -1.205 0.228270
timebonding:babysitteryes 0.003147 0.008862 0.355 0.722495
caregiving:babysitteryes -0.000731 0.038348 -0.019 0.984792
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1085.21 on 149 degrees of freedom
Residual deviance: 582.35 on 138 degrees of freedom
AIC: 1289
Number of Fisher Scoring iterations: 5
plot(allEffects(fit2))
4. Model fitting with MASS::glm.nb()
There is still evidence of overdispersion, even after including all predictors and a series of interactions. Assuming this is our most complex model, we need to take the overdispersion into account:
performance::check_overdispersion(fit2)
# Overdispersion test
dispersion ratio = 4.352
Pearson's Chi-Squared = 600.567
p-value = < 0.001
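The dispersion ratio reported above can also be computed by hand, as the Pearson chi-squared statistic divided by the residual degrees of freedom:
# dispersion ratio by hand: should match the value above (about 4.35)
sum(residuals(fit2, type = "pearson")^2) / df.residual(fit2)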
fit3 <- MASS::glm.nb(nwords ~ timebonding*ses + caregiving*ses + timebonding*babysitter + caregiving*babysitter, data = dat)
summary(fit3)
Call:
MASS::glm.nb(formula = nwords ~ timebonding * ses + caregiving *
ses + timebonding * babysitter + caregiving * babysitter,
data = dat, init.theta = 6.186322679, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.471985 0.292300 8.457 <2e-16 ***
timebonding 0.023927 0.011268 2.123 0.0337 *
sesmiddle -0.659979 0.338304 -1.951 0.0511 .
seshigh -0.432890 0.414592 -1.044 0.2964
caregiving 0.028743 0.045294 0.635 0.5257
babysitteryes 0.073044 0.275110 0.266 0.7906
timebonding:sesmiddle 0.020069 0.015103 1.329 0.1839
timebonding:seshigh 0.023178 0.023580 0.983 0.3256
sesmiddle:caregiving 0.040990 0.059430 0.690 0.4904
seshigh:caregiving -0.054526 0.104723 -0.521 0.6026
timebonding:babysitteryes -0.007602 0.017341 -0.438 0.6611
caregiving:babysitteryes 0.009204 0.075962 0.121 0.9036
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(6.1863) family taken to be 1)
Null deviance: 287.77 on 149 degrees of freedom
Residual deviance: 157.42 on 138 degrees of freedom
AIC: 1061.6
Number of Fisher Scoring iterations: 1
Theta: 6.186
Std. Err.: 0.987
2 x log-likelihood: -1035.604
Now overdispersion is taken into account and standard errors are larger:
car::compareCoefs(fit2, fit3)
Calls:
1: glm(formula = nwords ~ timebonding * ses + caregiving * ses +
timebonding * babysitter + caregiving * babysitter, family = poisson(link =
"log"), data = dat)
2: MASS::glm.nb(formula = nwords ~ timebonding * ses + caregiving * ses +
timebonding * babysitter + caregiving * babysitter, data = dat, init.theta =
6.186322679, link = log)
Model 1 Model 2
(Intercept) 2.526 2.472
SE 0.134 0.292
timebonding 0.01983 0.02393
SE 0.00518 0.01127
sesmiddle -0.613 -0.660
SE 0.159 0.338
seshigh -0.343 -0.433
SE 0.212 0.415
caregiving 0.0327 0.0287
SE 0.0197 0.0453
babysitteryes -0.0569 0.0730
SE 0.1497 0.2751
timebonding:sesmiddle 0.02171 0.02007
SE 0.00706 0.01510
timebonding:seshigh 0.0205 0.0232
SE 0.0126 0.0236
sesmiddle:caregiving 0.0259 0.0410
SE 0.0272 0.0594
seshigh:caregiving -0.0670 -0.0545
SE 0.0556 0.1047
timebonding:babysitteryes 0.00315 -0.00760
SE 0.00886 0.01734
caregiving:babysitteryes -0.000731 0.009204
SE 0.038348 0.075962
# compare the test statistics (z values) of the Poisson and negative binomial models
data.frame(
poisson = fit2$coefficients/sqrt(diag(vcov(fit2))),
negative_binomial = fit3$coefficients/sqrt(diag(vcov(fit3)))
)
 poisson negative_binomial
(Intercept) 18.89022175 8.4570188
timebonding 3.82970558 2.1234343
sesmiddle -3.84471431 -1.9508488
seshigh -1.62176628 -1.0441345
caregiving 1.65805200 0.6345780
babysitteryes -0.38006414 0.2655100
timebonding:sesmiddle 3.07673073 1.3287914
timebonding:seshigh 1.62039901 0.9829331
sesmiddle:caregiving 0.94945528 0.6897160
seshigh:caregiving -1.20482778 -0.5206646
timebonding:babysitteryes 0.35512658 -0.4383909
caregiving:babysitteryes -0.01906131 0.1211672
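The outline also asks for a quasi-Poisson version of the chosen model. A minimal sketch (fit4 is just an illustrative name): the point estimates are the same as in the Poisson fit, while the standard errors are inflated by the square root of the estimated dispersion.
# quasi-Poisson version of the most complex model
fit4 <- glm(nwords ~ timebonding*ses + caregiving*ses + timebonding*babysitter + caregiving*babysitter,
            family = quasipoisson(link = "log"), data = dat)
summary(fit4)$dispersion # estimated dispersion, should be close to the ratio above (~4.35)
car::compareCoefs(fit2, fit4) # same estimates as the Poisson model, larger standard errors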