---
title: "Statistical Methods and Data Analysis in Developmental Psychology"
subtitle: "Lab 10"
author: "Filippo Gambarota"
output: 
    html_document:
        code_folding: show
        toc: true
        toc_float: true
        code_download: true
date: "Updated on `r Sys.Date()`"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      dev = "svg")
```

```{r packages, message=FALSE, warning=FALSE}
devtools::load_all() # if using the R project downloaded from the slides
# source("utils-glm.R") # if using a standard setup
library(here)
library(tidyr) # for data manipulation
library(dplyr) # for data manipulation
library(ggplot2) # plotting
library(car) # general utilities
library(effects) # for extracting and plotting effects 
library(emmeans) # for marginal means
```

```{r options, include = FALSE}
theme_set(theme_minimal(base_size = 15))
```

```{r script, echo=FALSE}
## Link in Github repo
downloadthis::download_link(
  link = download_link("https://github.com/stat-teaching/SMDA-2023/blob/master/labs/lab10.R"),
  button_label = "Download the R script",
  button_type = "danger",
  has_icon = TRUE,
  icon = "fa fa-save",
  self_contained = FALSE
)
```

```{r data, echo=FALSE}
downloadthis::download_file(path = here("data", "tantrums.csv"))
```

```{r loading-data}
dat <- read.csv(here("data", "tantrums.csv"))
```

# Overview

The dataset `tantrums.csv` contains the number of tantrums of `r nrow(dat)` toddlers during two days at the nursery. The columns are:

- `id`: identifier for the child
- `temperament`: the temperament of the child as "easy" or "difficult"
- `attachment`: the attachment of the child as "secure" or "insecure"
- `parent_se`: an average self-esteem value of the parents (self report)
- `parent_skills`: a score representing the teacher judgment about parenting skills
- `tantrum`: the number of tantrums

We want to predict the number of tantrums as a function of these predictors.

1. Importing data and checks
    - remove the children with `NA` values
    - convert the categorical variables to factors, with "difficult" and "insecure" as the reference levels
2. Exploratory data analysis
3. Model fitting with `glm()`
4. Diagnostics
5. Interpreting parameters
6. Model selection
7. What about interactions?

# 1. Importing data and checks

Firstly we import the data into R:

```{r eval = FALSE}
dat <- read.csv("data/tantrums.csv")
```

Check the structure:

```{r}
str(dat)
```

Check for `NA`:

```{r}
sapply(dat, function(x) sum(is.na(x)))
```

So we have some `NA` values. We handle them according to the instructions, keeping only the complete cases:

```{r}
dat <- dat[complete.cases(dat), ]
dat$id <- 1:nrow(dat) # restore the id
rownames(dat) <- NULL
```
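
Equivalently, since `tidyr` is already loaded, `drop_na()` does the same job (a sketch, not evaluated here):

```{r, eval = FALSE}
# Drop every row containing at least one NA, as above
dat <- drop_na(dat)
```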

Let's convert the categorical variables into factors with the appropriate reference levels:

```{r}
dat$temperament <- factor(dat$temperament, levels = c("difficult", "easy"))
dat$temperament[1:5]

dat$attachment <- factor(dat$attachment, levels = c("insecure", "secure"))
dat$attachment[1:5]
```
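
To double-check the coding, the reference level is the one that gets 0 in the dummy variable; `contrasts()` shows this directly:

```{r}
# The reference level is the row of zeros in the contrast matrix
contrasts(dat$temperament)
contrasts(dat$attachment)
```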

# 2. Exploratory data analysis

Let's compute some summary statistics and plots.

```{r}
summary(dat)
```

```{r}
table(dat$temperament)
table(dat$attachment)
table(dat$attachment, dat$temperament)
```

```{r}
par(mfrow = c(1,3))
hist(dat$parent_se)
hist(dat$parent_skills)
hist(dat$tantrum)
```

Let's look at some bivariate relationships:

```{r}
plot(dat$parent_se, dat$tantrum, pch = 19)
plot(dat$parent_skills, dat$tantrum, pch = 19)
```

```{r}
boxplot(tantrum ~ temperament, data = dat)
boxplot(tantrum ~ attachment, data = dat)
```
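
Since `dplyr` is loaded, we can also compare the mean and the variance of the counts by group. For a Poisson variable these should be roughly equal, so a much larger variance is an early hint of overdispersion. A minimal sketch:

```{r}
dat %>%
  group_by(temperament) %>%
  summarise(mean = mean(tantrum), var = var(tantrum))
```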

# 3. Model fitting with `glm()`

We can start by fitting our null model with the `poisson()` family:

```{r}
fit0 <- glm(tantrum ~ 1, family = poisson(link = "log"), data = dat)
```

What is the intercept here?
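
Hint: with a log link, the intercept of the null model is the log of the mean count, so back-transforming it should recover the raw mean:

```{r}
exp(coef(fit0))   # back-transformed intercept
mean(dat$tantrum) # raw mean of the counts
```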

Then we can fit a model with the `parent_se` effect:

```{r}
fit1 <- glm(tantrum ~ parent_se, family = poisson(link = "log"), data = dat)
summary(fit1)
```

What about the overdispersion? What could be the reason?

Assuming that `parent_se` is the only variable that we have, we can estimate the degree of overdispersion:

```{r}
sum(residuals(fit1, type = "pearson")^2)/fit1$df.residual
performance::check_overdispersion(fit1)
```
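
The statistic computed above is the Pearson chi-squared statistic divided by the residual degrees of freedom:

$$\hat{\phi} = \frac{\sum_{i=1}^{n} r_{P,i}^2}{n - p}, \qquad r_{P,i} = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}$$

Under a correctly specified Poisson model $\hat{\phi} \approx 1$; here it is much larger.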

Let's also have a look at the residual plot:

```{r}
plot_resid(fit1, type = "pearson") # plot_resid() comes from the project utilities loaded above

# or with car
residualPlots(fit1)
```

There is clear evidence of overdispersion. But we have several other variables, so before switching to another model let's fit them all:

```{r}
fit_s <- glm(tantrum ~ attachment + temperament + parent_se + parent_skills, family = poisson(link = "log"), data = dat)
summary(fit_s)
```

Let's check the overdispersion and the Pearson residuals again:

```{r}
plot_resid(fit_s, type = "pearson")

# or with car
residualPlots(fit_s)
```

The majority of the residuals seem fine, but a few are very large and the overdispersion is still present:

```{r}
sum(residuals(fit_s, type = "pearson")^2)/fit_s$df.residual
performance::check_overdispersion(fit_s)
```

# 4. Diagnostics

Another reason for overdispersion could be the presence of outliers and influential points. Let's have a look at the Cook's distances:

```{r}
car::influenceIndexPlot(fit_s, vars = c("cook", "hat", "Studentized"))
```
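
To flag the most extreme observations numerically as well, we can sort the Cook's distances directly:

```{r}
# The largest Cook's distances point to the most influential observations
sort(cooks.distance(fit_s), decreasing = TRUE)[1:5]
```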

There are two observations (111 and 112) with a very high Cook's distance and very large studentized residuals. We can refit the model without them and check what happens:

```{r}
dat_no_out <- dat[-c(111, 112), ]
fit_no_out <- glm(tantrum ~ attachment + temperament + parent_se + parent_skills, family = poisson(link = "log"), data = dat_no_out)
summary(fit_no_out)
```

Let's check whether removing these observations improves the model, especially in terms of overdispersion:

```{r}
sum(residuals(fit_no_out, type = "pearson")^2)/fit_no_out$df.residual
performance::check_overdispersion(fit_no_out)
```
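
The dispersion ratio barely changes, so these two points are not the main source of the extra variability. A common remedy (a sketch beyond what this lab requires) is a quasi-Poisson fit, which keeps the same point estimates but inflates the standard errors by the dispersion ratio:

```{r}
fit_qp <- glm(tantrum ~ attachment + temperament + parent_se + parent_skills,
              family = quasipoisson(link = "log"), data = dat_no_out)
summary(fit_qp)$dispersion # estimated dispersion parameter
```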

We can also compare the two models in terms of coefficients:

```{r}
car::compareCoefs(fit_s, fit_no_out)
```

Some coefficients change slightly. We can also check the dfbeta plots:

```{r}
dfbeta_plot(fit_s) # helper from the project utilities loaded above
```

The flagged observations do not seem to affect the estimated parameters much, but they do impact the overall model fit, deviance, and residuals.

Let's have a look at the residuals now:

```{r}
car::residualPlot(fit_no_out)
```

There is still some strange patterning, but the majority of the residuals lie between -1 and 1.

# 5. Interpreting parameters

Before anything else, just plot the effects:

```{r, fig.width=10, fig.height=10}
plot(allEffects(fit_no_out))
```


Now we can interpret model parameters:

```{r}
summary(fit_no_out)
```

The `(Intercept)` is the expected number of tantrums for "insecure", "difficult" children when `parent_skills` is rated 0 and the parents' self-esteem is 0, thus `r round(exp(coef(fit_no_out)[1]), 2)`. As in the binomial lab, we could center the two numerical variables for a more meaningful interpretation, or we can use the `predict()` function to obtain the values that we want.

```{r}
predict(fit_no_out, newdata = data.frame(attachment = "insecure", 
                                         temperament = "difficult",
                                         parent_se = mean(dat$parent_se), 
                                         parent_skills = mean(dat$parent_skills)),
        type = "response") # same as exp(prediction)
```
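
The same number can be obtained by hand from the coefficients: build the linear predictor on the log scale and exponentiate it (for the reference levels the dummy variables are 0 and drop out):

```{r}
b <- coef(fit_no_out)
eta <- b["(Intercept)"] + b["parent_se"] * mean(dat$parent_se) +
  b["parent_skills"] * mean(dat$parent_skills)
exp(eta) # matches the predict() call above
```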

The `attachmentsecure` coefficient is the expected difference in the log number of tantrums for `secure - insecure` attachment, controlling for the other variables:

```{r}
emmeans(fit_no_out, pairwise~attachment)
```
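
Since `attachment` is not involved in any interaction, the emmeans contrast is simply the sign-flipped model coefficient:

```{r}
-coef(fit_no_out)["attachmentsecure"] # insecure - secure on the log scale
```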

On the response scale, we can interpret it as the multiplicative change in the number of tantrums moving from insecure to secure attachment:

```{r}
exp(coef(fit_no_out)["attachmentsecure"])
```

Moving from insecure to secure attachment, there is a decrease in the expected number of tantrums of `r round(100 - exp(coef(fit_no_out)["attachmentsecure"]) * 100, 2)`%.

The `temperamenteasy` can be interpreted in the same way:

```{r}
emmeans(fit_no_out, pairwise~temperament)
```

```{r}
exp(coef(fit_no_out)["temperamenteasy"])
```

So there is a reduction of `r round(100 - exp(coef(fit_no_out)["temperamenteasy"]) * 100, 2)`% moving from a difficult to an easy temperament.

`parent_se` and `parent_skills` are interpreted similarly: each coefficient represents the increase/decrease in the log number of tantrums for a one-unit increase in the predictor.

```{r}
exp(coef(fit_no_out)[c("parent_se", "parent_skills")]) # indexing by name is safer than by position
```

So the number of tantrums seems to be unaffected by the parents' self-esteem, but as the parenting skills increase the number of tantrums decreases.

# 6. Model selection

Let's compare the models with and without the `parent_se` term, which does not appear very useful:

```{r}
fit_no_parent_se <- update(fit_no_out, . ~ . -parent_se)
summary(fit_no_parent_se)
anova(fit_no_parent_se, fit_no_out, test = "LRT")
```

```{r}
drop1(fit_no_out, test = "LRT")
```
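
The same comparison can also be framed in terms of AIC (lower is better):

```{r}
AIC(fit_no_parent_se, fit_no_out)
```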

Or using the `MuMIn::dredge()` function:

```{r}
fit_no_out <- update(fit_no_out, na.action = na.fail) # dredge() requires na.fail so that all submodels are fitted to the same rows
MuMIn::dredge(fit_no_out, rank = "AIC")
```

# 7. What about interactions?

We can also have a look at interactions. Try by yourself to explore the interactions between the numerical (`parent_skills` and `parent_se`) and the categorical (`attachment` and `temperament`) variables; I'm only interested in one continuous variable interacting with one categorical variable. A starter sketch follows the list below.

- fit a separate model for each interaction
- interpret the model parameters and the analysis of deviance table (`car::` something :)), or use a model comparison (likelihood ratio test) to test the interaction
- plot the model effects
- comment on the results
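
As a starting point, here is a sketch for one of the four combinations (the other three follow the same pattern; the names `fit_int` and `fit_main` are just illustrative):

```{r, eval = FALSE}
fit_int <- glm(tantrum ~ parent_skills * temperament,
               family = poisson(link = "log"), data = dat_no_out)
summary(fit_int)
car::Anova(fit_int) # the car:: hint above

# or, equivalently for the interaction, a likelihood ratio test
# against the main-effects model
fit_main <- update(fit_int, . ~ . - parent_skills:temperament)
anova(fit_main, fit_int, test = "LRT")
plot(allEffects(fit_int))
```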
