Filippo Gambarota
University of Padova
2023
Last Update: 2023-11-29
Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be deterministic in principle.
Regardless of the specific application, Monte Carlo simulations follow a similar pattern:
In R there are several functions to generate random numbers and they are linked to specific probability distributions. You can type ?family to see the distributions available for glm.
?family
In fact, there are other useful distributions not listed in ?family, because they are not part of glm: for example the beta or the unif (uniform) distributions. Use ?Distributions for a complete list:
?Distributions
However, it is always possible to include other distributions with packages. For example, MASS::mvrnorm() implements the multivariate normal distribution, and the extraDistr package provides a series of additional distributions such as the half-Cauchy (extraDistr::rhcauchy()).
The general pattern is always the same. There are 4 function prefixes, r, d, p and q, that are combined with a distribution name (e.g., norm) to create several utilities. For example, rnorm() generates random numbers from a normal distribution, dnorm() computes the density, pnorm() the cumulative distribution function and qnorm() the quantile function.
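For example, applied to the standard normal distribution:
rnorm(5, mean = 0, sd = 1) # 5 random draws from N(0, 1)
dnorm(0)                   # density at 0
pnorm(1.96)                # P(X <= 1.96), about 0.975
qnorm(0.975)               # quantile, about 1.96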
Monte Carlo simulations are used for several purposes:
A classical example is estimating the standard error (SE) of a statistic. For example, we know that the SE of a sample mean is:
\[ \sigma_\overline x = \frac{s_x}{\sqrt{n_x}} \]
Where \(s_x\) is the standard deviation of \(x\) and \(n_x\) is the sample size.
However, suppose we were not able to derive the SE analytically. We know that the SE is the standard deviation of the sampling distribution of a statistic.
The sampling distribution is the distribution obtained by calculating the statistic (in this case the mean) on all possible samples (or a very large number of samples) of size \(n\).
We can solve this problem by creating a very simple Monte Carlo simulation.
We simulate 10000 samples of size \(n\) from a normal distribution with \(\mu = 10\) and \(\sigma = 5\). We calculate the mean \(\overline x\) for each iteration and then we calculate the standard deviation of the vector of means.
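A minimal sketch of this simulation (the value of n here is an illustrative choice):
nsim <- 1e4 # number of Monte Carlo samples
n <- 30     # sample size (illustrative)
mu <- 10
sigma <- 5
means <- replicate(nsim, mean(rnorm(n, mean = mu, sd = sigma))) # sampling distribution of the mean
sd(means)       # Monte Carlo estimate of the SE
sigma / sqrt(n) # analytical SE for comparison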
The general workflow is the following:
Let’s simulate a simple linear model (i.e., GLM with a Gaussian random component and identity link function).
\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i \]
In this example we have:
n <- 100 # sample size
x <- rnorm(n) # continuous predictor
dat <- data.frame(x)
X <- model.matrix(~x, data = dat) # model (design) matrix
head(X)
#> (Intercept) x
#> 1 1 -1.4536667
#> 2 1 -1.6627647
#> 3 1 0.9356198
#> 4 1 -1.1253905
#> 5 1 -1.4189142
#> 6 1 -0.7415666
Then let’s define the model parameters and compute the predicted values.
b0 <- 0      # intercept
b1 <- 0.6    # slope of x
sigma2 <- 1  # residual variance
dat$lp <- b0 + b1*x # linear predictor
plot(dat$x, dat$lp)
Now, we are fitting a model with a Gaussian random component and an identity link function. Thus using the \(g\) function has no effect.
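Since the link is the identity, the response can be simulated directly from the linear predictor. A minimal sketch, using the sigma2 defined above as the residual variance:
dat$y <- rnorm(n, mean = dat$lp, sd = sqrt(sigma2)) # Gaussian random component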
Now we can fit the appropriate model using the glm function:
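The output below was presumably produced by a call along these lines:
fit <- glm(y ~ x, family = gaussian(link = "identity"), data = dat)
summary(fit)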
#>
#> Call:
#> glm(formula = y ~ x, family = gaussian(link = "identity"), data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.0800 -0.8376 0.0194 0.7584 3.1701
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.02196 0.10330 0.213 0.832
#> x 0.55698 0.10298 5.409 4.49e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1.067001)
#>
#> Null deviance: 135.78 on 99 degrees of freedom
#> Residual deviance: 104.57 on 98 degrees of freedom
#> AIC: 294.25
#>
#> Number of Fisher Scoring iterations: 2
A faster way, especially with many parameters, is to use matrix multiplication between the design matrix \(X\) and the vector of coefficients:
\[\begin{equation} \boldsymbol{y} = \begin{bmatrix} 1 & x_{1} \\ 1 & x_{2} \\ 1 & x_{3} \\ \vdots & \vdots \\ 1 & x_{n} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix} \end{equation}\]
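In R, a minimal sketch of this operation, reusing the X model matrix computed above:
dat$lp <- c(X %*% c(b0, b1))                # linear predictor via matrix multiplication
dat$y <- dat$lp + rnorm(n, 0, sqrt(sigma2)) # add the Gaussian error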
Now let’s add another effect, for example a binary variable group:
group <- c("a", "b")
x <- rnorm(n*2)
dat <- data.frame(
x = x,
group = rep(group, each = n)
)
X <- model.matrix(~ group + x, data = dat)
head(X)
#> (Intercept) groupb x
#> 1 1 0 1.3036322
#> 2 1 0 2.7934811
#> 3 1 0 -0.8691226
#> 4 1 0 1.0234322
#> 5 1 0 -0.4395868
#> 6 1 0 0.8852635
Now the model matrix has another column, groupb, that is the dummy-coded version of the group variable. Let’s set the parameters:
b0 <- 0 # y value when group = "a" and x = 0
b1 <- 1 # difference between groups
b2 <- 0.6 # slope of x
sigma2 <- 1 # residual variance
Then we can compute the linear predictor adding the new parameters:
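A minimal sketch, using the dummy-coded column from the model matrix computed above:
dat$lp <- b0 + b1 * X[, "groupb"] + b2 * dat$x              # linear predictor
dat$y <- rnorm(nrow(dat), mean = dat$lp, sd = sqrt(sigma2)) # Gaussian random component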
dat |>
ggplot(aes(x = x, y = y, color = group)) +
geom_point() +
geom_smooth(method = "lm",
formula = y ~ x,
se = FALSE)
The same using matrix formulation:
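For example, something along these lines:
B <- c(b0, b1, b2)   # vector of coefficients
dat$lp <- c(X %*% B) # linear predictor via matrix multiplication
dat$y <- rnorm(nrow(dat), dat$lp, sqrt(sigma2))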
Then we can fit the model:
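The output below was presumably produced by:
fit <- lm(y ~ group + x, data = dat)
summary(fit)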
#>
#> Call:
#> lm(formula = y ~ group + x, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.75930 -0.77328 -0.02422 0.63481 2.72773
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.007024 0.100033 0.070 0.944
#> groupb 1.179070 0.141083 8.357 1.14e-14 ***
#> x 0.617479 0.070763 8.726 1.11e-15 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.993 on 197 degrees of freedom
#> Multiple R-squared: 0.4035, Adjusted R-squared: 0.3975
#> F-statistic: 66.63 on 2 and 197 DF, p-value: < 2.2e-16
The workflow presented before can be applied to GLMs. The only extra step is applying the link-function transformation.
We simulate data by fixing the coefficients and computing \(\eta\), then we apply the inverse of the link function (steps 4 and 5 from the workflow slide).
Let’s simulate the effect of a continuous predictor on the probability of success, thus using a Binomial model.
ns <- 100 # sample size
x <- runif(ns) # x predictor
b0 <- qlogis(0.001) # logit of the probability of a correct response when x is 0
b1 <- 10 # increase in the logit of a correct response for a unit increase in x
dat <- data.frame(id = 1:ns, x = x)
head(dat)
#> # A tibble: 6 × 2
#> id x
#> <int> <dbl>
#> 1 1 0.841
#> 2 2 0.461
#> 3 3 0.330
#> 4 4 0.543
#> 5 5 0.650
#> 6 6 0.965
Let’s compute \(\eta\) as the linear combination of predictors and coefficients:
dat$lp <- b0 + b1 * dat$x
ggplot(dat, aes(x = x, y = lp)) +
geom_line() +
ylab(latex("\\eta")) +
xlab("x")
Then we can compute \(g^{-1}(\eta)\) by applying the inverse of the link function. Here the link is the logit, so its inverse is the logistic function:
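A minimal sketch, assuming that the fam object used later in this document is the binomial family with a logit link:
fam <- binomial(link = "logit") # assumed definition of fam
dat$p <- fam$linkinv(dat$lp)    # inverse logit, equivalent to plogis(dat$lp)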
So far we have the expected probability of success for each participant and each value of \(x\), but we still need to include the random component. We can use \(p\), or \(g^{-1}(\eta)\) more generally, as the \(\mu\) parameter of the probability distribution and sample from it.
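For a Bernoulli (0/1) response this amounts to sampling one trial per observation:
dat$y <- rbinom(nrow(dat), size = 1, prob = dat$p) # random component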
Now we have simulated a vector of responses with the appropriate random component. We can plot the results.
dat |>
ggplot(aes(x = x, y = y)) +
geom_point(position = position_jitter(height = 0.05)) +
stat_smooth(method = "glm",
method.args = list(family = fam),
se = FALSE)
Finally, we can fit the model and check whether the parameters are estimated correctly. Of course, we know the true data-generating process, thus we are fitting the best possible model.
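The output below was presumably produced by:
fit <- glm(y ~ x, family = fam, data = dat)
summary(fit)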
#>
#> Call:
#> glm(formula = y ~ x, family = fam, data = dat)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.1512 -0.5100 -0.1833 0.4998 2.5647
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -5.858 1.112 -5.266 1.40e-07 ***
#> x 8.453 1.619 5.220 1.79e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 129.489 on 99 degrees of freedom
#> Residual deviance: 72.287 on 98 degrees of freedom
#> AIC: 76.287
#>
#> Number of Fisher Scoring iterations: 6
Once the data generation process and the model have been defined, the power analysis is straightforward.
The hardest part is fixing plausible values according to your knowledge and/or previous literature.
For example, there are methods to convert from an odds ratio to Cohen’s \(d\) or other metrics.
The effectsize package is a great resource to understand and compute effect sizes.
or <- 1.5 # odds ratio
effectsize::oddsratio_to_d(or)
#> [1] 0.2235446
We can see the relationship between \(d\) and (log) Odds Ratio:
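For illustration, a plot of this relationship could be produced with something along these lines (the grid of values is illustrative):
logor <- seq(log(0.1), log(10), length.out = 100) # grid of log odds ratios
d <- effectsize::oddsratio_to_d(exp(logor))       # convert each odds ratio to Cohen's d
plot(logor, d, type = "l", xlab = "log Odds Ratio", ylab = "Cohen's d")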
For example, we can simulate a logistic regression with a binary predictor, fixing the effect size:
n <- 30 # number of subjects per group
d <- 0.5 # effect size in Cohen's d
or <- effectsize::d_to_oddsratio(d) # odds ratio; its log will be beta1
x <- rep(c("a", "b"), each = n)
xc <- ifelse(x == "a", 0, 1)
dat <- data.frame(x = x, xc = xc)
b0 <- qlogis(0.3) # logit of the probability of success for group a
b1 <- log(or)
dat$lp <- b0 + b1 * dat$xc
dat$y <- rbinom(nrow(dat), 1, plogis(dat$lp))
head(dat)
#> # A tibble: 6 × 4
#> x xc lp y
#> <chr> <dbl> <dbl> <int>
#> 1 a 0 -0.847 0
#> 2 a 0 -0.847 0
#> 3 a 0 -0.847 0
#> 4 a 0 -0.847 1
#> 5 a 0 -0.847 0
#> 6 a 0 -0.847 0
Clearly, we need to repeat the sampling process several times, store the results (e.g., the p-value of \(\beta_1\)) and then compute the power.
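A minimal sketch of this step for the single condition above (nsim is an illustrative choice and is reused by the loop below; fam is the binomial family with a logit link):
nsim <- 1000 # number of simulated datasets (illustrative)
p <- rep(0, nsim)
for(j in 1:nsim){
  dat$y <- rbinom(nrow(dat), 1, plogis(dat$lp)) # resample the responses
  fit <- glm(y ~ x, data = dat, family = fam)
  p[j] <- summary(fit)$coefficients["xb", "Pr(>|z|)"]
}
mean(p <= 0.05) # estimated power with n = 30 per group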
With just one condition the power analysis is not really meaningful, so we can compute the same quantity for different sample sizes. Here the code uses a series of for loops, but there could be a nicer implementation.
ns <- c(30, 50, 100, 150) # sample sizes per group
power <- rep(0, length(ns))

for(i in 1:length(ns)){
  p <- rep(0, nsim)
  for(j in 1:nsim){
    dat <- data.frame(id = 1:(ns[i]*2), x = rep(c("a", "b"), each = ns[i]))
    dat$xc <- ifelse(dat$x == "a", 0, 1)
    dat$lp <- b0 + b1 * dat$xc
    dat$y <- rbinom(nrow(dat), 1, plogis(dat$lp))
    fit <- glm(y ~ x, data = dat, family = fam)
    p[j] <- summary(fit)$coefficients["xb", "Pr(>|z|)"]
  }
  power[i] <- mean(p <= 0.05)
}
Then we can compute the results:
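For example, a minimal sketch of collecting the estimated power into a data frame and plotting the power curve (names are illustrative):
power_res <- data.frame(n = ns, power = power)
power_res
plot(power_res$n, power_res$power, type = "b",
     xlab = "Sample size per group", ylab = "Estimated power")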