Filippo Gambarota
Gianmarco Altoè
University of Padova
08 February 2024
The statistical power is defined as the probability of correctly rejecting the null hypothesis \(H_0\).
For simple designs such as t-tests, ANOVA, etc., the power can be computed analytically. For example, let’s find the power of detecting an effect size of \(d = 0.5\) with \(n_1 = n_2 = 30\).
d <- 0.5
alpha <- 0.05
n1 <- n2 <- 30
sp <- 1 # pooled standard deviation
# Calculate non-centrality parameter (delta)
delta <- d * sqrt(n1 * n2 / (n1 + n2))
# Calculate degrees of freedom
df <- n1 + n2 - 2
# Calculate critical t-value
critical_t <- qt(1 - alpha / 2, df)
# Noncentrality parameter on the t scale (sp = 1, so this equals delta)
non_central_t <- delta / sp
# Approximate power by shifting the central t distribution
1 - pt(critical_t - non_central_t, df)
#> [1] 0.4741093
The same can be done using the pwr package:
power <- pwr::pwr.t.test(n = n1, d = 0.5)
power
#>
#> Two-sample t test power calculation
#>
#> n = 30
#> d = 0.5
#> sig.level = 0.05
#> power = 0.4778965
#> alternative = two.sided
#>
#> NOTE: n is number in *each* group
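The small difference between the two values arises because the code above approximates the noncentral t distribution with a shifted central t. The exact power can be computed with pt() and the ncp argument:

# Exact power using the noncentral t distribution
pt(critical_t, df, ncp = delta, lower.tail = FALSE) +
  pt(-critical_t, df, ncp = delta) # matches the pwr result (0.4778965)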
Sometimes the analytical solution is not available, or we want to estimate the power for more complex scenarios (missing data, unequal variances, etc.). The general workflow is:

1. simulate data under the alternative hypothesis
2. fit the model and store the p-value
3. repeat steps 1-2 many times
4. summarise the results

For example, the power is the number of p-values lower than \(\alpha\) divided by the total number of simulations.
Let’s see the previous example using simulations:
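A minimal simulation sketch, reusing the parameters defined above (the number of simulations nsim and the seed are arbitrary choices):

set.seed(2024)
nsim <- 1e4
pval <- replicate(nsim, {
  g1 <- rnorm(n1, 0, sp)
  g2 <- rnorm(n2, d, sp)
  t.test(g2, g1, var.equal = TRUE)$p.value
})
mean(pval <= alpha) # estimated power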
The estimated value is pretty close to the analytical value.
Also for meta-analysis we have two approaches: analytical and simulation-based.
For the analytical approach we need to make some assumptions:

- the effect sizes are normally distributed, so the test statistic is a \(z\) value
- all studies share the same sampling variance, \(\sigma^2_1 = \dots = \sigma^2_k = \sigma^2\)
- the true effect \(\theta\) (or \(\mu_{\theta}\)) and the heterogeneity \(\tau^2\) are fixed and known
Under these assumptions the power is:
\[ (1 - \Phi(c_{\alpha} - \lambda)) + \Phi(-c_{\alpha} - \lambda) \]
Where \(c_{\alpha}\) is the critical \(z\) value and \(\lambda\) is the noncentrality parameter, i.e., the expected test statistic under the alternative hypothesis.
For an EE model the only source of variability is the sampling variability, thus \(\lambda\) is:
\[ \lambda_{EE} = \frac{\theta}{\sqrt{\sigma^2_{\theta}}} \]
And recalling the previous assumption that \(\sigma^2_1 = \dots = \sigma^2_k = \sigma^2\):
\[ \sigma^2_{\theta} = \frac{\sigma^2}{k} \]
For example, consider a meta-analysis of \(k = 15\) studies where each study has a sample size of \(n_1 = n_2 = 20\) (assuming again the unstandardized mean difference as effect size):
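A minimal sketch of this calculation (the true effect \(\theta = 0.3\) is an assumed value for illustration; with unit standard deviations, the sampling variance of each unstandardized mean difference is \(1/n_1 + 1/n_2\)):

theta <- 0.3 # assumed true effect (unstandardized mean difference)
k <- 15
n1 <- n2 <- 20
sigma2 <- 1/n1 + 1/n2      # sampling variance of each study (sd = 1)
sigma2_theta <- sigma2 / k # variance of the pooled effect
lambda <- theta / sqrt(sigma2_theta)
zc <- qnorm(1 - 0.05/2)    # critical z value
(1 - pnorm(zc - lambda)) + pnorm(-zc - lambda)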
Be careful that the EE model assumes \(\tau^2 = 0\); thus it is like having one huge study with \(k \times n_1\) participants per group.
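In fact, \(\lambda\) for an EE model with \(k\) studies of \(n_1 = n_2\) participants per group is identical to \(\lambda\) for a single study with \(k \times n_1\) participants per group (continuing the sketch above):

lambda_ee  <- theta / sqrt((1/n1 + 1/n2) / k)
lambda_one <- theta / sqrt(1/(k*n1) + 1/(k*n2))
c(lambda_ee, lambda_one) # identical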
For the RE model we just need to include \(\tau^2\) in the \(\lambda\) calculation, thus:
\[ \sigma^{2\star}_{\theta} = \frac{\sigma^2 + \tau^2}{k} \]
The other calculations are the same as the EE model.
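For example, adding an assumed \(\tau^2 = 0.1\) to the previous sketch:

tau2 <- 0.1 # assumed heterogeneity
sigma2_theta_star <- (sigma2 + tau2) / k
lambda_star <- theta / sqrt(sigma2_theta_star)
(1 - pnorm(zc - lambda_star)) + pnorm(-zc - lambda_star)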
The power is reduced because we are considering another source of variability (the between-study heterogeneity). Clearly, the maximal power for \(k\) studies is achieved when \(\tau^2 = 0\). We can increase the power either by increasing \(k\) (the number of studies) or by reducing \(\sigma^2_k\) (increasing the number of participants in each study).
The most informative approach is plotting the power curves for different values of \(\tau^2\), \(\sigma^2_k\) and \(\theta\) (or \(\mu_{\theta}\)).
You can use the power_meta() function:
k <- c(5, 10, 30, 50, 100)
es <- c(0.1, 0.3)
tau2 <- c(0, 0.05, 0.1, 0.2)
n <- c(10, 30, 50, 100, 1000)
power <- expand_grid(es, k, tau2, n1 = n)
power$power <- power_meta(power$es, power$k, power$tau2, power$n1)
power$es <- factor(power$es, labels = latex2exp::TeX(sprintf("$\\mu_{\\theta} = %s$", es)))
power$tau2 <- factor(power$tau2, labels = latex2exp::TeX(sprintf("$\\tau^2 = %s$", tau2)))
ggplot(power, aes(x = factor(k), y = power, color = factor(n1))) +
  geom_point() +
  geom_line(aes(group = factor(n1))) +
  facet_grid(es ~ tau2, labeller = label_parsed) +
  xlab("Number of Studies (k)") +
  ylab("Power") +
  labs(color = latex2exp::TeX("$n_1 = n_2$"))
With the analytical approach we can (quickly) explore interesting scenarios. For example, we can fix the total sample size \(N = k(n_1 + n_2)\) and check the power for a series of \(k\).
# average meta k = 20, n = 30
kavg <- 20
navg <- 30
N <- kavg * (navg*2)
es <- 0.3
tau2 <- c(0, 0.05, 0.1, 0.2)
k <- seq(10, 100, 10)
n1 <- n2 <- round((N/k)/ 2)
sim <- data.frame(es, k, n1, n2)
sim <- expand_grid(sim, tau2 = tau2)
sim$power <- power_meta(sim$es, sim$k, sim$tau2, sim$n1, sim$n2)
sim$N <- with(sim, k * (n1 + n2))
ggplot(sim, aes(x = k, y = power, color = factor(tau2))) +
  geom_line() +
  ggtitle(latex2exp::TeX("Total N ($k(n_1 + n_2)$) = 1200")) +
  labs(x = "Number of Studies (k)",
       y = "Power",
       color = latex2exp::TeX("$\\tau^2$"))
When \(\tau^2 \neq 0\) we need more studies (even if the total sample size is the same).
With simulations we can fix or relax the previous assumptions. For example, let’s compute the power for an EE model:
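A sketch of this simulation, reusing the sim_studies() helper with the argument order used elsewhere in the document (k, effect size, tau2, n); the scenario values are assumptions:

nsim <- 1000
pval <- rep(NA, nsim)
for(j in 1:nsim){
  dat <- sim_studies(15, 0.3, 0, 20) # k = 15, es = 0.3, tau2 = 0, n = 20
  fit <- rma(yi, vi, data = dat, method = "EE")
  pval[j] <- fit$pval
}
mean(pval <= 0.05) # estimated power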
The value is similar to the analytical result, but we can make the simulation more realistic, e.g., by generating heterogeneous sample sizes.
By repeating the previous approach for a series of parameters we can easily draw a power curve:
k <- c(5, 10, 50, 100)
es <- 0.1
tau2 <- c(0, 0.05, 0.1, 0.2)
nsim <- 1000
grid <- expand_grid(k, es, tau2)
power <- rep(NA, nrow(grid))
for(i in 1:nrow(grid)){
  pval <- rep(NA, nsim)
  for(j in 1:nsim){
    n <- rpois(grid$k[i], 40)
    dat <- sim_studies(grid$k[i], grid$es[i], grid$tau2[i], n)
    fit <- rma(yi, vi, data = dat)
    pval[j] <- fit$pval
  }
  power[i] <- mean(pval <= 0.05)
}
grid$power <- power
The power for a meta-regression can be easily computed by simulating the moderator effect. For example, let’s simulate the effect of a binary predictor \(x\).
k <- seq(10, 100, 10)
b0 <- 0.2 # average of group 1
b1 <- 0.1 # difference between group 1 and 2
tau2r <- 0.2 # residual tau2
nsim <- 1000
power <- rep(NA, length(k))
for(i in 1:length(k)){
  x <- rep(0:1, each = k[i]/2) # binary moderator
  es <- b0 + b1 * x
  pval <- rep(NA, nsim)
  for(j in 1:nsim){
    n <- round(runif(k[i], 10, 100))
    dat <- sim_studies(k[i], es, tau2r, n)
    dat$x <- x
    fit <- rma(yi, vi, mods = ~ x, data = dat) # meta-regression
    pval[j] <- fit$pval[2] # p-value of the moderator
  }
  power[i] <- mean(pval <= 0.05)
}
Then we can plot the results:
power <- data.frame(k = k, power = power)
ggplot(power, aes(x = k, y = power)) +
  geom_line()
Multilab studies can be seen as planned (i.e., prospective) meta-analyses, in contrast to standard retrospective meta-analyses.
The statistical approach is (roughly) the same, with the difference that we control both \(k\) (the number of experimental units) and \(n\) (the sample size within each unit).
In multilab studies we also have the raw data (i.e., participant-level data), thus we can fit more complex multilevel models.
Assuming that we have \(k\) studies with raw data available, there is no need to aggregate, calculate effect sizes and variances, and then use an EE or RE model.
k <- 50
es <- 0.4
tau2 <- 0.1
n <- round(runif(k, 10, 100))
dat <- vector(mode = "list", k)
thetai <- rnorm(k, 0, sqrt(tau2))
for(i in 1:k){
  g1 <- rnorm(n[i], 0, 1)
  g2 <- rnorm(n[i], es + thetai[i], 1)
  d <- data.frame(id = 1:(n[i]*2), unit = i,
                  y = c(g1, g2), group = rep(c(0, 1), each = n[i]))
  dat[[i]] <- d
}
dat <- do.call(rbind, dat)
ht(dat)
#> id unit y group
#> 1 1 1 0.2893323 0
#> 2 2 1 -1.1718120 0
#> 3 3 1 -0.8300565 0
#> 4 4 1 0.1406313 0
#> 5 5 1 1.0616239 0
#> 5473 65 50 0.3235104 1
#> 5474 66 50 0.4872725 1
#> 5475 67 50 -0.5355670 1
#> 5476 68 50 2.4656131 1
#> 5477 69 50 -0.6286605 1
#> 5478 70 50 2.4208509 1
This is a simple multilevel model (like pupils within classrooms or trials within participants). We can fit the model using lme4::lmer():
library(lme4)
fit_lme <- lmer(y ~ group + (1|unit), data = dat)
summary(fit_lme)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ group + (1 | unit)
#> Data: dat
#>
#> REML criterion at convergence: 15712.6
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -3.4455 -0.6876 0.0186 0.6977 3.2117
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> unit (Intercept) 0.04006 0.2002
#> Residual 1.01432 1.0071
#> Number of obs: 5478, groups: unit, 50
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) -0.003784 0.034715 -0.109
#> group 0.417518 0.027215 15.342
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> group -0.392
Let’s do the same as a meta-analysis. First, we compute the effect sizes for each unit:
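One way to do the aggregation (a sketch, assuming dplyr is loaded; since the effect size is the unstandardized mean difference, we can use metafor::escalc() with measure = "MD" — the exact code may differ):

library(dplyr)
datagg <- dat |>
  group_by(unit) |>
  summarise(m0 = mean(y[group == 0]), m1 = mean(y[group == 1]),
            sd0 = sd(y[group == 0]), sd1 = sd(y[group == 1]),
            n0 = sum(group == 0), n1 = sum(group == 1)) |>
  as.data.frame()
datagg <- metafor::escalc(measure = "MD",
                          m1i = m1, m2i = m0, sd1i = sd1, sd2i = sd0,
                          n1i = n1, n2i = n0, data = datagg)
ht(datagg)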
#>
#> unit m0 m1 sd0 sd1 n0 n1 yi vi
#> 1 1 -0.182281099 0.67740970 0.9313053 0.8599736 32 32 0.8597 0.0502
#> 2 2 -0.080684001 0.09258349 0.8352305 1.0518596 59 59 0.1733 0.0306
#> 3 3 -0.194020177 0.04229138 1.0409077 0.9633482 80 80 0.2363 0.0251
#> 4 4 -0.122323160 0.61829855 0.9724882 0.8976187 36 36 0.7406 0.0487
#> 5 5 0.240849696 0.27411106 0.9550909 0.8178345 36 36 0.0333 0.0439
#> 45 45 0.014118034 -0.16639800 1.1398257 0.9029696 81 81 -0.1805 0.0261
#> 46 46 0.258466056 0.70023676 0.9593704 1.0152041 31 31 0.4418 0.0629
#> 47 47 -0.108808858 -0.06605484 1.0886507 1.0221369 47 47 0.0428 0.0474
#> 48 48 -0.101801658 1.06380067 0.9214148 1.0912795 27 27 1.1656 0.0756
#> 49 49 -0.004267983 0.06425461 1.0686583 0.9415882 58 58 0.0685 0.0350
#> 50 50 -0.108280467 0.36343002 0.9505833 1.2846007 35 35 0.4717 0.0730
Then we can fit the model:
fit_rma <- rma(yi, vi, data = datagg)
fit_rma
#>
#> Random-Effects Model (k = 50; tau^2 estimator: REML)
#>
#> tau^2 (estimated amount of total heterogeneity): 0.1112 (SE = 0.0306)
#> tau (square root of estimated tau^2 value): 0.3335
#> I^2 (total heterogeneity / total variability): 75.80%
#> H^2 (total variability / sampling variability): 4.13
#>
#> Test for Heterogeneity:
#> Q(df = 49) = 207.6604, p-val < .0001
#>
#> Model Results:
#>
#> estimate se zval pval ci.lb ci.ub
#> 0.4653 0.0553 8.4168 <.0001 0.3570 0.5737 ***
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results are very similar: the standard deviation of the random intercepts of the lme4 model is \(\approx \tau\), and the group effect corresponds to the intercept of the rma model.
data.frame(
  b = c(fixef(fit_lme)[2], fit_rma$b),
  se = c(summary(fit_lme)$coefficients[2, 2], fit_rma$se),
  tau2 = c(as.numeric(VarCorr(fit_lme)[[1]]), fit_rma$tau2),
  model = c("lme4", "metafor")
)
#> b se tau2 model
#> group 0.4175182 0.02721483 0.04006087 lme4
#> 0.4653258 0.05528509 0.11120165 metafor
Actually the two models are not exactly the same, especially when using only the aggregated data. See https://www.metafor-project.org/doku.php/tips:rma_vs_lm_lme_lmer.
Note that aggregating the data and then fitting a standard (unweighted) model (as is sometimes done with trial-level data) is wrong and should be avoided. The meta-analytic framework makes it clear that aggregating without taking into account the precision of each cluster (e.g., study or subject) is misleading.
dataggl <- datagg |>
  select(unit, m0, m1) |>
  pivot_longer(c(m0, m1), values_to = "y", names_to = "group")
summary(lmer(y ~ group + (1|unit), data = dataggl))
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ group + (1 | unit)
#> Data: dataggl
#>
#> REML criterion at convergence: 47.9
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -2.31836 -0.54276 -0.03706 0.61381 2.36896
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> unit (Intercept) 0.01407 0.1186
#> Residual 0.07520 0.2742
#> Number of obs: 100, groups: unit, 50
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) -0.02872 0.04226 -0.680
#> groupm1 0.48411 0.05485 8.827
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> groupm1 -0.649
When planning a multilab study there is an important trade-off between increasing the sample size within each unit (more effort for each lab) and recruiting more units with fewer participants per unit (more effort for the organization).
We could also face the situation where the number of units \(k\) is fixed and we can only increase the sample size.
We can also simulate scenarios where some units collect all the data while others do not complete the data collection.
Let’s assume that the maximum number of labs is \(10\). How many participants are required assuming a certain amount of heterogeneity?
es <- 0.2
k <- 10
n1 <- n2 <- seq(10, 500, 10)
tau2 <- c(0.01, 0.05, 0.1, 0.2)
sim <- expand_grid(k, es, tau2, n1)
sim$n2 <- sim$n1
sim$vt <- with(sim, 1/n1 + 1/n2)
sim$I2 <- round(with(sim, tau2 / (tau2 + vt)) * 100, 3)
sim$power <- power_meta(sim$es, sim$k, sim$tau2, sim$n1, sim$n2)
ht(sim)
#> # A tibble: 11 × 8
#> k es tau2 n1 n2 vt I2 power
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 0.2 0.01 10 10 0.2 4.76 0.281
#> 2 10 0.2 0.01 20 20 0.1 9.09 0.479
#> 3 10 0.2 0.01 30 30 0.0667 13.0 0.627
#> 4 10 0.2 0.01 40 40 0.05 16.7 0.733
#> 5 10 0.2 0.01 50 50 0.04 20 0.807
#> 6 10 0.2 0.2 450 450 0.00444 97.8 0.288
#> 7 10 0.2 0.2 460 460 0.00435 97.9 0.288
#> 8 10 0.2 0.2 470 470 0.00426 97.9 0.288
#> 9 10 0.2 0.2 480 480 0.00417 98.0 0.288
#> 10 10 0.2 0.2 490 490 0.00408 98 0.288
#> 11 10 0.2 0.2 500 500 0.004 98.0 0.288
With a fixed \(k\), we can reach a power plateau even when increasing \(n\): as \(n \to \infty\) the sampling variance vanishes and \(\lambda\) approaches \(\mu_{\theta} / \sqrt{\tau^2 / k}\). Where the plateau lies depends on \(\mu_{\theta}\) and \(\tau^2\).
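We can check the plateau numerically: with \(k = 10\), \(\mu_{\theta} = 0.2\) and \(\tau^2 = 0.2\) (the scenario above), the maximal power is:

lambda_max <- 0.2 * sqrt(10 / 0.2) # theta * sqrt(k / tau2)
zc <- qnorm(1 - 0.05/2)
(1 - pnorm(zc - lambda_max)) + pnorm(-zc - lambda_max) # ~0.29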
A special type of multilab study is the replication project. There are some papers discussing how to view replication studies as meta-analyses and how to plan them.