Replicability Crisis in Science?
Filippo Gambarota
8-10 July 2024
Meta-analysis is a statistical procedure to combine evidence from a group of studies.
It is usually combined with a systematic review of the literature.
It is considered the gold-standard approach when we want to summarise, and make inferences about, a specific research area.
To compare results from different studies, we should use a common metric. Meta-analysts frequently use standardized effect sizes, for example the Pearson correlation or Cohen's \(d\):
\[ r = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2}\sum{(y_i - \bar{y})^2}}} \]
\[ d = \frac{\bar{x_1} - \bar{x_2}}{s_p} \]
\[ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]
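As an illustrative sketch, we can compute \(s_p\) and \(d\) by hand from two hypothetical samples:
set.seed(1)
x1 <- rnorm(30, 0.5, 1) # hypothetical group 1
x2 <- rnorm(30, 0.0, 1) # hypothetical group 2
n1 <- length(x1); n2 <- length(x2)
# pooled standard deviation
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
(mean(x1) - mean(x2)) / sp # Cohen's d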
The advantage of standardized effect sizes is that, regardless of the original variables, the interpretation and the scale are the same. For example, the Pearson correlation ranges between -1 and 1, and Cohen's \(d\) ranges between \(-\infty\) and \(\infty\) and is interpreted as the number of standard deviations by which the two groups differ.
S <- matrix(c(1, 0.7, 0.7, 1), nrow = 2) # correlation matrix, r = 0.7
X <- MASS::mvrnorm(100, c(0, 2), S, empirical = TRUE) # two variables, means 0 and 2
par(mfrow = c(1, 2))
# left panel: scatter plot with the observed Pearson correlation
plot(X, xlab = "x", ylab = "y", cex = 1.3, pch = 19,
     cex.lab = 1.2, cex.axis = 1.2,
     main = latex2exp::TeX(sprintf("$r = %.2f$", cor(X[, 1], X[, 2]))))
abline(lm(X[, 2] ~ X[, 1]), col = "firebrick", lwd = 2)
# right panel: densities of the two variables with the observed Cohen's d
plot(density(X[, 1]), xlim = c(-5, 7), ylim = c(0, 0.5), col = "dodgerblue", lwd = 2,
     main = latex2exp::TeX(sprintf("$d = %.2f$", lsr::cohensD(X[, 1], X[, 2]))),
     xlab = "")
lines(density(X[, 2]), col = "firebrick", lwd = 2)
There are multiple methods to estimate the sampling variability of Cohen's \(d\). For example:
\[ V_d = \frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2)} \]
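A one-line helper implementing this formula (illustrative):
# sampling variance of Cohen's d (formula above)
vd <- function(d, n1, n2) (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
vd(0.5, 30, 30) # e.g., d = 0.5 with two groups of 30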
Each effect size has a specific formula for its sampling variability. The sample size is usually the most important quantity: studies with large sample sizes have low sampling variability.
As the sample size grows to infinity, the sampling variability approaches zero.
Meta-analysis notation is a little bit inconsistent in textbooks and papers. We define here some rules to simplify the work.
For the examples and plots I'm going to use simulated data. We simulate unstandardized mean differences (UMD) because the computations are easier and the estimator is unbiased (e.g., Viechtbauer 2005).
More specifically we simulate hypothetical studies where two independent groups are compared:
\[ \Delta = \overline{X_1} - \overline{X_2} \]
\[ SE_{\Delta} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}} \]
With \(X_{1_i} \sim \mathcal{N}(\Delta, 1)\) and \(X_{2_i} \sim \mathcal{N}(0, 1)\), so that the expected difference is \(\Delta\).
The main advantage is that, compared to standardized effect sizes, the sampling variability does not depend on the effect size itself, simplifying the computations.
Let's imagine we have \(k = 10\) studies. The segments are the 95% confidence intervals.
What could you say about the phenomenon?
We could say that the average effect is roughly 0.5 and that there is some variability around the average.
avg <- mean(dat$yi)
qf + geom_vline(xintercept = avg, color = "firebrick") +
geom_segment(aes(x = yi, y = id-0.3, xend = avg, yend = id-0.3),
color = "firebrick")
\[ \overline y = \frac{\sum_{i = 1}^k y_i}{k} \]
ybar <- mean(dat$yi)
ybar
#> [1] 0.5600977
The simple average is a good statistic, but some studies are clearly more precise (narrower confidence intervals) than others, i.e., their sampling variance is lower.
We can compute a weighted average where each study is weighted (\(w_i\)) by the inverse of its variance. This is called inverse-variance weighting. Clearly, a weighted average where all weights are the same reduces to a simple unweighted average.
\[ \overline y = \frac{\sum_{i = 1}^k y_iw_i}{\sum_{i = 1}^k w_i} \] \[ w_i = \frac{1}{v_i} \tag{1}\]
wi <- 1/dat$vi
ywbar <- weighted.mean(dat$yi, wi)
ywbar
#> [1] 0.4558081
The value is (not so) different compared to the unweighted version (0.5600977). The two estimates are not exactly the same, but given that the weights are fairly homogeneous they are similar.
What we did is a very simple model, but it is actually a meta-analysis model, commonly known as the equal-effects (or fixed-effect) model.
quick_forest(dat, weigth = TRUE) +
geom_vline(xintercept = ywbar, color = "firebrick") +
ggtitle("Weighted Average")
The EE model is the simplest meta-analysis model. The assumptions are that all studies estimate a single true effect \(\theta\) and that the observed variability is due to sampling error only.
Formally, we can define the EE model as: \[ y_i = \theta + \epsilon_i \tag{2}\]
\[ \epsilon_i \sim \mathcal{N}(0, \sigma^2_i) \tag{3}\]
Where \(\sigma^2_i\) are the sampling variances of the \(k\) studies. This is a standard linear model, but with heterogeneous sampling variances.
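For intuition, the same point estimate can be obtained with a weighted intercept-only regression, a sketch using the simulated dat object from above (note that lm() reports a different standard error because it estimates an additional residual scale):
# EE point estimate via weighted least squares (weights = inverse variances)
fit_wls <- lm(yi ~ 1, data = dat, weights = 1/vi)
coef(fit_wls) # matches the inverse-variance weighted mean computed earlier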
A crucial part of the EE model is that, with very large sample sizes, \(\epsilon_i\) tends to 0 and each study becomes an almost perfect estimate of \(\theta\). Let's simulate two scenarios, with \(k = 10\) studies and per-group sample sizes of \(n = 30\) vs. \(n = 500\). The true effect size is \(0.5\) in both.
dat_low <- sim_studies(10, 0.5, 0, 30, 30)
dat_high <- sim_studies(10, 0.5, 0, 500, 500)
qf_low <- quick_forest(dat_low) +
geom_vline(xintercept = 0.5, color = "firebrick") +
xlim(c(-2, 2)) +
ggtitle(latex2exp::TeX("$n_{1,2} = 30$"))
qf_high <- quick_forest(dat_high) +
geom_vline(xintercept = 0.5, color = "firebrick") +
xlim(c(-2, 2)) +
ggtitle(latex2exp::TeX("$n_{1,2} = 500$"))
plt_high_low <- cowplot::plot_grid(
qf_low,
qf_high
)
plt_high_low
Clearly, as \(n\) increases, each study becomes essentially a perfect estimate of \(\theta\), as depicted in the theoretical figure (see slide 1.11).
The EE model is appropriate if our studies are exact (or near-exact) replications estimating the same effect. We are assuming that there is no true variability.
However, meta-analyses rarely pool the results of \(k\) exact replicates. It is more common to include studies with the same underlying objective but roughly similar methods.
In this case, even with extremely large studies, the effect could be larger under some conditions and smaller or absent under others.
In other terms, we are assuming that there could be some variability (i.e., heterogeneity) among studies that is independent of the sample size (or, more generally, of the precision).
We can extend the EE model by including another source of variability, \(\tau^2\): the true heterogeneity among studies, caused by methodological differences or intrinsic variability in the phenomenon.
Formally, we can extend Equation 2 as: \[ y_i = \mu_{\theta} + \delta_i + \epsilon_i \tag{4}\]
\[ \delta_i \sim \mathcal{N}(0, \tau^2) \tag{5}\]
\[ \epsilon_i \sim \mathcal{N}(0, \sigma^2_i) \]
Where \(\mu_{\theta}\) is the average effect size and \(\delta_i\) is the study-specific deviation from the average effect (regulated by \(\tau^2\)).
Given that we extended the EE model equation, the estimation of the average effect also needs to be extended. The random-effects (RE) estimate is still a weighted average, but the weights now need to include \(\tau^2\):
\[ \overline y = \frac{\sum_{i = 1}^k y_iw^*_i}{\sum_{i = 1}^k w^*_i} \tag{6}\]
\[ w^*_i = \frac{1}{v_i + \tau^2} \tag{7}\]
The consequence is that the weights change: extremely precise (or imprecise) studies influence the estimate of the average effect less under the RE model than under the EE model.
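As a quick sketch, assuming a hypothetical \(\hat{\tau}^2 = 0.2\), the RE weighted average on the dat object from the earlier example would be:
tau2_hat <- 0.2                    # hypothetical tau^2 estimate
wi_star <- 1 / (dat$vi + tau2_hat) # RE weights (Equation 7)
weighted.mean(dat$yi, wi_star)     # RE weighted average (Equation 6)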
The crucial difference from the EE model is that, even with large \(n\), only \(\theta_i = \mu_{\theta} + \delta_i\) is estimated (almost) without error. As long as \(\tau^2 \neq 0\), there will be variability among the true effect sizes.
Again, we can easily demonstrate this with a simulation. We use the same setup as slide 1.12 but with \(\tau^2 = 0.2\).
dat_low <- sim_studies(10, 0.5, 0.2, 30, 30)
dat_high <- sim_studies(10, 0.5, 0.2, 500, 500)
qf_low <- quick_forest(dat_low) +
geom_vline(xintercept = 0.5, color = "firebrick") +
xlim(c(-3, 3)) +
ggtitle(latex2exp::TeX("$n_{1,2} = 30$"))
qf_high <- quick_forest(dat_high) +
geom_vline(xintercept = 0.5, color = "firebrick") +
xlim(c(-3, 3)) +
ggtitle(latex2exp::TeX("$n_{1,2} = 500$"))
qf_tau_high_low <- cowplot::plot_grid(
qf_low,
qf_high
)
qf_tau_high_low
Even with \(n \to \infty\), the observed variance is not reduced as long as \(\tau^2 \neq 0\).
For the simulations, we generate data directly from the sampling distributions of the effect sizes and variances. The unstandardized effect size is a mean difference between independent groups. The sampling distribution is:
\[ y_i \sim \mathcal{N}(\Delta, \frac{1}{n_1} + \frac{1}{n_2}) \]
Where \(\Delta\) is the population-level effect size and \(n_{1,2}\) are the sample sizes of the two groups. The sampling variances are generated from a \(\chi^2\) distribution:
\[ \sigma_i^2 \sim \frac{\chi^2_{n_1 + n_2 - 2}}{n_1 + n_2 - 2} (\frac{1}{n_1} + \frac{1}{n_2}) \]
Clearly, we can add \(\tau^2\) to include between-study variability. For an equal-effects model \(\Delta = \theta\), thus the equation is unchanged. For a random-effects model, \(\Delta = \theta_i = \mu_\theta + \delta_i\):
\[ y_i \sim \mathcal{N}(\Delta, \tau^2 + \frac{1}{n_1} + \frac{1}{n_2}) \]
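For reference, here is a minimal sketch of how such a generator could be implemented; the actual sim_studies() helper used in these slides may differ in its details:
sim_studies <- function(k, theta, tau2, n0, n1) {
  theta_i <- theta + rnorm(k, 0, sqrt(tau2)) # study-specific true effects
  vi_true <- 1/n0 + 1/n1                     # true sampling variance of the UMD
  yi <- rnorm(k, theta_i, sqrt(vi_true))     # observed effect sizes
  # observed sampling variances from a scaled chi-square distribution
  vi <- rchisq(k, df = n0 + n1 - 2) / (n0 + n1 - 2) * vi_true
  data.frame(yi = yi, vi = vi, sei = sqrt(vi))
}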
The simulation used in these slides is implemented in the sim_studies() function:
dat <- sim_studies(k = 10, theta = 0.5, tau2 = 0.2, n0 = 30, n1 = 30)
dat
#> yi vi sei
#> 1 0.03722128 0.06456132 0.2540892
#> 2 0.86109241 0.07728935 0.2780096
#> 3 -0.20386019 0.08004369 0.2829199
#> 4 0.97094251 0.07066803 0.2658346
#> 5 0.57983381 0.06377108 0.2525294
#> 6 1.53919586 0.07696243 0.2774210
#> 7 0.19259021 0.07343220 0.2709838
#> 8 0.38608057 0.05963238 0.2441974
#> 9 1.19627080 0.07010046 0.2647649
#> 10 0.85360445 0.07567874 0.2750977
So far we discussed estimating the average effect (\(\theta\) or \(\mu_\theta\)). But how do we estimate the heterogeneity?
There are several estimators for \(\tau^2\). We will mainly use the REML estimator; see Viechtbauer (2005) and Veroniki et al. (2016) for more details.
The Q statistic is used to make statistical inference on the heterogeneity. It can be considered a weighted sum of squares:
\[ Q = \sum^k_{i = 1}w_i(y_i - \hat \mu)^2 \]
Where \(\hat \mu\) is the EE estimate (computed regardless of whether \(\tau^2 \neq 0\)) and \(w_i\) are the inverse-variance weights. Note that when \(w_1 = w_2 = \dots = w_k\), Q is just a standard sum of squares (or deviance).
Given that we are summing squared (weighted) distances, \(Q\) should be approximately \(\chi^2\) with \(df = k - 1\). In case of no heterogeneity (\(\tau^2 = 0\)), the observed variability is caused by sampling error only, and the expected value of \(Q\) is just the degrees of freedom (\(df = k - 1\)).
In case of \(\tau^2 \neq 0\), the expected value is \(k - 1 + \lambda\), where \(\lambda\) is a non-centrality parameter.
In other terms, if the observed \(Q\) exceeds the value expected under no heterogeneity, we have evidence that \(\tau^2 \neq 0\).
Let’s try a more practical approach. We simulate a lot of meta-analysis with and without heterogeneity and we check the Q statistics.
# Q statistic: weighted sum of squared deviations from the EE estimate
get_Q <- function(yi, vi){
wi <- 1/vi
theta_ee <- weighted.mean(yi, wi)
sum(wi*(yi - theta_ee)^2)
}
k <- 30
n <- 30
tau2 <- 0.1
nsim <- 1e4
Qs_tau2_0 <- rep(0, nsim)
Qs_tau2 <- rep(0, nsim)
res2_tau2_0 <- vector("list", nsim)
res2_tau2 <- vector("list", nsim)
for(i in 1:nsim){
dat_tau2_0 <- sim_studies(k = 30, theta = 0.5, tau2 = 0, n0 = n, n1 = n)
dat_tau2 <- sim_studies(k = 30, theta = 0.5, tau2 = tau2, n0 = n, n1 = n)
theta_ee_tau2_0 <- weighted.mean(dat_tau2_0$yi, 1/dat_tau2_0$vi)
theta_ee <- weighted.mean(dat_tau2$yi, 1/dat_tau2$vi)
res2_tau2_0[[i]] <- dat_tau2_0$yi - theta_ee_tau2_0
res2_tau2[[i]] <- dat_tau2$yi - theta_ee
Qs_tau2_0[i] <- get_Q(dat_tau2_0$yi, dat_tau2_0$vi)
Qs_tau2[i] <- get_Q(dat_tau2$yi, dat_tau2$vi)
}
df <- k - 1
par(mfrow = c(2,2))
hist(Qs_tau2_0, probability = TRUE, ylim = c(0, 0.08), xlim = c(0, 150),
xlab = "Q",
main = latex2exp::TeX("$\\tau^2 = 0$"))
curve(dchisq(x, df), 0, 100, add = TRUE, col = "firebrick", lwd = 2)
hist(unlist(res2_tau2_0), probability = TRUE, main = latex2exp::TeX("$\\tau^2 = 0$"), ylim = c(0, 2),
xlab = latex2exp::TeX("$y_i - \\hat{\\mu}$"))
curve(dnorm(x, 0, sqrt(1/n + 1/n)), add = TRUE, col = "dodgerblue", lwd = 2)
hist(Qs_tau2, probability = TRUE, ylim = c(0, 0.08), xlim = c(0, 150),
xlab = "Q",
main = latex2exp::TeX("$\\tau^2 = 0.1$"))
curve(dchisq(x, df), 0, 100, add = TRUE, col = "firebrick", lwd = 2)
hist(unlist(res2_tau2), probability = TRUE, main = latex2exp::TeX("$\\tau^2 = 0.1$"), ylim = c(0, 2),
xlab = latex2exp::TeX("$y_i - \\hat{\\mu}$"))
curve(dnorm(x, 0, sqrt(1/n + 1/n)), add = TRUE, col = "dodgerblue", lwd = 2)
The Q statistic is rarely used to directly quantify the heterogeneity. The raw measure of heterogeneity is \(\tau^2\), and the REML (restricted maximum likelihood) estimator is often used:
\[ \hat \tau^2 = \frac{\sum_{i = 1}^k w_i^2[(y_i - \hat \mu)^2 - \sigma^2_i]}{\sum_{i = 1}^k w_i^2} + \frac{1}{\sum_{i = 1}^k w_i} \]
Where \(\hat{\mu}\) is the weighted average (i.e., the maximum likelihood estimate; see Equation 6 and Equation 7) and the weights are \(w_i = 1/(\sigma^2_i + \hat\tau^2)\), which makes the equation iterative.
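A minimal fixed-point sketch of this iteration (illustrative; metafor::rma() with method = "REML" implements it properly):
reml_tau2 <- function(yi, vi, tol = 1e-8, maxit = 100) {
  tau2 <- var(yi) # crude starting value
  for (i in 1:maxit) {
    wi <- 1 / (vi + tau2)       # weights given the current tau2
    mu <- weighted.mean(yi, wi) # weighted average (Equations 6-7)
    tau2_new <- max(0, sum(wi^2 * ((yi - mu)^2 - vi)) / sum(wi^2) + 1 / sum(wi))
    if (abs(tau2_new - tau2) < tol) break
    tau2 <- tau2_new
  }
  tau2_new
}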
# we generate 1e5 effects from a random-effects model with fixed parameters
# and different tau2 values, and check the expected distribution of effect sizes
n <- 100
k <- 1e5
tau2 <- c(0, 0.1, 0.2, 0.5)
dats <- lapply(tau2, function(x) sim_studies(k, 0.5, x, n, n))
names(dats) <- tau2
dat <- bind_rows(dats, .id = "tau2")
dat$tau2 <- factor(dat$tau2)
dat$tau2 <- factor(dat$tau2, labels = latex2exp::TeX(sprintf("$\\tau^2 = %s$", levels(dat$tau2))))
dat |>
ggplot(aes(x = yi, y = after_stat(density))) +
geom_histogram(bins = 50) +
geom_vline(xintercept = 0.5, lwd = 0.5, color = "firebrick") +
facet_wrap(~tau2, labeller = label_parsed) +
ggtitle(latex2exp::TeX("Expected $y_i$ distribution, $d = 0.5$, $n_{1,2} = 100$")) +
xlab(latex2exp::TeX("$y_i$"))
We have two sources of variability in a random-effects meta-analysis: the sampling variabilities \(\sigma_i^2\) and the heterogeneity \(\tau^2\). We can use the \(I^2\) statistic to express the interplay between the two. \[ I^2 = 100\% \times \frac{\hat{\tau}^2}{\hat{\tau}^2 + \tilde{v}} \tag{8}\]
\[ \tilde{v} = \frac{(k-1) \sum w_i}{(\sum w_i)^2 - \sum w_i^2}, \]
Where \(\tilde{v}\) is the typical sampling variance. \(I^2\) is interpreted as the proportion of the total variability due to true heterogeneity (i.e., \(\tau^2\)).
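A small helper implementing Equation 8 (an illustrative sketch; metafor reports \(I^2\) directly):
I2 <- function(tau2, vi) {
  wi <- 1 / vi
  k <- length(vi)
  v_tilde <- (k - 1) * sum(wi) / (sum(wi)^2 - sum(wi^2)) # typical sampling variance
  100 * tau2 / (tau2 + v_tilde)
}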
Note that we can obtain the same \(I^2\) in two completely different meta-analyses: a high \(I^2\) does not necessarily indicate high heterogeneity. Let's assume we have two meta-analyses with \(k\) studies each and small (\(n = 30\)) vs. large (\(n = 500\)) sample sizes.
Let's solve Equation 8 for \(\tau^2\) (using filor::tau2_from_I2()): the same \(I^2\) can be obtained with two completely different \(\tau^2\) values:
n_1 <- 30
vi_1 <- 1/n_1 + 1/n_1
tau2_1 <- filor::tau2_from_I2(0.8, vi_1)
tau2_1
#> [1] 0.2666667
n_2 <- 500
vi_2 <- 1/n_2 + 1/n_2
tau2_2 <- filor::tau2_from_I2(0.8, vi_2)
tau2_2
#> [1] 0.016
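Solving Equation 8 for \(\tau^2\) gives \(\tau^2 = \tilde{v} \, I^2 / (1 - I^2)\). A sketch of what filor::tau2_from_I2() presumably computes:
tau2_from_I2 <- function(I2, v_tilde) I2 / (1 - I2) * v_tilde
tau2_from_I2(0.8, 1/30 + 1/30) # ~0.267, matching the result above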
In other terms, the \(I^2\) can be considered a good index of heterogeneity only when the total variance (\(\tilde{v} + \tau^2\)) is the same.
In R there are several packages to conduct a meta-analysis. In my opinion, the best package is metafor (Viechtbauer 2010). The package supports all the steps of a meta-analysis and also provides amazing documentation.
Disclaimer: we are omitting the important part of collecting the information from published studies and computing the (un)standardized effect sizes. Clearly, this part is highly dependent on the actual dataset.
There are a few useful resources, for example the metafor::escalc() function, which computes most effect sizes and is thoroughly documented.
We can simulate an EE dataset (i.e., \(\tau^2 = 0\)) and then fit the model with metafor:
theta <- 0.3
k <- 30 # number of studies
n <- round(runif(k, 10, 60)) # random sample sizes with a plausible range
dat <- sim_studies(k = k, theta = theta, tau2 = 0, n0 = n, n1 = n)
dat$n <- n # n1 and n2 are the same, put only one
head(dat)
#> yi vi sei n
#> 1 0.5006860 0.05474557 0.2339777 33
#> 2 0.8160800 0.12421999 0.3524486 17
#> 3 0.7038806 0.11651275 0.3413396 17
#> 4 0.3023013 0.14427791 0.3798393 19
#> 5 -0.1286906 0.14727102 0.3837591 18
#> 6 0.5989998 0.03710777 0.1926338 49
We can start with some summary statistics:
# unweighted summary statistics for the effect
summary(dat$yi)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -0.1602 0.1105 0.2959 0.3077 0.5019 0.8161
# unweighted summary statistics for the variances
summary(dat$vi)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.03592 0.04446 0.06456 0.07630 0.09004 0.21512
# sample sizes
summary(dat$n) # n0 and n1 are the same
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 14.00 19.25 32.00 32.70 43.75 60.00
Then we can use the metafor::rma.uni() function (or simply metafor::rma()), providing the effect sizes, the variances, and method = "EE" to specify that we are fitting an equal-effects model.
fit_ee <- rma(yi, vi, method = "EE", data = dat)
summary(fit_ee)
#>
#> Equal-Effects Model (k = 30)
#>
#> logLik deviance AIC BIC AICc
#> -0.7274 27.3050 3.4547 4.8559 3.5976
#>
#> I^2 (total heterogeneity / total variability): 0.00%
#> H^2 (total variability / sampling variability): 0.94
#>
#> Test for Heterogeneity:
#> Q(df = 29) = 27.3050, p-val = 0.5553
#>
#> Model Results:
#>
#> estimate se zval pval ci.lb ci.ub
#> 0.3046 0.0449 6.7781 <.0001 0.2166 0.3927 ***
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can easily show that the result is the same as a simple weighted average using the inverse of the study-specific variances as weights:
wi <- 1/dat$vi
weighted.mean(dat$yi, wi)
#> [1] 0.3046422
The syntax for a random-effects model is the same; we just need to use method = "REML" (or another \(\tau^2\) estimation method):
fit_re <- rma(yi, vi, method = "REML", data = dat)
summary(fit_re)
#>
#> Random-Effects Model (k = 30; tau^2 estimator: REML)
#>
#> logLik deviance AIC BIC AICc
#> -1.2101 2.4203 6.4203 9.1549 6.8818
#>
#> tau^2 (estimated amount of total heterogeneity): 0 (SE = 0.0148)
#> tau (square root of estimated tau^2 value): 0
#> I^2 (total heterogeneity / total variability): 0.00%
#> H^2 (total variability / sampling variability): 1.00
#>
#> Test for Heterogeneity:
#> Q(df = 29) = 27.3050, p-val = 0.5553
#>
#> Model Results:
#>
#> estimate se zval pval ci.lb ci.ub
#> 0.3046 0.0449 6.7781 <.0001 0.2166 0.3927 ***
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can easily compare the two models using the filor::compare_rma() function:
filor::compare_rma(fit_ee, fit_re) |>
round(5)
#> fit_ee fit_re
#> b 0.30464 0.30464
#> se 0.04495 0.04495
#> zval 6.77808 6.77808
#> pval 0.00000 0.00000
#> ci.lb 0.21655 0.21655
#> ci.ub 0.39273 0.39273
#> I2 0.00000 0.00000
#> tau2 0.00000 0.00000
What do you think? Are there differences between the two models? What could be the reasons?
Clearly, given that we simulated \(\tau^2 = 0\), the RE model result is very close (if not identical) to the EE model result. We can now simulate data under a RE model:
tau2 <- 0.2
dat <- sim_studies(k = k, theta = theta, tau2 = tau2, n0 = n, n1 = n)
fit_re <- rma(yi, vi, method = "REML", data = dat)
summary(fit_re)
#>
#> Random-Effects Model (k = 30; tau^2 estimator: REML)
#>
#> logLik deviance AIC BIC AICc
#> -26.0957 52.1914 56.1914 58.9260 56.6529
#>
#> tau^2 (estimated amount of total heterogeneity): 0.2855 (SE = 0.0937)
#> tau (square root of estimated tau^2 value): 0.5343
#> I^2 (total heterogeneity / total variability): 82.53%
#> H^2 (total variability / sampling variability): 5.72
#>
#> Test for Heterogeneity:
#> Q(df = 29) = 167.7055, p-val < .0001
#>
#> Model Results:
#>
#> estimate se zval pval ci.lb ci.ub
#> 0.3436 0.1093 3.1430 0.0017 0.1293 0.5578 **
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The most common plot to represent meta-analysis results is the forest plot:
forest(fit_re)
In a forest plot:
- the x axis represents the effect size
- the y axis represents the studies
- the size of each square is the weight of that study (\(w_i = 1/\sigma_i^2\))
- each interval is the 95% confidence interval
- the diamond is the estimated average effect with its 95% confidence interval
It is also common to plot the studies ordered by effect size, showing asymmetries or other patterns:
forest(fit_re, order = "yi")
So far we reasoned about collecting published studies and conducting a meta-analysis. Despite pooling evidence from multiple studies, there are several limitations:
With multilab studies we set up a new data collection where different research groups conduct an experiment with a similar (or the exact same) methodology. In this way we can:
From a statistical point of view, the only difference is the source of the data (planned and newly collected vs. gathered from the literature). In fact, a multilab study can also be analyzed using standard meta-analysis tools.
When interpreting the results we could highlight some differences:
Let's imagine we plan a multilab study with the same setup as the previous meta-analysis. We are not collecting data from the literature, but we want to know the number of studies \(k\) needed to achieve, for example, adequate statistical power.
We define power as the probability of correctly rejecting the null hypothesis of no effect when the effect is actually present.
We can run a power analysis using Monte Carlo simulations. We can implement a simple simulation as follows:
library(metafor)
library(ggplot2)
# set.seed(2023)
k <- seq(10, 100, 10)
n <- seq(10, 100, 20)
es <- 0.3
tau2 <- 0.1
alpha <- 0.05
nsim <- 1e3
sim <- tidyr::expand_grid(k, n, es, tau2)
sim$power <- 0
# handle errors, return NA
srma <- purrr::possibly(rma, otherwise = NA)
for(i in 1:nrow(sim)){ # iterate for each condition
pval <- rep(0, nsim)
for(j in 1:nsim){ # iterate for each simulation
dat <- sim_studies(k = sim$k[i],
theta = sim$es[i],
tau2 = sim$tau2[i],
n0 = sim$n[i],
n1 = sim$n[i])
fit <- srma(yi, vi, data = dat, method = "REML")
pval[j] <- fit$pval
}
sim$power[i] <- mean(pval <= alpha) # calculate power
filor::pb(nrow(sim), i)
}
saveRDS(sim, here("04-meta-analysis/objects/power-example.rds"))
Then we can plot the results:
sim <- readRDS(here("slides/03-meta-analysis/objects/power-example.rds"))
title <- "$\\mu_\\theta = 0.3$, $\\tau^2 = 0.1$, $\\alpha = 0.05$"
sim |>
ggplot(aes(x = k, y = power, group = n, color = factor(n))) +
geom_line(lwd = 1) +
xlab("Number of Studies (k)") +
ylab("Power") +
labs(color = "Sample Size") +
theme(legend.position = "bottom") +
ggtitle(latex2exp::TeX(title))
This is a flexible way to simulate and plan a multilab study:
Publication bias (PB) is one of the most problematic aspects of meta-analysis. Essentially, the probability of a paper being published (and thus being included in a meta-analysis) is not the same regardless of the result. Clearly, we could also include the gray literature to mitigate the problem.
We cannot (completely) solve PB using statistical tools. PB is a problem of the publishing process and of publishing incentives (significant results are catchier and easier to explain).
The first tool is the funnel plot, a visual check for asymmetry that could be caused by publication bias. If the meta-analysis assumptions are respected and there is no publication bias, the studies should be symmetrically distributed around the average effect, forming the typical funnel shape.
Let's simulate a lot of studies to show this.
Let’s plot the distribution of the data:
hist(dat$yi)
The distribution is clearly normal and centered on the true simulated value. Now let's add the precision (in this case the standard error, \(\sqrt{\sigma_i^2}\)) on the y-axis.
Note that the y axis is reversed, so highly precise studies (\(\sqrt{\sigma_i^2}\) close to 0) appear at the top.
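A minimal base-R sketch of such a funnel plot, using the simulated dat object (metafor::funnel() produces a richer version):
# effect sizes on the x axis, reversed standard errors on the y axis
plot(dat$yi, dat$sei, ylim = rev(range(dat$sei)), pch = 19,
     xlab = "Effect size", ylab = "Standard error")
abline(v = weighted.mean(dat$yi, 1/dat$vi), col = "firebrick", lwd = 2)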
The plot shows the typical funnel shape and there are no missing spots at the bottom. The presence of missing spots is a potential sign of publication bias.
The idea of the Fail-safe N (Rosenthal 1979) is very simple: given a meta-analysis with a significant result (i.e., \(p \leq \alpha\)), how many null studies (i.e., \(\hat \theta = 0\)) do we need to add to obtain \(p > \alpha\)?
metafor::fsn(yi, vi, data = dat)
#>
#> Fail-safe N Calculation Using the Rosenthal Approach
#>
#> Observed Significance Level: <.0001
#> Target Significance Level: 0.05
#>
#> Fail-safe N: 4396757
There are several criticisms of the Fail-safe N procedure:
A basic method to test funnel plot asymmetry is the Egger regression test. Basically, we model the relationship between \(y_i\) and \(\sqrt{\sigma^2_i}\). In the absence of asymmetry, the slope should not differ from 0.
We can use the metafor::regtest() function:
egger <- regtest(fit)
egger
#>
#> Regression Test for Funnel Plot Asymmetry
#>
#> Model: fixed-effects meta-regression model
#> Predictor: standard error
#>
#> Test for Funnel Plot Asymmetry: z = 1.0847, p = 0.2781
#> Limit Estimate (as sei -> 0): b = 0.4821 (CI: 0.4521, 0.5122)
This is a standard (meta-)regression; thus the number of studies, the precision of each study, and the heterogeneity influence the reliability (power, type-1 error rate, etc.) of the procedure.
The Trim and Fill method (Duval and Tweedie 2000) imputes the hypothetical missing studies suggested by the funnel plot and recomputes the meta-analytic effect. Shi and Lin (2019) provide an updated overview of the method with some criticisms.
Let’s simulate again a publication bias with \(k = 100\) studies:
Now we can use the metafor::trimfill() function:
taf <- metafor::trimfill(fit)
taf
#>
#> Estimated number of missing studies on the left side: 29 (SE = 6.4513)
#>
#> Random-Effects Model (k = 129; tau^2 estimator: REML)
#>
#> tau^2 (estimated amount of total heterogeneity): 0.1382 (SE = 0.0203)
#> tau (square root of estimated tau^2 value): 0.3718
#> I^2 (total heterogeneity / total variability): 88.56%
#> H^2 (total variability / sampling variability): 8.74
#>
#> Test for Heterogeneity:
#> Q(df = 128) = 1073.8812, p-val < .0001
#>
#> Model Results:
#>
#> estimate se zval pval ci.lb ci.ub
#> 0.4851 0.0357 13.5951 <.0001 0.4152 0.5551 ***
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The trim-and-fill method estimates that 29 studies are missing. The new effect size, after including the imputed studies, is reduced and closer to the simulated value (but in this case still significant).
We can also visualize the funnel plot, highlighting the points imputed by the method.
funnel(taf)
funnel(taf)
egg <- regtest(fit)
egg_npb <- regtest(taf)
se <- seq(0,1.8,length=100)
lines(coef(egg$fit)[1] + coef(egg$fit)[2]*se, se, lwd=3, col = "black")
lines(coef(egg_npb$fit)[1] + coef(egg_npb$fit)[2]*se, se, lwd=3, col = "firebrick")
legend("topleft", legend = c("Original", "Corrected"), fill = c("black", "firebrick"))
The random-effects meta-analysis probability density function (PDF) can be written as (e.g., Citkowicz and Vevea 2017):
\[ f\left(y_i \mid \beta, \tau^2 ; \sigma_i^2\right)=\phi\left(\frac{y_i-\Delta_i}{\sqrt{\sigma_i^2+\tau^2}}\right) / \sqrt{\sigma_i^2+\tau^2}, \]
And adding the weight function:
\[ f\left(Y_i \mid \beta, \tau^2 ; \sigma_i^2\right)=\frac{\mathrm{w}\left(p_i\right) \phi\left(\frac{y_i-\Delta_i}{\sqrt{\sigma_i^2+\tau^2}}\right) / \sqrt{\sigma_i^2+\tau^2}}{\int_{-\infty}^{\infty} \mathrm{w}\left(p_i\right) \phi\left(\frac{Y_i-\Delta_i}{\sqrt{\sigma_i^2+\tau^2}}\right) / \sqrt{\sigma_i^2+\tau^2} d Y_i} \]
For example, Citkowicz and Vevea (2017) proposed a model using a weight function based on the Beta distribution, with two parameters \(a\) and \(b\): \(w(p_i) = p_i^{a - 1} \times (1 - p_i)^{b - 1}\).
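To get an intuition for the weight function, we can plot it for hypothetical parameter values (with \(a < 1\), small p-values receive more weight):
# Beta weight function w(p) = p^(a-1) * (1-p)^(b-1), hypothetical a and b
a <- 0.5; b <- 1
curve(x^(a - 1) * (1 - x)^(b - 1), from = 0.01, to = 1,
      xlab = "p-value", ylab = "w(p)", lwd = 2)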
In R we can use the metafor::selmodel() function to implement several types of selection models. For example, we can apply the Citkowicz and Vevea (2017) model:
sel_beta <- selmodel(fit, type = "beta")
#>
#> Random-Effects Model (k = 100; tau^2 estimator: ML)
#>
#> tau^2 (estimated amount of total heterogeneity): 0.0857 (SE = 0.0255)
#> tau (square root of estimated tau^2 value): 0.2928
#>
#> Test for Heterogeneity:
#> LRT(df = 1) = 293.8104, p-val < .0001
#>
#> Model Results:
#>
#> estimate se zval pval ci.lb ci.ub
#> 0.4847 0.1020 4.7521 <.0001 0.2848 0.6846 ***
#>
#> Test for Selection Model Parameters:
#> LRT(df = 2) = 14.3271, p-val = 0.0008
#>
#> Selection Model Results:
#>
#> estimate se zval pval ci.lb ci.ub
#> delta.1 0.8174 0.0700 -2.6075 0.0091 0.6802 0.9547 **
#> delta.2 0.7017 0.2009 -1.4846 0.1376 0.3080 1.0955
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(sel_beta)
Let's try the Beta selection model on data without publication bias. For more details on the available selection models, see the metafor::selmodel() documentation: https://wviechtb.github.io/metafor/reference/selmodel.html
Assessing, testing, and developing sophisticated models for publication bias is surely important and interesting. But as Wolfgang Viechtbauer (the author of metafor) said:
hopefully there won’t be need for these models in the future (Viechtbauer 2021)
The metafor website contains a lot of material, examples, and tutorials about meta-analysis models.