Replicability Crisis in Science?
Filippo Gambarota
8-10 July 2024
R is a free software environment for statistical computing and graphics.
CRAN is the repository where package developers upload their packages and from which other users can install them.
As the saying goes: if something exists, there is surely an R package for it! 😄
Rank | Language | Share | 1-Year Trend |
---|---|---|---|
1 | Python | 28.04% | 0.30% |
2 | Java | 15.78% | -1.30% |
3 | JavaScript | 9.27% | -0.20% |
4 | C# | 6.77% | -0.20% |
5 | C/C++ | 6.59% | 0.40% |
6 | PHP | 5.01% | -0.40% |
7 | R | 4.35% | 0.00% |
8 | TypeScript | 3.09% | 0.30% |
9 | Swift | 2.54% | 0.50% |
10 | Objective-C | 2.15% | 0.10% |
11 | Rust | 2.14% | 0.50% |
Its popularity is on a different scale compared to Python, but still increasing.

Python is a good alternative. Personally, I use and enjoy Python, but I do most of my work in R.

Python is a very general-purpose language, more powerful for general tasks. I find Python very useful for programming cognitive experiments, image processing, automating tasks, and interacting with the operating system.

R is still a little superior in terms of data manipulation and visualization, while Python is faster and more powerful for complex machine learning.
In computer science, functional programming is a programming paradigm where programs are constructed by applying and composing functions.
Although R can be used with both an imperative and an object-oriented approach, its functional side is quite powerful.

Fully fledged functional programming is quite complex. Here we simply refer to breaking down our code into small functions. These can be functions from packages, custom functions, or anonymous functions.

We have a dataset (`mtcars`) and we want to calculate the mean, median, minimum and maximum of each column and store the results in a table.
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The standard (~imperative) option is using a `for` loop: iterating through the columns, calculating the values, and storing them into another data structure.
ncols <- ncol(mtcars)
means <- medians <- mins <- maxs <- rep(0, ncols)
for(i in 1:ncols){
  means[i] <- mean(mtcars[[i]])
  medians[i] <- median(mtcars[[i]])
  mins[i] <- min(mtcars[[i]])
  maxs[i] <- max(mtcars[[i]])
}
results <- data.frame(means, medians, mins, maxs)
results$col <- names(mtcars)
results
#>          means medians   mins    maxs  col
#> 1    20.090625  19.200 10.400  33.900  mpg
#> 2     6.187500   6.000  4.000   8.000  cyl
#> 3   230.721875 196.300 71.100 472.000 disp
#> 4   146.687500 123.000 52.000 335.000   hp
#> 5     3.596563   3.695  2.760   4.930 drat
#> 6     3.217250   3.325  1.513   5.424   wt
#> 7    17.848750  17.710 14.500  22.900 qsec
#> 8     0.437500   0.000  0.000   1.000   vs
#> 9     0.406250   0.000  0.000   1.000   am
#> 10    3.687500   4.000  3.000   5.000 gear
#> 11    2.812500   2.000  1.000   8.000 carb
We can decompose (and simplify) the problem by writing a function and looping through the columns.
summ <- function(x){
data.frame(means = mean(x), medians = median(x), mins = min(x), maxs = max(x))
}
ncols <- ncol(mtcars)
dfs <- vector(mode = "list", length = ncols)
for(i in 1:ncols){
dfs[[i]] <- summ(mtcars[[i]])
}
results <- do.call(rbind, dfs)
results
#> means medians mins maxs
#> 1 20.090625 19.200 10.400 33.900
#> 2 6.187500 6.000 4.000 8.000
#> 3 230.721875 196.300 71.100 472.000
#> 4 146.687500 123.000 52.000 335.000
#> 5 3.596563 3.695 2.760 4.930
#> 6 3.217250 3.325 1.513 5.424
#> 7 17.848750 17.710 14.500 22.900
#> 8 0.437500 0.000 0.000 1.000
#> 9 0.406250 0.000 0.000 1.000
#> 10 3.687500 4.000 3.000 5.000
#> 11 2.812500 2.000 1.000 8.000
We can be even more minimalistic by removing the `for` loop and using the `*apply` family, which provides a series of compact iteration methods.
results <- lapply(mtcars, summ)
results <- do.call(rbind, results)
results
#> means medians mins maxs
#> mpg 20.090625 19.200 10.400 33.900
#> cyl 6.187500 6.000 4.000 8.000
#> disp 230.721875 196.300 71.100 472.000
#> hp 146.687500 123.000 52.000 335.000
#> drat 3.596563 3.695 2.760 4.930
#> wt 3.217250 3.325 1.513 5.424
#> qsec 17.848750 17.710 14.500 22.900
#> vs 0.437500 0.000 0.000 1.000
#> am 0.406250 0.000 0.000 1.000
#> gear 3.687500 4.000 3.000 5.000
#> carb 2.812500 2.000 1.000 8.000
*apply
The `*apply` family is one of the best tools in R. The idea is pretty simple: apply a function to each element of a list.

The powerful side is that in R everything can be considered a list: a vector is a list of single elements, a data frame is a list of columns, etc.

Internally, R still uses a `for` loop, but the verbose part (preallocation, choosing the iterator, indexing) is encapsulated into the `*apply` function.
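A minimal sketch of this idea, using the built-in `mtcars` data: the same call works on a data frame because a data frame is a list of columns.

```r
# a data frame is a list of columns: lapply iterates over them
lapply(mtcars[1:2], mean)

# sapply does the same but simplifies the result to a named vector
sapply(mtcars[1:2], mean)
```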
Are `for` loops bad?

`for` loops are the core of every operation in R (and in every programming language). For complex operations they are more readable and effective compared to `*apply`. In R we need extra care to write efficient `for` loops.
Extremely slow, no preallocation:

res <- c()
for(i in 1:1000){
  x <- i^2 # do something
  res[i] <- x # the vector grows at every iteration
}
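A sketch of the preallocated version: the result vector is created once with the right length, so nothing needs to grow inside the loop.

```r
res <- numeric(1000)  # preallocate the full result vector once
for(i in 1:1000){
  x <- i^2    # do something
  res[i] <- x # store into the already-allocated slot
}
```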
Very fast, with preallocation there is no difference compared to `*apply`.

Another advantage of wrapping the computation into the `summ` function: if you need to change something, you only need to change it once instead of touching the `for` loop.

The `tidyverse`
is a series of high-quality R packages to do modern data science:

- data manipulation (`dplyr`, `tidyr`)
- data visualization (`ggplot2`)
- literate programming (`rmarkdown`)
- string manipulation (`stringr`)
- functional programming (`purrr`)

One of the great improvements from the `tidyverse` is the usage of the pipe `%>%`, now introduced in base R as `|>`. You will see these symbols a lot when looking at modern R code.
The idea is very simple: the standard pattern to apply a function is `function(argument)`. The pipe reverses the pattern to `argument |> function()`. Normally, when we apply multiple functions in sequence, the pattern is this:
x <- rnorm(100)
x <- round(x, 3)
x <- abs(x)
x <- as.character(x)
When using the pipe, we remove the redundant assignment (`<-`) pattern:
x <- rnorm(100)
x |>
round(3) |>
abs() |>
as.character()
The pipe can be read as "from `x` apply `round`, then `abs`, etc.". The first argument of each piped function is assumed to be the result of the previous call.
The `tidy` approach contains tons of functions and packages. The overall philosophy is explored in depth in the R for Data Science book.
An honorable mention goes to `ggplot2` (https://ggplot2-book.org/), part of the `tidyverse`, an amazing package for data visualization following the piping and tidy approach. It is an implementation of the grammar of graphics idea.
library(tidyverse)
iris |>
mutate(wi = runif(n())) |>
ggplot(aes(x = Sepal.Length, y = Petal.Width, color = Species)) +
geom_point(aes(size = wi)) +
geom_smooth(method = "lm", se = FALSE) +
guides(size = "none") +
theme_minimal(15)
More verbose, more hard-coding, more steps and intermediate objects.
iris_l <- split(iris, iris$Species)
lms <- lapply(iris_l, function(x) lm(Petal.Width ~ Sepal.Length, data = x))
plot(iris$Sepal.Length, iris$Petal.Width, col = as.numeric(iris$Species), pch = 19)
abline(lms[[1]], col = 1, lwd = 2)
abline(lms[[2]], col = 2, lwd = 2)
abline(lms[[3]], col = 3, lwd = 2)
legend("topleft", legend = levels(iris$Species), fill = 1:3)
The `ggplot2` book (https://ggplot2-book.org/) is a great resource to produce high-quality, publication-ready plots. Clearly, the advantages of producing figures entirely by writing code are immense in terms of reusability and reproducibility.
Donald Knuth first defined literate programming as "a script, notebook, or computational document that contains an explanation of the program logic in a natural language, interspersed with snippets of macros and source code, which can be compiled and rerun".

For example, Jupyter notebooks, R Markdown, and now Quarto are literate programming frameworks that integrate code and text.
Beyond the coding part, the markup language is the core element of a literate programming framework. The idea of a markup language is to separate the rendered result from what you actually write. Some examples are:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
Lorem Ipsum is placeholder text used in the typesetting and printing industry. Lorem Ipsum has been the standard placeholder text since the sixteenth century, when an unknown printer took a galley of type and assembled it to make a type specimen book. It has survived not only more than five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of "Letraset" transfer sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker, which included versions of Lorem Ipsum.
<h2>My Second Heading</h2>
Lorem Ipsum is placeholder text used in the typesetting and printing industry.
Lorem Ipsum has been the standard placeholder text since the sixteenth century, when an unknown
printer took a galley of type and assembled it to make a type specimen book.
It has survived not only more than five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of
"Letraset" transfer sheets containing Lorem Ipsum passages, and
more recently with desktop publishing software like Aldus PageMaker, which included versions of Lorem Ipsum.
</body>
</html>
Markdown is one of the most popular markup languages for several reasons:

- easy to write and read compared to LaTeX and HTML
- easy to convert from Markdown to basically every other format using `pandoc`
- easy to implement new features

The Markdown source itself looks like this:
## Markdown
Markdown is one of the most popular markup languages for several reasons:
- easy to write and read compared to Latex and HTML
- easy to convert from Markdown to basically every other format using `pandoc`
- easy to implement new features
The source code itself, unlike LaTeX or HTML, can also be read directly and used to take notes: LaTeX and HTML need to be compiled, otherwise they are very hard to read.

MS Word is a WYSIWYG (what you see is what you get) editor that forces users to think about formatting, numbering, etc. Markup languages receive the content (plain text) plus a set of rules and create the final document.
Beyond the pure writing process, there are other aspects related to research data. In MS Word (or similar) we need to produce everything outside the document and then manually insert figures and tables.

Quarto (https://quarto.org/) is the evolution of R Markdown that integrates a programming language with the Markdown markup language. It is very simple but quite powerful.
Markdown can be learned in minutes. You can go to the following link https://quarto.org/docs/authoring/markdown-basics.html and try to understand the syntax.
The topic is extremely vast. You can do everything in Quarto: a website, a thesis, your CV, etc.
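As a sketch, this is what a minimal Quarto source file could look like (the title and the chunk content here are made up for illustration): a YAML header, Markdown text, and an executable R chunk in a single document.

````markdown
---
title: "My first Quarto document"
format: html
---

## A section

Some text with **bold** and `inline code`.

```{r}
mean(mtcars$mpg)
```
````

Rendering it with `quarto render` produces an HTML page where the chunk output is embedded next to the text.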
The basic idea is to track changes within a folder, assigning a message (and optionally a tag) to each version, obtaining a version history. The version history is completely navigable: you can go back to a previous version of the code.
There are advanced features like branches, for creating an independent version of the project to test new features, which can then be merged back into the main line of development.
The entire (local) Git project can be hosted on GitHub to improve collaboration. Other people or collaborators can clone the repository and push their changes to the project.
- `git init`: the folder is now a Git project, as you can notice from the hidden `.git` folder.
- `git add`: files are moved to the staging area; they are now ready to be committed, i.e. "written" into the Git history.
- `git commit -m "message"`: the staged files are written into the Git history.
Imagine putting everything onto a server with nice viewing options and advanced features. GitHub is just a hosting service for your `git` folder.

You can create an empty repository on GitHub named `git-test`. My repo now has the path `git@github.com:filippogambarota/git-test.git`.
Now our local repository is linked with the remote repository. Every time we do `git push`, our local commits are uploaded.

If you worked on the repository from another machine, or a colleague added some changes, you can do `git pull` and your local machine will be updated.
The repository git-test
is online and can be seen here filippogambarota/git-test.
And now let's see the result on GitHub:
There are a lot of resources online:
OSF is a free, open platform to support your research and enable collaboration.

It is a great tool to upload and share materials with others and collaborate on a project. Similarly to GitHub, you can track the changes made to a project.

The great addition is having a DOI, so the project is persistently available online and can be cited. It is now common practice to create an OSF project supporting a research paper and put a link to it within the paper, containing supplementary materials, raw data, scripts, etc.
It's very easy to create a new project; then you simply need to add files and share it. The project can be accessed (depending on its visibility) here: https://osf.io/yf9tg/.
An interesting feature is linking a GitHub repository to OSF: all changes made on GitHub (easier to manage) are mirrored into OSF. You can easily work on GitHub for the coding part and use OSF to upload other data or information and to assign a DOI to the project.

OSF is also linked to PsyArXiv (https://psyarxiv.com/), a popular preprint service, so you can link a preprint to an OSF project.

In general, I highly suggest the online book The Open Science Manual (https://arca-dpss.github.io/manual-open-science/), written by my friend Claudio Zandonella and Davide Massidda, where these and other topics are explained in detail: