Tools for reproducible research

Replicability Crisis in Science?

Filippo Gambarota

8-10 July 2024

Doing research is hard…

Doing research is hard…

  • you have to read papers, textbooks, slides and track information
  • you have to plan your experiment or research
  • you have to collect, organize and manage the research data
  • you have to analyze data, create figures and tables
  • you have to write reports, papers, slides, etc.

Doing research is hard…

Doing reproducible research is even harder… 😱

  • organize data in a sharable and comprehensible format
  • choose a future-proof place to share data along the research paper
  • analyze data using a reproducible framework: code, programming language, scripting
  • report data (papers, slides, etc.) using a reproducible framework

Reproducibility starter pack

Reproducibility starter pack 👷

  • A general purpose (or flexible enough) programming language such as or

  • A literate programming framework to integrate code and text

  • A version control system to track projects

  • An online repository to store the project and sharing with others

R Programming Language

R

R is a free software environment for statistical computing and graphics.

  • (TBH) Is not a proper general purpose programming language (such as C++ or Python).
  • New extensions (packages) allow R to do pretty everything (file manager, image processing, webscraping, etc.)
  • It is free and open-source
  • The community is extremely active and keep growing

R - CRAN

The CRAN is the repository where package developers upload their packages and other users can install them.

As the saying goes: if something exist, there is for sure an R package for doing it! 😄

R - PYPL Index

Source: https://pypl.github.io/PYPL.html
Rank Language Share 1-year.trend
1 Python 28.04% 0.30%
2 Java 15.78% -1.30%
3 JavaScript 9.27% -0.20%
4 C# 6.77% -0.20%
5 C/C++ 6.59% 0.40%
6 PHP 5.01% -0.40%
7 R 4.35% 0.00%
8 TypeScript 3.09% 0.30%
9 Swift 2.54% 0.50%
10 Objective-C 2.15% 0.10%
11 Rust 2.14% 0.50%

R - PYPL Index

The popularity is on a different scale compared to Python but still increasing:

R or Python?

  • Python is a good alternative. Personally, I use and enjoy python but I do most of my work in R.

  • Python is a very general-purpose language more powerful for general tasks.

  • I find python very useful for programming cognitive experiments, image processing, automatizing tasks and interacting with the operating system

  • R is still a little bit superior in terms of data manipulation and visualization. Python is faster and more powerful for complex machine learning.

Modern R

  • For purist programmers, R is somehow weird: arrays starts with 1, object-oriented programming is hidden, a lot of built-in vectorized functions, etc. See The R Inferno book for an overview.
  • Despite the weirdness, R have the majority of common programming paradigms (functions, conditionals, iterations, etc.).
  • The scope of this week is not providing an introduction to R but I would put the focus on two topics for a modern usage of R:
    • functional programming
    • the tidy approach

Functional Programming

In computer science, functional programming is a programming paradigm where programs are constructed by applying and composing functions.

Despite R can be used both with an imperative and object-oriented approach, the functional side is quite powerful.

Actually, functional programming is quite complex. Here we refer to breaking down our code into small functions. These functions can be function from packages, custom or anonymous functions.

Functional Programming, example…

We have a dataset (mtcars) and we want to calculate the mean, median, standard deviation, minimum and maximum of each column and store the result in a table.

head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
#> 'data.frame':    32 obs. of  11 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ disp: num  160 160 108 258 360 ...
#>  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ qsec: num  16.5 17 18.6 19.4 17 ...
#>  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#>  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Functional Programming

The standard (~imperative) option is using a for loop, iterating through columns, calculate the values and store into another data structure.

ncols <- ncol(mtcars)
means <- medians <- mins <- maxs <- rep(0, ncols)

for(i in 1:ncols){
  means[i] <- mean(mtcars[[i]])
  medians[i] <- mean(mtcars[[i]])
  mins[i] <- mean(mtcars[[i]])
  maxs[i] <- mean(mtcars[[i]])
}

results <- data.frame(means, medians, mins, maxs)
results$col <- names(mtcars)

results
#>         means    medians       mins       maxs  col
#> 1   20.090625  20.090625  20.090625  20.090625  mpg
#> 2    6.187500   6.187500   6.187500   6.187500  cyl
#> 3  230.721875 230.721875 230.721875 230.721875 disp
#> 4  146.687500 146.687500 146.687500 146.687500   hp
#> 5    3.596563   3.596563   3.596563   3.596563 drat
#> 6    3.217250   3.217250   3.217250   3.217250   wt
#> 7   17.848750  17.848750  17.848750  17.848750 qsec
#> 8    0.437500   0.437500   0.437500   0.437500   vs
#> 9    0.406250   0.406250   0.406250   0.406250   am
#> 10   3.687500   3.687500   3.687500   3.687500 gear
#> 11   2.812500   2.812500   2.812500   2.812500 carb

Functional Programming

We can decompose (and symplify the problem) by writing a function and looping through columns.

summ <- function(x){
  data.frame(means = mean(x), medians = median(x), mins = min(x), maxs = max(x))
}
ncols <- ncol(mtcars)
dfs <- vector(mode = "list", length = ncols)

for(i in 1:ncols){
  dfs[[i]] <- summ(mtcars[[i]])
}

results <- do.call(rbind, dfs)
results
#>         means medians   mins    maxs
#> 1   20.090625  19.200 10.400  33.900
#> 2    6.187500   6.000  4.000   8.000
#> 3  230.721875 196.300 71.100 472.000
#> 4  146.687500 123.000 52.000 335.000
#> 5    3.596563   3.695  2.760   4.930
#> 6    3.217250   3.325  1.513   5.424
#> 7   17.848750  17.710 14.500  22.900
#> 8    0.437500   0.000  0.000   1.000
#> 9    0.406250   0.000  0.000   1.000
#> 10   3.687500   4.000  3.000   5.000
#> 11   2.812500   2.000  1.000   8.000

Functional Programming

We can be even more minimalistic by removing the for loop and using the *apply family that provide a series of compact iterative method.

results <- lapply(mtcars, summ)
results <- do.call(rbind, results)
results
#>           means medians   mins    maxs
#> mpg   20.090625  19.200 10.400  33.900
#> cyl    6.187500   6.000  4.000   8.000
#> disp 230.721875 196.300 71.100 472.000
#> hp   146.687500 123.000 52.000 335.000
#> drat   3.596563   3.695  2.760   4.930
#> wt     3.217250   3.325  1.513   5.424
#> qsec  17.848750  17.710 14.500  22.900
#> vs     0.437500   0.000  0.000   1.000
#> am     0.406250   0.000  0.000   1.000
#> gear   3.687500   4.000  3.000   5.000
#> carb   2.812500   2.000  1.000   8.000

Functional Programming, *apply

The *apply family is one of the best tool in R. The idea is pretty simple: apply a function to each element of a list.

The powerful side is that in R everything can be considered as a list. A vector is a list of single elements, a dataframe is a list of columns etc.

Internally, R is still using a for loop but the verbose part (preallocation, choosing the iterator, indexing) is encapsulated into the *apply function.

means <- rep(0, ncol(mtcars))
for(i in 1:length(means)){
  means[i] <- mean(mtcars[[i]])
}

# the same with sapply
means <- sapply(mtcars, mean)

for loops are bad?

for loops are the core of each operation in R (and in every programming language). For complex operation thery are more readable and effective compared to *apply. In R we need extra care for writing efficent for loops.

Extremely slow, no preallocation:

res <- c()
for(i in 1:1000){
  # do something
  res[i] <- x
}

Very fast, no difference compared to *apply

res <- rep(0, 1000)
for(i in 1:length(res)){
  # do something
  res[i] <- x
}

Why functional programming?

  • We can write less and reusable code that can be shared
  • The scripts are more compact, less errors prone and more flexible (imagine that you want to improve the summ function, you only need to change it once instead of touching the for loop)
  • Functions can be easily and consistently documented (see roxygen documentation) improving the reproducibility and clarity of the code

More about functional programming in R

Advanced RHands-on Programming With R

The Tidy approach

The tidyverse is a series of high-quality R packages to do modern data science:

  • data manipulation (dplyr, tidyr)
  • plotting (ggplot2)
  • reporting (rmarkdown)
  • string manipulation (stringr)
  • functionals (purrr)

The Tidy approach - Pipes

One of the great improvement from the tidyverse is the usage of the pipe %>% now introduced in base R as |>. You will se these symbols a lot when looking at modern R code.

The idea is very simple, the standard pattern to apply a function is function(argument). The pipe can reverse the pattern as argument |> function(). Normally when we apply multiple functions progressively the pattern is this:

x <- rnorm(100)
x <- round(x, 3)
x <- abs(x)
x <- as.character(x)

The Tidy approach - Pipes

When using the pipe, we remove the redundand assignment <- pattern:

x <- rnorm(100)
x |>
  round(3) |>
  abs() |>
  as.character()

The pipe can be read as “from x apply round, then abs, etc.”. The first argument of the piped function is assumed to be the result of the previus call.

More about the Tidy approach

The tidy approach contains tons of functions and packages. The overall philosopgy can be deepen in the R for Data Science book.

https://r4ds.hadley.nz/

ggplot2

Only an honour mention to ggplot2 https://ggplot2-book.org/ (part of the tidyverse) that is an amazing package for data visualization following the piping and tidy approach. Is the implementation of the grammar of graphics idea.

library(tidyverse)
iris |>
  mutate(wi = runif(n())) |>
  ggplot(aes(x = Sepal.Length, y = Petal.Width, color = Species)) +
  geom_point(aes(size = wi)) +
  geom_smooth(method = "lm", se = FALSE)
  guides(size = "none") +
  theme_minimal(15)

More verbose, more hard coding, more steps and intermidiate objects.

iris_l <- split(iris, iris$Species)
lms <- lapply(iris_l, function(x) lm(Petal.Width ~ Sepal.Length, data = x))

plot(iris$Sepal.Length, iris$Petal.Width, col = as.numeric(iris$Species), pch = 19)
abline(lms[[1]], col = 1, lwd = 2)
abline(lms[[2]], col = 2, lwd = 2)
abline(lms[[3]], col = 3, lwd = 2)
legend("topleft", legend = levels(iris$Species), fill = 1:3)

ggplot2

The ggplot2 book https://ggplot2-book.org/ is a great resource to produce high-quality, publication ready plots. Clearly, the advantage of producing the figures entirely writing code are immense in terms of reusability and reproducibility.

Literate Programming

Literate Programming1

Donald Knuth first defined literate programming as a script, notebook, or computational document that contains an explanation of the program logic in a natural language, interspersed with snippets of macros and source code, which can be compiled and rerun

For example jupyter notebooks, R Markdown and now Quarto are literate programming frameworks to integrate code and text.

Heading

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor

Heading...

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor

Lorem ipsum dolor sit amet,...
Plot Code
Plot Code
Markup
Markup
Markup
Markup
...
...
...
...
Text is not SVG - cannot display

Literate Programming, the markup language

Beyond the coding part, the markup language is the core element of a literate programming framework. The idea of a markup language is separating the result from what you actually write. Some examples are:

  • LaTeX
  • HTML
  • Markdown
  • XML

LaTeX 1

HTML

<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

Lorem Ipsum è un testo segnaposto utilizzato nel settore della tipografia e della stampa. Lorem Ipsum è considerato il testo segnaposto standard sin dal sedicesimo secolo, quando un anonimo tipografo prese una cassetta di caratteri e li assemblò per preparare un testo campione. È sopravvissuto non solo a più di cinque secoli, ma anche al passaggio alla videoimpaginazione, pervenendoci sostanzialmente inalterato. Fu reso popolare, negli anni ’60, con la diffusione dei fogli di caratteri trasferibili “Letraset”, che contenevano passaggi del Lorem Ipsum, e più recentemente da software di impaginazione come Aldus PageMaker, che includeva versioni del Lorem Ipsum.

<h2>My Second Heading</h2>

Lorem Ipsum è un testo segnaposto utilizzato nel settore della tipografia e della stampa. 

Lorem Ipsum è considerato il testo segnaposto standard sin dal sedicesimo secolo, quando un anonimo 

tipografo prese una cassetta di caratteri e li assemblò per preparare un testo campione. 

È sopravvissuto non solo a più di cinque secoli, ma anche al passaggio alla videoimpaginazione, pervenendoci sostanzialmente inalterato. 

Fu reso popolare, negli anni ’60, con la diffusione dei 

fogli di caratteri trasferibili “Letraset”, che contenevano passaggi del Lorem Ipsum

più recentemente da software di impaginazione come Aldus PageMaker, che includeva versioni del Lorem Ipsum.

</body>
</html>

Markdown1

Markdown

Markdown is one of the most popular markup languages for several reasons:

  • easy to write and read compared to Latex and HTML
  • easy to convert from Markdown to basically every other format using pandoc
  • easy to implement new features

Markdown (source code)

## Markdown

Markdown is one of the most popular markup languages for several reasons:

- easy to write and read compared to Latex and HTML
- easy to convert from Markdown to basically every other format using `pandoc`
- easy to implement new features

Also the source code can be used, compared to Latex or HTML, to take notes and read. Latex and HTML need to be compiled otherwise they are very hard to read.

What’s wrong about Microsoft Word?

MS Word is a WYSIWYG (what you see is what you get editor) that force users to think about formatting, numbering, etc. Markup languages receive the content (plain text) and the rules and creates the final document.

What’s wrong about Microsoft Word?

Beyond the pure writing process, there are other aspects related to research data.

  • writing math formulas
  • reporting statistics in the text
  • producing tables
  • producing plots

In MS Word (or similar) we need to produce everything outside and then manually put figures and tables.

The solution… Quarto

Quarto (https://quarto.org/) is the evolution of R Markdown that integrate a programming language with the Markdown markup language. It is very simple but quite powerful.

Basic Markdown

Markdown can be learned in minutes. You can go to the following link https://quarto.org/docs/authoring/markdown-basics.html and try to understand the syntax.

Let’s see a practical example!

More about Quarto and R Markdown

The topic is extremely vast. You can do everything in Quarto, a website, thesis, your CV, etc.

Git and Github

Git and Github

The basic idea is to track changes within a folder, assign a message and eventually a tag to a specific version obtaining a version hystory. The version history is completely navigable, you can go back to a previous version of the code.

The are advanced features like branches for creating an independent version of the project to test new features and then merge into the main streamline.

The entire (local) Git project can be hosted on Github to improve collaboration. Other people or collaborators can clone the repository and push their changes to the project.

Veeeery basic Git workflow

Veeeery basic Git workflow

  • After installing Git, you can start a new repository opening a terminal on a folder and typing git init. The folder is now a git project you can notice by the hidden .git folder.
cd ~/some/folder
git init
  • Then you can add files to the staging area. Basically these files are ready to be committed i.e. “written” in the Git history.
git add file1.txt
# git add . # add everyting
  • Finally you can commit the modified version of the file using git commit -m message
git commit -m "my first amazing commit"
  • you can see the Git hystory with all your commits:
git log

Github

Imagine to put everyting into a server with nice viewing options and advanced features. Github is just an hosting service for your git folder.

You can create an empty repository on Github named git-test. Now my repo has the path git@github.com:filippogambarota/git-test.git.

git remote add origin git@github.com:filippogambarota/git-test.git
git push

Now our local repository is linked with the remote repository. Every time we do git push our local commits will be uploaded.

If you worked on the repository from another machine or a colleague add some changes, you can do git pull and your local machine will be updated.

The repository git-test is online and can be seen here filippogambarota/git-test.

Github

An now let’s see on Github the result:

More about Git and Github

There are a lot of resources online:

  • The Open Science Manual - Zandonella and Massidda - Git and Github chapters.
  • https://agripongit.vincenttunru.com/
  • https://git-scm.com/docs/gittutorial

Open Science Framework

Open Science Framework

OSF is a free, open platform to support your research and enable collaboration.

Is a great tool to upload and share materials with others and collaborate on a project. Similarly to Github you can track the changes made to a project.

The great addition is having a DOI thus the project is persistently online and can be cited.

It is now common practice to create a OSF project supporting a research paper and put the link within the paper containing supplementary materials, raw data, scripts etc.

Open Science Framework

It’s very easy to create a new project, then you simply need to add files and share it.

The project can be accessed here (depending on the visibility) https://osf.io/yf9tg/.

Open Science Framework

OSF and Github

An interesting feature is linking a Github repository to OSF. Now all changes made on Github (easier to manage) are mirrored into OSF. You can easily work in Github for the coding part and use OSF to upload other data or information and to assign a DOI to the project.

Preprints

OSF is also linked to a popular service for preprints called PsyArXiv https://psyarxiv.com/ thus you can link a preprint to an OSF project.

More on OSF

  • https://help.osf.io/article/342-getting-started-on-the-osf
  • https://arca-dpss.github.io/manual-open-science/osf-chapter.html

More on reproducibility

In general, I highly suggest the online book The Open Science Manual https://arca-dpss.github.io/manual-open-science/ written by my friend Claudio Zandonella and Davide Massidda where these and other topics are explained in details: