Authors: Mar Barrantes-Cepas, Eva van Heese, Eva Koderman, Diana Bocancea, Lucas Baudouin; Reviewers: Chris Vriend
In this chapter, we’ll show you practical tools and software to help make your neuroscience research more reproducible. By using scripts instead of graphical user interfaces, adopting open-source software, and applying version control, you’ll not only make your work easier to manage, but also ensure that others can replicate your findings and easily collaborate with you. In this way you can follow the FAIR principles for improving Findability, Accessibility, Interoperability, and Reusability (see Chapter 3 for more details on the FAIR principles).
Reproducibility is at the core of good science – it helps move the field forward by making sure discoveries can be verified and expanded upon. Increasing reproducibility in neuroscience can be approached in two ways: ‘top-down’, where institutions reshape incentives and frameworks (i.e. changes at the meso-level, see here), and ‘bottom-up’, where individual researchers adopt better practices. While both approaches are necessary, this chapter focuses on the bottom-up approach, providing you with the tools and guidelines to enhance your work.
Although sharing data in clinical neuroscience can be challenging due to privacy, legal concerns, and logistical barriers, this shouldn’t be an excuse not to share your code and materials. By developing your work with this in mind, you can help promote transparency and reproducibility and ensure that your code is shareable and accessible to others.
You might already know much of what will be discussed in the following section. Should that be the case, you can browse through the headers to double-check your knowledge! Otherwise, here is a summary of some concepts you need for best coding practices.
A Graphical User Interface (GUI) is a digital interface that allows users to interact with graphical elements such as icons, buttons, and menus (e.g., SPSS or MATLAB). GUIs are user-friendly because they provide intuitive visual cues for navigation and task execution. However, they are less effective for reproducibility, as it can be challenging to track or recall the exact steps and parameters used during an analysis if you don’t note them down somewhere. Scripts, in contrast, provide more flexibility, can optimize compute efficiency through job parallelization, and require less manual work, resulting in greater control over your data.
To address these issues, it is advisable to use scripts for your methodology. Scripts provide a record of all actions taken and parameters used, making it easier to reproduce and share your work with others. Fortunately, many software packages and pipelines also offer the option to execute commands directly through a terminal. For instance, if properly installed, FSL commands can run from the terminal. To learn more about this, consult the log files or documentation specific to the tool you are using.
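For example, if FSL is properly installed, a short Bash script can run the same brain-extraction step for every subject and record every parameter along the way. This is only a sketch: the folder layout and the threshold value are illustrative assumptions, so adapt them to your own data.

```bash
#!/usr/bin/env bash
# Illustrative sketch: run FSL's brain extraction (bet) on every subject's T1w image.
# The BIDS-like folder layout and the -f threshold are assumptions; adapt them.
set -euo pipefail

for t1 in data/sub-*/anat/*_T1w.nii.gz; do
    output="${t1%.nii.gz}_brain.nii.gz"
    echo "Brain-extracting ${t1}"
    bet "$t1" "$output" -f 0.5   # -f sets the fractional intensity threshold
done
```

Because every step and parameter is written down in the script, anyone (including future you) can rerun exactly the same processing later, which is precisely what a GUI does not give you for free.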
In programming, just as in everyday life, a wide array of languages are available for writing your scripts—more than you might imagine! Check out the [List of programming languages - Wikipedia](https://en.wikipedia.org/wiki/List_of_programming_languages). The most commonly used languages for data analysis in neuroscience are Bash, C++, Python, MATLAB, and R. An additional language that can help with your publication manuscript is LaTeX. The choice of language often depends on your personal preferences and the specific needs of your project. In this section, we outline the main differences between these languages, discuss Open Science-related considerations, and offer tips for maximizing the benefits of each.
Bash is excellent for automating command-line tasks and system administration. It enables you to execute and automate terminal commands and call various tools through scripts.
C++ is used for computationally intensive projects; many command-line neuroimaging tools are written in it because of its performance.
Python, MATLAB, and R are high-level languages, meaning they are easy to use, understand, portable, and independent from specific hardware. Python is a versatile and user-friendly programming language, making it an ideal choice for data analysis. R is designed specifically for statistical analysis and data visualisation, making it popular among statisticians and data scientists. MATLAB excels in numerical computation and visualisation but requires a paid licence.
LaTeX is not a data-analysis tool, but it can still help you prepare your manuscript for publication, and some journals even offer their own LaTeX templates. It is especially useful when your manuscript contains formulas, graphs that are still in the making, or pieces of code, since it lets you add all of these neatly without spending too much time hunting for the correct character. Editors such as Overleaf or Visual Studio Code enable you to use it. Some extra tools that will enhance your experience with LaTeX are Detexify, which helps you find the LaTeX command for a symbol when you don’t know its name, and a [tables converter](https://www.tablesgenerator.com/) into LaTeX format.
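As a small illustration of why this is convenient, here is a minimal LaTeX document containing a formula; the equation shown is just an example.

```latex
\documentclass{article}
\usepackage{amsmath}  % extended math support

\begin{document}
The general linear model used in many neuroimaging analyses can be written as
\begin{equation}
  Y = X\beta + \varepsilon,
\end{equation}
where $Y$ is the data, $X$ the design matrix, $\beta$ the model parameters,
and $\varepsilon$ the residual error.
\end{document}
```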
An extra thing to consider when choosing your programming language is the carbon footprint of your code. High-level languages like Python tend to consume more energy and take more time to run than compiled languages like C.
But there’s more to consider! Besides programming languages, you’ll also need to manage libraries.
Libraries are collections of pre-written code that extend the functionality of a programming language, simplifying complex tasks. Just as programming languages have different versions, libraries can also have multiple versions due to updates and bug fixes. When multiple people work on the same code (see below - Version control), it is important to use consistent versions of programming languages and libraries across the team. In addition, different versions of a library may introduce, change, or remove functions, so a specific function might only work with a particular version due to compatibility requirements. Software tools can act as GUIs to simplify data analysis; however, such tools can be built on a programming language that is not open source (e.g., MATLAB), in which case the tools themselves are not fully open source either.
Virtual environments and containers are tools used in software development to create isolated and controlled environments for running applications and managing dependencies.
A virtual environment in Python is an isolated environment that allows you to install and manage dependencies for a specific project without affecting the global Python installation or other projects. It helps ensure that each project can have its own dependencies and versions, avoiding conflicts between projects.
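As a minimal sketch (the package names are just examples), creating and using a virtual environment from the terminal looks like this:

```bash
# create an isolated environment inside the project folder
python3 -m venv .venv

# activate it (Linux/macOS; on Windows use .venv\Scripts\activate)
source .venv/bin/activate

# install the project's dependencies into this environment only
pip install numpy pandas nibabel

# record the exact versions so collaborators can recreate the environment
pip freeze > requirements.txt

# later, on another machine: pip install -r requirements.txt
```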
A container is a much more comprehensive tool: it isolates not just the programming environment but the entire software environment, including the operating system, system libraries, runtime, and application code, making it more versatile for deploying and running consistent environments across different systems. Containers offer several advantages.
Containerised software is particularly useful in neuroscience research because it guarantees that processing pipelines run reliably and uniformly across different computing environments without researchers worrying about variations in software dependencies or system configurations, for example in collaborations between different institutes. This consistency is crucial for reproducibility in research.
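As a hedged sketch of what this looks like in practice, the commands below pull a containerised BIDS app (MRIQC is used as an example) with Apptainer (formerly Singularity) and run it on a BIDS dataset. The paths and the version tag are illustrative assumptions; check the tool’s own documentation for current instructions.

```bash
# pull a specific, citable version of the MRIQC container image from Docker Hub
# (the version tag is illustrative; check the MRIQC releases for the current one)
apptainer pull mriqc.sif docker://nipreps/mriqc:24.0.2

# run the container on a BIDS dataset; --bind makes local folders visible inside it
apptainer run --bind /path/to/bids:/data:ro --bind /path/to/output:/out \
    mriqc.sif /data /out participant --participant-label 01
```

Because the container image is versioned, collaborators at another institute can pull exactly the same image and obtain the same software environment.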
Useful open-source tools within science also include LibreOffice and Inkscape. LibreOffice is a free and open-source alternative to Microsoft Office applications like Word, PowerPoint, Excel, and Access. It offers similar functionalities for document creation, presentations, spreadsheets, and database management. For poster creation or data visualisation, you can opt for Inkscape. It is a free and open-source vector graphics editor that is widely used for creating and editing scalable vector graphics (SVG) files.
To make your project as open sciency as possible, we provide a few tips:
This section offers guidance on optimising version control and annotation practices. It covers best practices for streamlining version control, how to integrate them within your team, and the ideal workflow to adopt for maximum efficiency.
When working on a script, it is important to annotate your code. Annotation is essential to make code understandable, discoverable, citable, and reusable. Check out Chapter 3 to obtain a better general understanding of code annotation. More specific to code annotation, it is important to keep in mind the following:
To help get you started, you can check out these [script templates](https://github.com/marbarrantescepas/script-templates), guidance, and examples. The Black formatter (with an online playground at black.vercel.app) is also useful for formatting your Python code (and making it beautiful!).
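As an example of the kind of header annotation such templates encourage, a script could start with a block like the one below; the script name, inputs, and details are of course made up.

```bash
#!/usr/bin/env bash
# ----------------------------------------------------------------------
# extract_hippocampal_volumes.sh                  (hypothetical example)
#
# Purpose : collect hippocampal volumes from FreeSurfer output into a
#           single CSV file for statistical analysis
# Usage   : ./extract_hippocampal_volumes.sh <subjects_dir> <output.csv>
# Inputs  : FreeSurfer SUBJECTS_DIR with completed recon-all runs
# Output  : CSV with one row per subject (ID, left volume, right volume)
# Author  : Your Name (your_email@example.com)
# Requires: FreeSurfer >= 7, bash >= 4
# ----------------------------------------------------------------------
```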
Version control is a method used to document and manage changes to a file or collection of files over time. It allows you and your collaborators to monitor the history of revisions, review modifications, and revert to previous versions when necessary. This is useful, especially when working together on a script. The most prevalent version control system that can help with that is Git.
Git is a version control system that tracks file changes. This can be helpful when working on your own scripts, as well as for the coordination of work among multiple people on a project. GitHub and GitLab are web-based platforms that host Git repositories, along with additional features like issue tracking, code reviews, and continuous integration. The main difference between them is that GitHub is more focused on open-source collaboration and has a large user community, while GitLab offers more built-in tools and is known for its flexibility in deployment options, including self-hosting. Both of them allow the creation of private and public repositories.
:tulip: If you want to learn more about pros and cons and the current status of Git(-related) tools at Amsterdam UMC, please check this link. Check with your (co)supervisors about the best option to use or if they already have an account for the group. If the account hasn’t been created yet, take the initiative and set it up yourself by following the simple instructions below.
To create a GitHub account, link it with Git on your local machine, and verify the connection, follow these steps:
1. Create a free account at github.com and make sure Git is installed on your machine.
2. Tell Git who you are, so your commits are linked to your GitHub account:
git config --global user.name "YourGitHubUsername"
git config --global user.email "your_email@example.com"
3. Generate an SSH key (if you want to authenticate using SSH, recommended for security):
ssh-keygen -t ed25519 -C "your_email@example.com"
4. Print the public key and add it to your GitHub account under Settings > SSH and GPG keys:
cat ~/.ssh/id_ed25519.pub
5. Test the connection:
ssh -T git@github.com
If successful, you’ll see a message like: Hi username! You've successfully authenticated.
Now Git is linked to your GitHub account, and you can push, pull, and collaborate on projects directly from your local machine. Not sure what these terms mean? Check below!
Once you have a good feel for the GitHub lingo, version control should be easy peasy. Here are some basics on the terminology and a quick tutorial to help get you started, along with a visual representation of a GitHub workflow.
The main branch (previously called the master branch) is where your code lives as the main character. All other branches are created for the development of a specific feature (you can think of them as side quests). Once the feature is complete and the code is fully tested and functional, you merge it back into the main branch. Continue this process until all feature development is complete.
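In command-line terms, a typical feature-branch cycle looks roughly like this (the branch and file names are only examples):

```bash
git checkout main                 # start from the up-to-date main branch
git pull                          # fetch the latest changes from GitHub/GitLab
git checkout -b feature/qc-plots  # create and switch to a new feature branch

# ...edit your scripts...
git add qc_plots.sh               # stage the changed file
git commit -m "Add QC plotting script"
git push -u origin feature/qc-plots   # publish the branch and open a pull request

# after review and approval, the branch is merged back into main
```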
Before creating a new Git repository and linking it to GitHub or GitLab, it is EXTREMELY important to make a .gitignore file. Without it, all files in your project, including potentially sensitive or personal data, will be tracked and uploaded by default. This could lead to unintentional exposure of data that cannot be shared due to privacy regulations.
To avoid this, make sure to create a .gitignore file listing all the file types and folders you want to exclude from version control. You can find more detailed guidance in the Ignoring files - GitHub Docs and browse Some common .gitignore configurations for examples.
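A minimal .gitignore for a clinical neuroimaging project might look like the sketch below; the patterns are examples, so adapt them to the file types you actually work with.

```bash
# create a .gitignore in the repository root (the patterns are examples)
cat > .gitignore <<'EOF'
# imaging data and other potentially identifiable files
*.nii
*.nii.gz
*.dcm
data/
# tabular files that may contain participant information
*.csv
*.xlsx
# environments, caches, and editor clutter
.venv/
__pycache__/
.DS_Store
EOF
```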
A README is a text file that introduces and explains a repository because no one can read your mind (yet). If you want to learn how to properly create a README file check: Make a README (they also provide templates!)
A CONTRIBUTING.md file is a document placed in the root of a project that provides clear guidelines for anyone who wants to contribute to the project. It explains the different ways people can help, such as reporting bugs, suggesting features, improving documentation, or submitting code, and outlines the steps for setting up the project locally, following coding standards, and submitting pull requests. Including this file helps create a welcoming environment, sets expectations, and makes collaboration easier and more organized.
To learn more about licences and licensing, check Chapter 3.
A general code workflow should include several iterations of peer review and end with the scripts being uploaded on GitHub. Peer review ensures that the code is correct and functions well, while the publication of scripts online ensures these are shared with the wider scientific community and improves the reproducibility. In scripts, correctness refers to the code’s ability to produce the intended results accurately according to specified requirements. In contrast, reproducibility ensures that these results can be consistently obtained by different users or in different environments when the same code is run with identical inputs. While correctness confirms that the code functions as intended, reproducibility guarantees that the outcomes can be reliably replicated, which is essential for validating research findings.
The general workflow of code review that ensures correctness and reproducibility is summarized in the figure below:
As you might notice, there should always be a code owner and a code reviewer who have separate tasks.
For the code owner:
For the reviewer: Please confirm that the code is understandable and well-documented. It is not your job to rewrite the code for the code owner or to test the code’s functionality. If you have to spend too much time on it, send it back to the owner with your remarks and ask them to improve it before your final revision.
Another good coding practice, in addition to code review, is code testing. Code-based testing encompasses various methods to ensure software reliability and quality. Techniques like unit and integration testing help developers validate their code: unit testing checks individual components of the code for correctness, while integration testing ensures that combined components work together as expected.
A key element of code testing is building a solid foundation of tests that cover different scenarios and edge cases. These tests serve as a safety net, offering ongoing feedback on code functionality. By thoroughly testing, developers can catch and fix issues early, reducing the time and effort needed for debugging and maintenance.
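Tests do not have to be elaborate; even a small Bash script that checks one helper against a known answer is useful. The helper script and the expected value below are hypothetical, used only to show the idea.

```bash
#!/usr/bin/env bash
# Toy unit test: check that a (hypothetical) helper extracts the subject ID
# from a BIDS-style file path. Exit non-zero on failure so automation notices.
set -u

expected="sub-01"
actual=$(./get_subject_id.sh "data/sub-01/anat/sub-01_T1w.nii.gz")

if [[ "$actual" == "$expected" ]]; then
    echo "PASS: subject ID extracted correctly"
else
    echo "FAIL: expected '$expected', got '$actual'" >&2
    exit 1
fi
```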
Ensuring the quality of your data is a crucial step in minimising errors and avoiding mistakes that could affect the validity of your research. This applies to all types of data in neuroscience, not just large datasets like neuroimaging, but also to more basic data such as demographics, behavioural scores, or clinical outcomes. Even seemingly simple data can contain errors that may go unnoticed without careful inspection. It’s important to invest time in performing sanity checks, validating your data, and identifying potential errors early on.
As emphasized in this chapter, adopting ‘bottom-up’ practices, like using scripts, version control, and code review/code test, can help you create more reproducible workflows which can significantly increase the reliability of your findings. Quality assessment is at the core of these practices - helping you catch errors, identify inconsistencies, and ensure that your data is solid, facilitating transparency and collaboration. Cleaning and assessing your data thoroughly can prevent small issues from snowballing into larger problems down the line. By prioritising data quality control (QC) at critical steps during your analysis, you set the foundation for reliable and reproducible research. Below we guide you through practical approaches to assess data quality, from visualizing distributions to performing neuroimaging checks, while highlighting open-source tools that make these tasks more efficient and accessible.
Descriptive statistics and data visualisation are critical first steps in assessing the quality and distribution of your dataset. By visualizing your data, you can quickly identify outliers, assess distributions, and spot inconsistencies that might not be obvious from the raw values. While the exact procedures are dataset and modality specific, here are some general guidelines and examples for visualizing data for quality checks, along with the tools to implement them:
There are various open-source tools that provide functionalities to easily check and visualize your data:
Tool | Language | Purpose |
---|---|---|
OpenRefine | N/A | Cleaning and validating structured datasets, especially for demographics or clinical data, as it allows for easy spotting and correction of inconsistencies |
Pandas | Python | Data manipulation and summary statistics |
Matplotlib / Seaborn | Python | Tools for creating basic (Matplotlib) and advanced (Seaborn) visualizations |
Plotly | Python | Creating interactive and dynamic visualizations |
Pandas profiling | Python | Automates the generation of detailed data reports, useful for data exploration |
dplyr | R | Simplifies data manipulation and allows for descriptive statistics calculation |
ggplot2 | R | High-quality visualization tool widely used for creating publication-ready plots |
janitor | R | Cleaning messy data, removing empty rows, renaming columns, and identifying duplicates |
DataExplorer | R | Generating comprehensive data reports to identify missing data, outliers, and distributions |
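Even before opening one of these tools, a quick command-line sanity check can reveal obvious problems. For example, the sketch below counts empty cells per column in a comma-separated file; the filename is an example, and it assumes a simple CSV with a header row and no quoted commas.

```bash
# count empty cells per column in a CSV with a header row
awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
    { for (i = 1; i <= NF; i++) if ($i == "") missing[i]++ }
    END { for (i = 1; i <= NF; i++) printf "%s: %d missing\n", name[i], missing[i] + 0 }
' participants.csv
```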
Quality control of structural neuroimaging data is important as errors in brain segmentation can lead to inaccurate volume/thickness estimates. Visual inspection is still the gold standard, but as dataset sizes grow, this approach becomes more time-consuming. Thus, automated QC tools are becoming more necessary. Researchers are actively developing tools to handle QC across different imaging modalities, and while many tools are still in development, there are some noteworthy options already available. Automated QC methods are an ongoing area of development, and it’s essential to stay updated on new tools that can improve the QC process in neuroimaging.
Software/Tool | Purpose | Openness? |
---|---|---|
MRIQC | Open-source tool designed to perform automated quality control on MRI datasets, providing reports on data quality across various MRI metrics. | fully open |
QSIPREP | Quality control and processing of diffusion-weighted MRI data. | fully open |
ENIGMA QC Protocols | Protocols and guidelines defined by the ENIGMA consortium for performing visual quality checks on segmented MRI data from Freesurfer. | fully open |
SPM - Matlab | Offers visualization tools for quality checking segmentation outputs. Extensions like CAT12 provide additional QC functionalities. | Tool itself is open, but requires Matlab (paid licence) |
SPM - Python | SPM tool described as above but translated to Python and made fully accessible. | fully open |
fMRIPrep | While primarily used for preprocessing, fMRIPrep includes built-in QC features that help flag problematic scans in fMRI datasets, offering both visual reports and metrics for each scan. | fully open |
FSQC | Open-source tool designed to perform quality assessment of FreeSurfer outputs. | fully open |
This section is coming soon
To anonymize and deface brain scan data (typically MRI or CT scans in DICOM or NIfTI formats), several well-established tools are used in neuroimaging research. These tools help remove or obscure facial features and metadata that could be used to identify participants, which is essential for complying with privacy regulations. Theyers and colleagues (2021) found that the defacing algorithms listed below vary in how well they remove facial features; in their analysis, afni_refacer and pydeface had the highest accuracy rates. Keep this in mind when choosing your own defacing tool (a minimal usage example follows the table below)!
Tool Name | Description & Link | Openly Accessible |
---|---|---|
pydeface | Defaces NIfTI MRI scans using a pre-trained face mask | ✅ Yes |
mri_deface | Part of FreeSurfer; uses anatomical templates to remove facial features | ✅ Yes (FreeSurfer required) |
fsl_deface | FSL tool to deface T1-weighted images using probabilistic masks | ✅ Yes (FSL license) |
dcmodify | DCMTK command-line tool to modify or anonymize DICOM headers | ✅ Yes (Source-available) |
dcm2niix | Converts DICOM to NIfTI and can remove private metadata | ✅ Yes |
dicom-anonymizer | Java-based GUI tool for anonymizing DICOM files | ✅ Yes |
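As a minimal illustration of how such a tool is used from the command line, here is a hedged pydeface example; the filename is made up, and you should check the pydeface documentation for current options and default output names.

```bash
# deface a T1-weighted image; pydeface writes a *_defaced.nii.gz copy by default
pydeface sub-01_T1w.nii.gz

# or specify the output name explicitly
pydeface sub-01_T1w.nii.gz --outfile sub-01_T1w_defaced.nii.gz
```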
Within the neuroimaging field of neuroscience, many tools are open source; some require a free licence that can be requested. We’ll review a selection of frequently applied software and tools below (an example command follows the table):
Software/Tool | Type of Data | Purpose | Openness? | Recommended Tutorials and Resources |
---|---|---|---|---|
FSL | Several types of MR images | Process or view images from a variety of modalities | fully open | FSL Course - YouTube |
FreeSurfer | Anatomical T1w/T2w images | Perform cortical, subcortical and subfield parcellation | free licence required (request here) | Introduction to FreeSurfer - YouTube |
fMRIPrep | Functional images | Perform fMRI preprocessing steps | fully open | How to Use fMRIPrep - YouTube |
ANTs | Several types of MR images | Perform registration, segmentation, and brain extraction of different images | fully open | Andy’s Brain Book - Advanced Normalization Tools (ANTs) |
CONN | Functional images | Perform functional connectivity analysis | fully open | conn-toolbox - YouTube |
DIPY | Diffusion images | Perform analysis on diffusion images | fully open (Python toolbox) | |
EEGLAB | EEG and other signal data | Perform signal analysis (from preprocessing to statistical analysis) | Tool itself is open, but requires Matlab (paid licence) | EEGLAB - YouTube |
NeuroKit2 | EEG and other signal data | Perform signal analysis (from preprocessing to statistical analysis) | fully open (Python toolbox) | |
MNE-Python | EEG signal processing | Perform signal analysis (from preprocessing to statistical analysis) | fully open (Python toolbox) | MNE YouTube tutorial |
PyNets | Structural and functional connectomes | Perform sampling and analysis for individual structural and functional connectomes | fully open (Python toolbox) | |
Spinal Cord Toolbox | Anatomical T1w/T2(*)weighted images | Perform automated spinal cord segmentation | fully open | Tutorials - Spinal Cord Toolbox documentation |
SPM | Time-series (fMRI, PET, SPECT, EEG, MEG) | Construction and assessment of spatially extended statistical processes used to test hypotheses about functional imaging data | Tool itself is open, but requires Matlab (paid licence) | Andy’s Brain Book - SPM Overview |
DSI-Studio | Diffusion images | Perform analysis on diffusion images | fully open | DSI Studio Workshop - YouTube |
MRtrix3 | Diffusion images | Perform analysis on diffusion images | fully open | Diffusion Analysis with MRtrix - YouTube |
CAT12 | Anatomical T1w/T2w images | Perform diverse morphometric analyses such as VBM, SBM, DBM, RBM | Tool itself is open, but requires Matlab (paid licence) & SPM | Andy’s Brain Book - VBM in CAT12 |
QSIPREP | Diffusion images | BIDS-compatible preprocessing pipeline that standardizes and automates processing of diffusion MRI data, including denoising, motion correction, and reconstruction to prepare for analysis | fully open | |
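Most of these packages are driven from the command line. As a hedged example, a full FreeSurfer anatomical reconstruction for a single subject can be launched with one command; the subject ID and paths are illustrative.

```bash
# run the full FreeSurfer cortical reconstruction pipeline for one subject
# (SUBJECTS_DIR must point to your FreeSurfer output folder)
export SUBJECTS_DIR=/path/to/freesurfer_subjects
recon-all -s sub-01 -i data/sub-01/anat/sub-01_T1w.nii.gz -all
```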
This section is coming soon