Chapter 3: Open Data in Neuroscience

Authors: Mar Barrantes-Cepas, Eva van Heese; Reviewers: Janneke Lemmerzaal

Data Definition

First, what is data? Neuroscience data encompasses everything you use or produce during your research life: datasets, software, code, workflows, models, figures, tables, articles, and more. It includes information from brain imaging techniques (for instance, MRI, PET, EEG, MEG), cognitive and behavioural assessments, and cellular, histological, and molecular evaluations of brain samples from humans or animals. All this research and clinical data follows a lifecycle that involves data creation, use, publication and sharing, archiving, and re-use or destruction.

The scope of the present chapter includes data organisation and sharing; everything related to software, code, and workflows is explained in Chapter 4.

Personal data

Personal data in neuroscience refers to an individual’s information that can identify the person, e.g. a date of birth, a raw MRI scan, or a social security number. The use of personal data has significant implications for privacy and ethical considerations. Personal data is not always as evident as we think: DICOM headers might contain the person’s name or date of birth, and a person’s face can easily be identified from a structural MRI scan if it has not been defaced (see software and tools in Chapter 4). As researchers, we must ensure responsible handling of personal data by properly anonymising it before storing or sharing it, following the applicable ethical protocols. If you’re allowed to share data, you may also consider sharing metadata or documentation as part of your transparency efforts. We advise you to consult with the legal department of your institution before sharing anything.
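
As an illustration, the sketch below uses the pydicom package (assuming it is installed) to blank out a few directly identifying DICOM header fields; the input file name is hypothetical, and proper anonymisation should rely on dedicated, validated tools that cover many more tags.

```python
import pydicom

# Minimal sketch: blank out a few directly identifying header fields.
# Real anonymisation needs validated tools and covers many more tags.
ds = pydicom.dcmread("scan.dcm")  # hypothetical input file

for keyword in ("PatientName", "PatientBirthDate", "PatientID"):
    if keyword in ds:
        setattr(ds, keyword, "")  # overwrite with an empty value

ds.save_as("scan_anonymised.dcm")
```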

Documentation and Metadata

For datasets

Data that comes without context and cannot be understood is of no use. Metadata and documentation, however, can provide provenance and context to datasets. Therefore, you should ensure that any dataset, but especially an open dataset, includes consistent metadata and documentation describing how the data was collected. Metadata is often intended for machine reading, but some forms are human-readable. Basic metadata can be described at the level of the dataset (title, study summary, keywords, contact information, licence) and of data elements (variable name, description, unit, modification date). Advanced metadata may describe the frequency of updates, temporal coverage (cross-sectional, longitudinal, follow-up duration), and spatial coverage (e.g. geographic area).
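
To make this concrete, here is a minimal sketch of dataset-level and element-level metadata written to a JSON file; all field names and values are illustrative, not a formal standard (BIDS, discussed below, defines its own).

```python
import json

# Illustrative dataset-level and element-level metadata;
# field names and values below are hypothetical examples.
metadata = {
    "title": "Resting-state fMRI in healthy adults",
    "summary": "Cross-sectional 3T MRI study of 50 participants.",
    "keywords": ["fMRI", "resting-state", "healthy adults"],
    "contact": "data-manager@example.org",  # hypothetical contact
    "licence": "CC BY 4.0",
    "variables": [
        {"name": "age", "description": "Age at scan", "unit": "years"},
    ],
}

with open("dataset_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```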

For data processing and analysis

For data processing and analysis, keeping organised and descriptive documentation of the step-by-step process is key. While everyone generally does this in their own manner, the open data and open scripting movement has set some standards for improving documentation for yourself and others. A couple of essential elements to include in your documentation are:

  • a summary of the general steps (e.g. from raw data to outcome)
    • for example in a flowchart
  • the software, automated pipelines, scripts or other tools applied
    • specify version, platform, and other environmental details (see the sketch after this list)
    • specify decisions made (manual settings)
    • for code: provide examples from your actual code
    • correctly cite these tools and their documentation
  • troubleshooting: issues you encountered and their solutions
    • describe common mistakes
    • describe steps that require a cautious approach
    • cite solutions or workarounds provided online
  • quality assessment
    • which checks should and could be performed?
    • how does good/bad data look? provide examples (images!)
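
For the environmental details mentioned above, a minimal sketch of recording them automatically alongside your outputs is shown below; the output file name and the numpy example are hypothetical stand-ins for whatever your analysis actually uses.

```python
import platform
import sys

# Minimal sketch: record the environment used for an analysis so the
# exact setup can be reported and reproduced later.
with open("environment_log.txt", "w") as f:
    f.write(f"Python: {sys.version}\n")
    f.write(f"Platform: {platform.platform()}\n")
    # Record versions of the key packages you used (example below).
    import numpy
    f.write(f"numpy: {numpy.__version__}\n")
```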

Documentation will look different for each analysis and can take many formats: a (shared) document, a spreadsheet, or an environment that allows files and code to be incorporated (for example GitHub). What matters is that the documentation can be understood by anyone (with the expected background knowledge) and used as a guide to perform the analysis described.

Creative Commons and data licensing

Licences provide a standardised way for others to use your creative work under copyright legislation. The different Creative Commons (CC) licences describe what the public is permitted to do with your work. The licence types are formed by combinations of sharing principles:

  • BY: credit must be given to the creator
  • NC: only non-commercial uses of the work are permitted
  • SA: adaptations must be shared under the same terms
  • ND: no derivatives or adaptations of the work are permitted

Together, they result in six licence types (CC BY, CC BY-SA, CC BY-NC, CC BY-NC-SA, CC BY-ND, CC BY-NC-ND). A seventh licence type (CC Zero or CC0) lets the creator give up their copyright, allowing anyone to adapt, build upon, and distribute the original work without any conditions. You can consult the online tool Licence Chooser from CC if you struggle to choose a licence for your work. Interested to read more about CC licences? Check out this page.

Data Management Plan

How to manage your data is usually described in a Data Management Plan (DMP). It provides information on six main topics: (i) roles and responsibilities, (ii) the type and size of the data collected and the documentation or metadata generated, (iii) where the data is stored and which backup procedures are in place, (iv) preservation of the research outputs after the project, (v) re-use of your research outputs by others, and (vi) costs (see more details at Turing Way).

Before creating your own DMP, we recommend consulting with your institution to see if they have a specific template for it. If you are from the Amsterdam UMC, please check their page. Moreover, we recommend creating a checklist or roadmap before starting your research, listing the steps needed to align your project with open science principles and the requirements to check at each step. For example, can everyone on your team, with minimal prior knowledge, understand how the data is organised and where to find it? More information can be found at Turing Way or, in the case of Amsterdam UMC, on their page.

FAIR principles

It is recommended to follow the FAIR principles, which aim to improve the Findability, Accessibility, Interoperability, and Reusability of data. Making data FAIR is easier if you plan for it from the beginning. Note that making data FAIR is not the same as making it open.

Keep in mind that data should be as open as possible and as closed as necessary.

Accessible means that there is a procedure in place to access the data. This can benefit the sharing of data or methods within your own group or department, allowing existing pipelines to be re-used without much effort spent on finding the data.

Data Storage and Organisation

Storing your data correctly is important to prevent data loss, which happens more often than we would like. To avoid data loss, pick a suitable storage system and back up your data frequently. Your institution will usually provide information on how to store your data; find out which storage systems your institution uses and how to back up your data correctly. If your data protection rules allow it, you can consider cloud storage, or you can encrypt your data before storing it.

At the same time, organising data in a meaningful and FAIR way can be challenging, especially at the beginning of a new project, when you are not yet sure which intermediate files will be produced. Use a clear folder structure to ensure you can find your files.

In the following paragraphs, we divide our recommendations depending on the kind of neuroscience data you mainly work with: brain imaging (MRI, PET, EEG, MEG), cognitive, behavioural, cellular, histological, or molecular data.

Brain Imaging Data Structure

The Brain Imaging Data Structure (BIDS) is a community-driven consensus on how to organise and share data obtained in neuroimaging experiments (everybody can become part of the community). Before this consensus, a lot of time was wasted on rearranging data or rewriting scripts to fit each dataset’s particular layout. More information about the guidelines and specifications can be found at the [BIDS webpage](https://bids.neuroimaging.io/).

BIDS project folder structure

Briefly, each project has a main folder containing a sourcedata, a rawdata, and a derivatives folder, where different types of data are stored. Sourcedata holds data before any kind of conversion, reconstruction, and/or harmonisation; rawdata contains the data converted to NIfTI and JSON format; and derivatives contains all files derived from your analysis.

Figure 3.1 - Structure of the project folder organised according to BIDS format. Image from [1].

Inside sourcedata and rawdata, there should be a folder for each subject of the study named sub-SUBID, where SUBID is the code or identifier of that particular participant, and a .tsv file containing the information of our dataset. Inside each subject folder, a subfolder per session is expected in the case of longitudinal data. We would recommend adding a session folder even in cross-sectional studies: you never know if the study will become longitudinal later on! Inside the session folder (if it exists) or the subject folder, there should be a subfolder for each modality, e.g. anat, func, dwi, containing anatomical, functional, diffusion, or other types of data, respectively. See more information about the specifications of different modalities.
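
To illustrate, the sketch below creates such a skeleton for one subject and one session using Python's standard library; the project name and subject/session labels are hypothetical, while the folder names follow the BIDS layout described above.

```python
from pathlib import Path

# Minimal sketch: create a BIDS-style folder skeleton for one subject
# with one session and three modality subfolders.
root = Path("my_study")  # hypothetical project folder

for top in ("sourcedata", "rawdata", "derivatives"):
    (root / top).mkdir(parents=True, exist_ok=True)

for modality in ("anat", "func", "dwi"):
    subdir = root / "rawdata" / "sub-001" / "ses-01" / modality
    subdir.mkdir(parents=True, exist_ok=True)
```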

BIDS file naming structure

Finally, the files MUST be named in a certain way to be machine-readable. There are three main types of files (or extensions): .json files containing metadata, .tsv files containing tables of metadata, and raw data images (with .jpg or .nii.gz). All file names follow a similar structure consisting of keys, the corresponding value for each key, a suffix, and, finally, the extension. Keys are always paired with values; some pairs are mandatory, for instance the subject identifier (the key is sub- and the value is the corresponding SUBJID), while others are recommended but not mandatory. Suffixes are mandatory and indicate the kind of data. For a given suffix, some entities are required. More information about the specifications for each kind of data can be found here.
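
As a small illustration of this key-value structure, the sketch below composes a valid BIDS-style file name from hypothetical entity values:

```python
# Minimal sketch: compose a BIDS-style file name from key-value pairs,
# a suffix, and an extension (values are hypothetical).
entities = {"sub": "001", "ses": "01", "task": "rest"}
suffix, extension = "bold", ".nii.gz"

name = "_".join(f"{key}-{value}" for key, value in entities.items())
name = f"{name}_{suffix}{extension}"
print(name)  # sub-001_ses-01_task-rest_bold.nii.gz
```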

Figure 3.2 - Structure of the file name following the BIDS format. Image from [2]

If at this point you are lost and don’t know where to start, we recommend checking the BIDS starter kit and consulting experienced people in the field or the BIDS community.

BIDS converter and validation tools

There are tools that convert DICOM data directly into BIDS format. From our personal experience, we recommend BIDScoin and dcm2bids. However, there are more options that can be used for converting your source data into BIDS format. Find more about these tools at https://bids.neuroimaging.io/benefits.html#software-currently-supporting-bids.

Furthermore, a validator is available to check whether your data is correctly organised, called the BIDS validator. Although this tool is useful, it can sometimes be discouraging because it is not up to date with the latest recommendations. If you are struggling with the validator, please check with someone more experienced or contact the BIDS community directly.
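
If you prefer checking file names programmatically, a Python companion package called bids_validator exposes part of the validator’s naming logic; a minimal sketch, assuming the package is installed and using a hypothetical path:

```python
from bids_validator import BIDSValidator

# Minimal sketch: check whether a file path (relative to the dataset
# root) follows BIDS naming rules; this checks names, not contents.
validator = BIDSValidator()
path = "/sub-001/ses-01/anat/sub-001_ses-01_T1w.nii.gz"
print(validator.is_bids(path))  # True if the name is BIDS-compliant
```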

BIDS derivatives

Coming soon

BIDS citations

Be nice and don’t forget to cite BIDS in your study if you are using it!

Clinical data management

For clinical, demographic, and behavioural data, accurate and meticulous data management is essential to create a high-quality database for statistical analysis. Procedures to ensure high-quality standards include database design, data entry, data annotation, data validation, discrepancy management, and database locking. A review article highlights the processes and recommended tools for clinical data management (Krishnankutty et al., 2012). Common software for Electronic Data Capture (EDC) includes: REDCap, Castor, Greenlight Guru Clinical, Medidata Rave, and Clinion.

Spreadsheets and documents are widely used for various purposes, including collecting, storing, manipulating, analysing, and documenting research data. However, it’s important to exercise caution, as improper use can lead to significant errors in workflows. Our recommendation is to follow the Turing Way.
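
One way to reduce such errors is to keep the spreadsheet as plain data and do checks and manipulations in code; the sketch below, with hypothetical file and column names, loads a table with an explicit type and runs two basic sanity checks using the pandas package (assuming it is installed).

```python
import pandas as pd

# Minimal sketch: load tabular clinical data with an explicit type and
# run basic sanity checks instead of editing the spreadsheet by hand.
df = pd.read_csv("participants.csv", dtype={"participant_id": str})

assert df["participant_id"].is_unique, "Duplicate participant IDs"
print(df.isna().sum())  # overview of missing values per column
```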

BIDS in Microscopic Data

Coming soon

Big Genetic Datasets

Coming soon

Work in progress

For some data types, there is no established standard yet. Efforts are ongoing in various fields to develop standardised approaches for data storage and sharing. While there is still progress to be made, we recommend reaching out to colleagues engaged in similar work or data. Collaborating and sharing experiences can provide valuable insights into effective data storage practices.

Sharing your own data

Besides sharing your results in a manuscript, it is becoming good practice to also share your methods (scripts, pipelines) and even your data (raw or processed) with the world. This way, others can apply your methods to their own data, or even replicate or further investigate your results in the original dataset. We’ve already discussed the sharing of analysis documentation earlier in this chapter. Several platforms exist to facilitate the sharing of raw and processed data.

DANS: Data Archiving and Networked Services (Dans-KNAW)

DANS is a service offered by the Dutch national centre of expertise and repository for research data, which supports researchers in making their data available for reuse. DANS is not specific to neuroscience but ensures sustainable access to research data in fields ranging from the social sciences to the life and physical sciences. DANS currently offers more than 200,000 datasets in its repository. It is possible to either deposit data or search for data in their field-specific data stations. When depositing data, you sign a deposit agreement stating the terms of the licences for reuse. There are two access categories: open access (accessible to everyone, no login required) and restricted access (users may view and download data only with your prior consent). After publication by DANS, the dataset receives a persistent identifier and will be permanently archived and accessible to others. When downloading a dataset from the repository, users have to accept a data use agreement and choose a licence (Figure 3.3).

Figure 3.3 - DANS repository: choosing a licence.

Data curation

Data curation means, in other words, “cleaning up your data so that it can be used for any purpose”. After data collection, the process of data curation can be broken down into several steps:

  • Appraisal: select the appropriate data
    • enter, digitise, transcribe, check, validate, clean, and (pseudo)anonymise the data
  • Disposal: discard data based on the appraisal step
    • archive, transfer ownership of, or destroy the data
  • Description: Write documentation and metadata, choose a licence
  • Preservation and access: Keep the dataset and its descriptions in a data repository based on the sharing restrictions of the data (restricted/open access)
  • Transformation: Process and analyse the data to generate new information

More information on these data curation steps and tips to help you curate your data can be found here.

Using Open Datasets

Besides repositories such as DANS - where researchers can share their datasets collected for specific projects and research questions - there are some large open datasets out there that are collected for the purpose of being used by anyone interested. These datasets generally contain a wide range of variables (demographic, clinical, lab, neuroimaging) that are less specific, but collected in many participants. Some can be accessed for free; others require a membership whose costs can range from a couple of hundred to ten thousand euros.

Wikipedia hosts an extensive list of available open datasets related to neuroscience here. We summarised a couple of well-known open datasets below:

| Platform | Links | Description |
| --- | --- | --- |
| OpenNeuro | Main page: https://openneuro.org/ · Documentation: https://docs.openneuro.org/ · Search engine: https://openneuro.org/search | A free and open platform for validating and sharing BIDS-compliant MRI, PET, MEG, EEG, and iEEG data. Contains 994 public datasets from 40,098 participants. |
| ADNI: Alzheimer’s Disease Neuroimaging Initiative | https://adni.loni.usc.edu/ | Shares Alzheimer’s research data with the world (MRI, PET, genetics, cognitive tests, CSF and blood biomarkers). Contains three large datasets of MCI, AD, and elderly controls. |
| HCP: Human Connectome Project | http://www.humanconnectomeproject.org/ | Provides a compilation of neural data, an interface to graphically navigate this data, and the opportunity to reach conclusions about the living human brain that were never before possible. |
| UK Biobank | https://www.ukbiobank.ac.uk/ | A uniquely powerful biomedical database that can be accessed globally by approved researchers. Contains de-identified data from half a million UK Biobank participants, with neuroimaging, genetic, clinical, cognitive, and biological sample data. |
| More coming soon | | |

References

Krishnankutty, B., Bellary, S., Kumar, N. B., & Moodahadu, L. S. (2012). Data management in clinical research: An overview. Indian Journal of Pharmacology, 44(2), 168-172.

BIDS Specification: Common principles (source vs. raw vs. derived data). https://bids-specification.readthedocs.io/en/stable/common-principles.html#source-vs-raw-vs-derived-data

Gorgolewski, K. J., Auer, T., Calhoun, V. D., Craddock, R. C., Das, S., Duff, E. P., … & Poldrack, R. A. (2016). The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific Data, 3(1), 1-9.