Data Organization
Motivation
5S methodology
“5S” (Wikipedia, 2024) is a workplace organisation method that uses a list of five Japanese words translated into English as: sort, set in order, shine, standardise and sustain. In the context of organising research data, ‘sort’ would refer to deleting unnecessary files. ‘Set in order’ would refer to developing and documenting naming conventions and folder structures. ‘Shine’ would refer to following conventions and developing routines. ‘Standardise’ would refer to documenting rules and responsibilities and developing best practices and standard operating procedures (SOPs). And ‘sustain’ would refer to regularly checking that rules are being followed and making improvements where necessary (Assmann et al., 2022).
Further resources on the 5S methodology
- The 5S Methodology in Research Data Management
- 5S Data: Setz dich auf deine 5 Buchstaben und organisiere deine Daten! (Coffee Lecture)
File naming
File naming convention
In order to maximise access to your data, to stay organised and to identify your files quickly, files and folders should be named in a meaningful and systematic way (LMA RDMWG, 2024a; Rehwald et al., 2022). A file naming convention provides a framework for naming your files and/or folders in a way that describes what they contain and how they relate to other files. This framework will help you, your future self, and others in a shared or collaborative group file-sharing environment to navigate your work more easily (LMA RDMWG, 2024a).
Thus, within your research group, we recommend (Biernacka et al., 2020; Bobrov et al., 2021; Bres et al., 2022) that you:
- Adopt a naming convention for files and folders.
- Document your file and folder naming convention.
- Stay consistent: the naming convention should be chosen in advance to ensure that it can be systematically followed and contains the same information (such as date and time) in the same order (e.g. yyyy-mm-dd) (Biernacka et al., 2020).
Criteria for a good naming convention
Avoid automatically generated names (e.g. from digital cameras) as they can lead to conflicting names due to repetition (Biernacka et al., 2020). A good naming convention produces file names that are human-readable, machine-readable, and play well with your system’s default ordering (Goldman, 2020a).
Human-readability
File names should be descriptive and provide just enough contextual information to establish a link to a particular experiment or data collection (Bobrov et al., 2021; LMA RDMWG, 2024a). To achieve this, you should choose names that reflect the content and are unique (Lindstädt et al., 2019).
Machine-readability
In some operating systems, long file paths can cause technical problems. File names should therefore be as long as necessary and as short as possible, so that they stay concise and readable on any operating system. It is recommended to limit file names to ≤ 32 characters (32CharactersLooksExactlyLikeThis.txt). Avoid using special characters (e.g. {}[]<>()* % # ‘ ; “ , : ? ! & @ $ ~), umlauts (ä, ö, ü, ß,…) or spaces (Biernacka et al., 2020). Periods should only be used in version numbers and before file extensions, and the extension assigned by the system (e.g. .ERL, .CSV, .TIF) should be preserved (Lindstädt et al., 2019). Instead of spaces, you can use underscores (_), hyphens (-), or CamelCase to make file names both human- and machine-readable (LMA RDMWG, 2024a).
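The rules above can be turned into a small automated check. The following is a minimal sketch, not a complete validator: the function name `check_file_name` is hypothetical, and it assumes that the 32-character limit applies to the name without its extension and that a version suffix has the form `_v1.0` or `_v1.0.0`.

```python
import re

MAX_LENGTH = 32  # recommended limit; assumed to apply to the name without its extension
# Letters, digits, underscores and hyphens only (no spaces, umlauts or special characters).
ALLOWED = re.compile(r"^[A-Za-z0-9_-]+$")
# Optional version suffix such as _v1.0 or _v1.2.5, the only place extra periods may appear.
VERSION = re.compile(r"_v\d+(\.\d+){0,2}$")


def check_file_name(file_name: str) -> list[str]:
    """Return a list of problems found in file_name (an empty list means none were found)."""
    problems = []
    stem, dot, _extension = file_name.rpartition(".")
    if not dot:
        # No period at all: the whole string is the name and the extension is missing.
        stem = file_name
        problems.append("missing file extension")
    if len(stem) > MAX_LENGTH:
        problems.append(f"name longer than {MAX_LENGTH} characters")
    # Strip a trailing version suffix before checking the remaining characters.
    core = VERSION.sub("", stem)
    if not ALLOWED.match(core):
        problems.append("contains spaces, periods, umlauts or special characters")
    return problems


print(check_file_name("USNM_379221_01.tiff"))
print(check_file_name("Test data 2016.xlsx"))
```

The first call reports no problems, while the second is flagged because of its spaces. The limit and the allowed characters are kept as module-level constants so that a research group can adapt them to its own documented convention.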
Play well with default ordering
The computer organises files by name, character by character. To browse your files easily, you should choose names that can be sorted alphabetically, numerically or chronologically to ensure that the files appear in a logical order. If you want chronological order, start with a date in ISO 8601 format (YYYY-MM-DD or YYYYMMDD) (Briney, 2020; Lindstädt et al., 2019). When using sequential numbering, use leading zeros: 01 to 10 for a sequence of 1 to 10, and 001, 010, 100 for a sequence of 1 to 100. Take scalability into account (e.g. if a two-digit file number is chosen, the number of files is limited to 99) (Biernacka et al., 2020).
Name components that are already part of the folder name do not have to be repeated in the file names (Biernacka et al., 2020). Also consider the system under which the file is stored for later access and retrieval of the data.
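To illustrate why ISO 8601 dates and leading zeros matter for default ordering, here is a short sketch. The helper `build_file_name` and the label "Plate" are hypothetical; the padding width is derived from the expected total number of files.

```python
from datetime import date


def build_file_name(day: date, label: str, number: int, total: int, extension: str) -> str:
    """Combine an ISO 8601 date, a short label and a zero-padded sequence number."""
    width = len(str(total))  # e.g. 3 digits when up to 100 files are expected
    return f"{day.isoformat()}_{label}_{number:0{width}d}.{extension}"


names = [build_file_name(date(2024, 3, 5), "Plate", n, 100, "tiff") for n in (10, 2, 100)]

# With leading zeros, sorting by name matches the numerical order.
print(sorted(names))

# Without leading zeros, character-by-character sorting puts 100 before 2.
print(sorted(["Plate_10.tiff", "Plate_2.tiff", "Plate_100.tiff"]))
```

Passing the expected total up front is one way to address scalability: the width is fixed before the first file is named, so later files do not break the order.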
Examples of file names
Below are some examples of file names that are human-readable (if you know the code/abbreviations), machine-readable, and properly sortable (Bres et al., 2022):
- 2016-01-04_ProjectA_Ex1Test1_SmithE_v1.0.xlsx
- 2000_USNM_379221_01.tiff
- USNM_379221_01.tiff
Here are some examples of file names that need improvement (Bres et al., 2022):
- Test data 2016.xlsx
- Meeting notes Jan 17.doc
- Notes Eric.txt
Tools for simultaneous renaming of files
Multiple OS
Linux
Mac
Unix
- mv command
Windows
- Advanced Renamer
- Altap Salamander
- Ant Renamer
- Bulk Rename Utility
- ExifToolGUI
- Rename-It!
- Total Commander
- WildRename
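Where a scripted approach is preferred over the tools listed above, simultaneous renaming can also be sketched with the Python standard library. This is an illustrative sketch only: the function names `plan_renames` and `apply_renames` and the example prefix are hypothetical, and the new names follow the prefix, zero-padded counter and original extension pattern as an assumption.

```python
from pathlib import Path


def plan_renames(folder: Path, prefix: str) -> list[tuple[Path, Path]]:
    """Pair every file in folder with a new name: prefix, zero-padded counter, original extension."""
    files = sorted(p for p in folder.iterdir() if p.is_file())
    width = max(2, len(str(len(files))))  # at least two digits, more if needed
    return [
        (old, old.with_name(f"{prefix}_{index:0{width}d}{old.suffix}"))
        for index, old in enumerate(files, start=1)
    ]


def apply_renames(plan: list[tuple[Path, Path]]) -> None:
    """Rename the files, refusing to overwrite anything that already exists."""
    for old, new in plan:
        if old == new:
            continue  # already follows the convention
        if new.exists():
            raise FileExistsError(f"{new} already exists")
        old.rename(new)


# Usage: inspect the plan first, then apply it once it looks right.
# plan = plan_renames(Path("Raw_data"), "2024-03-05_Plate")
# for old, new in plan:
#     print(old.name, "->", new.name)
# apply_renames(plan)
```

Separating the plan from its execution gives a dry run: the mapping can be reviewed, or saved as documentation of the original names, before any file is touched.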
Further resources on file naming
- File naming examples (Table 1)
- Information and steps for creating naming conventions
- Information about file naming
- Information on File Naming and Folder Hierarchy
- Information and examples for microscopy data
- File Naming Convention Worksheet
- Worksheet for Naming and Organizing Files
- Checklist for File Naming Conventions
- A detailed documentation of a File Naming Convention
File versioning
Versioning or version control is the practice of tracking and managing changes to a file or set of files over time so that you can later retrieve specific versions.
We recommend that you meet with project partners to decide how versioning will be carried out, how version changes will be documented, and how a version change will be defined (Bres et al., 2022).
Purpose and use of versioning
Versioning helps you to keep a complete long-term change history of each file by tracking, tracing and annotating your steps (i.e. changes made to the file(s)), and it allows you to go back one step. It also lets you keep multiple versions of each file and create new versions of the same file - or even new results - by incorporating new data and/or changes to a file’s structure; this is particularly important in the case of software, where versioning also supports debugging. Overall, versioning makes your research easier to understand (Biernacka et al., 2020; Bres et al., 2022; Di Russo, 2020).
Version control methods and tools
Versioning can be done in the file name (see semantic versioning below), in the data (e.g. in the header or a column for comments), in a text file (e.g. in a README file), or using a version control system (VCS). A VCS is a software tool that helps to manage changes to one or more files over time. Examples of VCSs include Git (with hosting platforms such as Bitbucket, GitHub and GitLab) and Apache Subversion (Git, n.d.). For collaborative document and storage locations (e.g. wiki, Google Docs, cloud), versioning is available in situ, i.e. within the document or storage location and in real time (Biernacka et al., 2020).
Apply versioning methods
Manual file versioning can be done using semantic versioning. You can do this by adding a “v” to the end of each file name, followed by a maximum of three numbers separated by periods (note that these are the only periods allowed in a file name other than the one before the extension). The first number is called MAJOR and indicates important changes. The second number is called MINOR and indicates less drastic changes. The third number is called PATCH, and is mainly used by software developers to indicate bug fixes, but could also be used when fixing typos. The general pattern and some examples of semantic versioning look like this (Bobrov et al., 2021; Bres et al., 2022):
Filename_vMAJOR.MINOR.PATCH.FileExtension
Ex1Test1_SmithE_v1.0.0.xlsx
Ex1Test1_SmithE_v1.2.5.xlsx
Ex1Test1_SmithE_v2.1.1.xlsx
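The naming pattern above can be parsed and incremented programmatically. The sketch below uses hypothetical helper names (`parse_version`, `bump_version`) and assumes that all three numbers are present. Resetting the lower numbers to zero when a higher one is incremented follows the general semantic versioning convention and is an assumption beyond what the text states.

```python
import re

# Matches Filename_vMAJOR.MINOR.PATCH.FileExtension
VERSIONED = re.compile(
    r"^(?P<stem>.+)_v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\.(?P<ext>[^.]+)$"
)


def parse_version(file_name: str) -> tuple[str, tuple[int, int, int], str]:
    """Split a versioned file name into its stem, (major, minor, patch) and extension."""
    match = VERSIONED.match(file_name)
    if match is None:
        raise ValueError(f"no _vMAJOR.MINOR.PATCH suffix in {file_name!r}")
    version = (int(match["major"]), int(match["minor"]), int(match["patch"]))
    return match["stem"], version, match["ext"]


def bump_version(file_name: str, part: str) -> str:
    """Return the file name of the next version; part is 'major', 'minor' or 'patch'."""
    stem, (major, minor, patch), ext = parse_version(file_name)
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown part: {part!r}")
    return f"{stem}_v{major}.{minor}.{patch}.{ext}"


print(bump_version("Ex1Test1_SmithE_v1.2.5.xlsx", "minor"))
```

Comparing the parsed number tuples rather than the raw file names avoids a sorting pitfall: as plain text, v1.10.0 would sort before v1.9.0.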
If you decide to use manual file versioning, it is recommended that you use a version control table (a template is available from the University of Sydney Library). It is also recommended that you assign responsibilities for completing files, store milestone versions, and store obsolete versions separately after backup. How many versions of a file will be kept, which versions (e.g. major versions instead of minor versions (version 2.0 but not 2.1)), for how long, and how the versions will be organised need to be decided in advance (Biernacka et al., 2020), ideally with project partners.
Folder structure
To make it easier to find files, especially if you have a lot of data, you should avoid a chaotic or alphabetical approach to storing data. Instead, use a proper folder structure: a hierarchical arrangement in which folders are created to make it easier to find data (Biernacka et al., 2020). A typical hierarchical folder structure has a root folder and several levels of subfolders. A carefully planned folder structure, with understandable folder names and an intuitive design, is the foundation of good data organisation. The folder structure provides an overview of what information can be found where, enabling both current and future contributors to understand what files have been produced in the project (Mičetić et al., n.d.).
General characteristics of an efficient folder structure
An efficient folder structure allows “someone”, perhaps your future self, to look at your files and immediately understand in detail what you have done and why (Goldman, 2020a). Therefore you should choose a folder structure that is hierarchical, clear, comprehensive, efficient and conclusive (Bres et al., 2022; Bres et al., 2023). To make it clear and comprehensive for other team members, make sure the structure is self-explanatory and has intuitive navigation (Biernacka et al., 2020; Bobrov et al., 2021; Bres et al., 2022). Short, meaningful folder names that follow a comprehensive naming convention make browsing a folder structure more efficient (Assmann et al., 2022; RDM Guide, n.d.). Sometimes it is a good idea to number the folders to ensure that they work well with the system’s default order (Assmann et al., 2022). For clarity, the folder structure should be identical on servers and local devices (Biernacka et al., 2020).
There is no one-size-fits-all solution: the optimal folder structure depends on the specific project requirements (Bres et al., 2023). To make the structure easy to browse, do not make it too deep: use a maximum of 3 to 4 levels (Bres et al., 2022; Voigt et al., 2022). In addition, folders that contain too many items make it difficult to find the right file, so limit each folder to a maximum of 10 items (Bres et al., 2022).
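The depth and size limits mentioned above can be checked automatically. The following sketch uses a hypothetical function `audit_structure`; it counts levels below the project root, which is an assumption since the text does not say whether the root itself counts as a level.

```python
import os
from pathlib import Path


def audit_structure(root: Path, max_depth: int = 4, max_items: int = 10) -> list[str]:
    """Report folders that are nested too deeply or contain too many items."""
    warnings = []
    for current, folders, files in os.walk(root):
        relative = Path(current).relative_to(root)
        depth = len(relative.parts)  # 0 for the root folder itself
        if depth > max_depth:
            warnings.append(f"{relative.as_posix()}: deeper than {max_depth} levels")
        if len(folders) + len(files) > max_items:
            warnings.append(f"{relative.as_posix()}: more than {max_items} items")
    return warnings


# Usage (the folder name is a placeholder):
# for warning in audit_structure(Path("Project")):
#     print(warning)
```

Both limits are parameters because they are guidelines rather than hard rules; a project with different requirements can document and use its own thresholds.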
There are several approaches to building your hierarchy. For some projects, it may be helpful to use a folder structure that follows the different parts or workflow of the project. This can support the step-by-step creation, analysis and publication of data (Biernacka et al., 2020; Bres et al., 2022). You can also consider basing your hierarchy on functionalities, people involved, date or time period, data types, creation methods or processing steps (Bres et al., 2023). Be careful to distinguish between (Schmid, 2021; Von der Dunk, 2021):
- Work vs. private material
- Own work vs. work of others (papers vs. literature)
- Research vs. administrative content
- Raw data vs. processed data vs. final data
- Experiment vs. analysis
- Experimental runs/replicates (where appropriate)
You should avoid using generic “current stuff” folders. Also, be careful about creating researcher-specific folders within a project: folders are about the content, not the authors (Pasquier, 2024). If you use researcher-specific folders, external contributors will not be able to understand what data is stored in these folders. Use one folder per dataset, containing data and its description. If you have multiple datasets, the project information can be described in the parent folder (Rehwald et al., 2022).
Make sure your categories do not overlap, so that you never need copies of a file in different folders: copies cause confusion and make it difficult to keep track of different versions of the file (Goldman, 2020a). If you need to see a file in more than one folder, you can use shortcuts to the file instead. This allows you to keep a single reference file (RDM Guide, n.d.). In particular, make sure you have a ‘raw data’ folder for each type of data or experiment (RDM Guide, n.d.). It is important to store your raw data separately so that the original versions of the files or their documentation are preserved and the original files can be reconstructed (Biernacka et al., 2020).
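Copies of the same file in different folders can be detected by comparing file contents. The sketch below is one possible approach, not a prescribed method: the function name `find_duplicates` is hypothetical, and it compares SHA-256 checksums, reading each file fully into memory, which is only suitable for moderately sized files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files below root that have identical content (same SHA-256 checksum)."""
    by_hash = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file at once; very large files would need chunked reading.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]


# Usage (the folder name is a placeholder):
# for group in find_duplicates(Path("Project")):
#     print("Identical files:", [p.as_posix() for p in group])
```

Each reported group is a candidate for keeping a single reference file and replacing the other copies with shortcuts.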
Example of folder structure
- Project
- Data
- Raw_data
- Processed_data
- Documentation
- Code
- Src
- Output
- Plots
- Documentation
- Protocols
- Data
- Manuscripts
- Conference_reports
- Administrative_information
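A folder skeleton like the one listed above can be created in one step, so that every new project starts with the same structure. This is a sketch under stated assumptions: the nesting in `LAYOUT` is hypothetical, because the list above does not show which folders sit inside which, and the function name `create_skeleton` is likewise illustrative.

```python
from pathlib import Path

# Hypothetical nesting of the folder names listed above; adapt it to your project.
LAYOUT = {
    "Data": ["Raw_data", "Processed_data"],
    "Code": ["Src"],
    "Output": ["Plots"],
    "Documentation": ["Protocols"],
    "Manuscripts": [],
    "Conference_reports": [],
    "Administrative_information": [],
}


def create_skeleton(project: Path, layout: dict[str, list[str]]) -> list[Path]:
    """Create the top-level folders and their subfolders; existing folders are left untouched."""
    created = []
    for folder, subfolders in layout.items():
        for path in [project / folder, *(project / folder / sub for sub in subfolders)]:
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


# Usage (the project name is a placeholder):
# create_skeleton(Path("Project"), LAYOUT)
```

Keeping the layout in a single dictionary also documents the agreed structure and makes it easy to reproduce identically on servers and local devices.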
Further resources on folder structure
- Checklist Directory Form
- Worksheet for Naming and Organizing Files and Folders
- Information on File Naming and Folder Hierarchy
Reusable folder structures
- GIN-Tonic
- Basic Folder Structure
- Folder Structure generator
- Template for research repositories
- Simple Open Data template
Tools
- Data Curation Tool (FAIR4Health)
- FAIRDOM: “Project space […] used by the community to organize, share and publish data, documents, literature and computational models, as well as to list contributors”
- G-Node Infrastructure (GIN) = Modern Research Data Management for Neuroscience (see Notes for more details)
References
- Wikipedia. (2024). 5S (methodology). http://en.wikipedia.org/w/index.php?title=5S%20(methodology)&oldid=1245207702
- Assmann, C., Gadelha, L., Markus, K., & Vandendorpe, J. (2022). Workshop on Research Data Management.
- LMA RDMWG. (2024a). File Naming Conventions. https://datamanagement.hms.harvard.edu/plan-design/file-naming-conventions
- Rehwald, S., Leimer, S., Lindstädt, B., Shutsko, A., & Vandendorpe, J. (2022). Workshop on Research Data Management in Medical and Biomedical Sciences.
- Biernacka, K., Bierwirth, M., Buchholz, P., Dolzycka, D., Helbig, K., Neumann, J., Odebrecht, C., Wiljes, C., & Wuttke, U. (2020). Train-the-Trainer Concept on Research Data Management. Zenodo. https://doi.org/10.5281/ZENODO.4071471
- Bobrov, E., Adam, L.-S., Söring, S., Jäckel, D., Herwig, A., Lindstädt, B., Vandendorpe, J., & Shutsko, A. (2021). Workshop on Research Data.
- Bres, E., Rudolf, D., Lindstädt, B., & Shutsko, A. (2022). Research Data Management in Medical and Biomedical Sciences.
- Goldman, J. (2020a). Organize Your Files [Checklist]. OSF. https://osf.io/fp9j5
- Lindstädt, B., Vandendorpe, J., & von der Ropp, S. (2019). Research Data Management.
- Briney, K. A. (2020). File Naming Convention Worksheet. California Institute of Technology.
- Di Russo, J. (2020). A Simple Story to Explain Version Control to Anyone. https://towardsdatascience.com/a-simple-story-to-explain-version-control-to-anyone-5ab4197cebbc
- Git. (n.d.). Getting Started - About Version Control. https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
- Mičetić, I., Popleteeva, M., Ahokas, M., Muhammad, N., Fuchs, S., & Kallberg, Y. (n.d.). Data organisation. RDMkit. https://rdmkit.elixir-europe.org/data_organisation
- Goldman, J. (2020b). Organize Your Files [Slides]. OSF. https://osf.io/yeqjv
- Bres, E., Rudolf, D., Lindstädt, B., Markus, K., Vandendorpe, J., & Riedel, R. (2023). Workshop on Research Data Management.
- RDM Guide. (n.d.). Folder structure. ELIXIR Belgium. https://rdm.elixir-belgium.org/folder_structure
- Voigt, P., Frericks, S., Lindstädt, B., Shutsko, A., & Vandendorpe, J. (2022). Workshop on Research Data.
- Schmid, F. (2021). Research data documentation - Best practices in filenaming and folder structure. https://ethz.ch/content/dam/ethz/associates/ethlibrary-dam/documents/Aktuell/Kurse/CoffeeLectures/2021-06-23_Coffee_Lecture_File-naming_final.pdf
- Von der Dunk, A. (2021). Lessons in Open Science - Best Practices in Personal RDM. https://www.slub-dresden.de/fileadmin/groups/slubsite/default_upload_automagic/2021-10-01_-__Lessons_in_Open_Science_-_Personal_RDM.pdf
- Pasquier, G. (2024). Folder Structure. Kathryn and Shelby Cullom Davis Library. https://libguides.graduateinstitute.ch/rdm/folders
Citation
National Research Data Infrastructure for Microbiota Research (NFDI4Microbiota). (2024, September 30). Data Organization. NFDI4Microbiota Knowledge Base.