How do I structure my data in a useful way?
Because research work often produces not only numerous data sets but also several versions of each as they pass through successive stages of modification, it is advisable to agree on uniform rules for file naming and versioning. This increases work efficiency and supports collaborative work. It also enables long-term traceability and re-usability of the data.
Furthermore, it may be useful to define separate folder structures for raw data, analysis data and data evaluations as well as for other project materials. For more information on data organisation, see Jensen 2012, pp. 40-42 (PDF).
Information on “File and Folder Organisation” is provided in this short lecture by Christian Krippes.
Read-only versions should be created at the various stages of modification (e.g., original data, cleaned data, data ready for analysis). Further editing should only be done in copies of these master files.
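The idea of read-only master files and editable copies can be sketched in Python as follows. This is a minimal illustration, not a prescribed workflow: the function names `freeze_master` and `working_copy` and the folder layout are assumptions made for the example, and on Windows the permission change only toggles the read-only flag.

```python
import shutil
import stat
from pathlib import Path


def freeze_master(source: Path, master_dir: Path) -> Path:
    """Copy `source` into `master_dir` and make the copy read-only."""
    master_dir.mkdir(parents=True, exist_ok=True)
    master = master_dir / source.name
    shutil.copy2(source, master)  # copy2 also preserves timestamps
    # Grant read permission only (owner, group, others); no write bits.
    master.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return master


def working_copy(master: Path, work_dir: Path) -> Path:
    """Create an editable copy of a read-only master file."""
    work_dir.mkdir(parents=True, exist_ok=True)
    copy = work_dir / master.name
    shutil.copy2(master, copy)  # the copy inherits the read-only mode ...
    copy.chmod(stat.S_IRUSR | stat.S_IWUSR)  # ... so re-enable writing
    return copy
```

A master would be frozen once per stage (for example after cleaning), and all further editing would then take place in the file returned by `working_copy`.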
Because research areas, and the data themselves, differ considerably, naming conventions can be designed quite differently. However, they should always reflect the type of data file (original data, cleaned files, analysis files) and its role (working file, results file, etc.).
The save date should be included in the file name in YYYYMMDD format, so that alphabetical order matches chronological order, and be placed at the beginning or end of the file name to facilitate sorting. Avoid special characters, symbols and spaces; use underscores instead. Names should always be consistent, clear and meaningful.
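These rules can be applied automatically. The following Python sketch builds a file name from descriptive parts and a save date; the helper names `clean_part` and `build_file_name` and the sample values are hypothetical, and the cleaning rule (anything other than ASCII letters and digits becomes an underscore) is one possible interpretation of "no special characters".

```python
import re
from datetime import date


def clean_part(text: str) -> str:
    """Replace spaces and special characters with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", text.strip())
    return cleaned.strip("_")


def build_file_name(parts: list[str], saved_on: date, extension: str) -> str:
    """Join the cleaned parts and append the save date as YYYYMMDD."""
    elements = [clean_part(part) for part in parts]
    elements.append(saved_on.strftime("%Y%m%d"))
    return "_".join(elements) + "." + extension.lstrip(".")


print(build_file_name(["Sediment", "Sample 12", "XRF"], date(2024, 3, 5), "dat"))
# prints: Sediment_Sample_12_XRF_20240305.dat
```

Note that this cleaning rule also replaces accented letters and umlauts, which may or may not be desired in a given project.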
Examples of file naming are (see also HU Berlin: Structuring files):
- [Sediment]_[Sample]_[Instrument]_[YYYYMMDD].dat
- [Experiment]_[Reagent]_[Instrument]_[YYYYMMDD].csv
- [Experiment]_[Experimental Design]_[Subject]_[YYYYMMDD].sav
- [Observation]_[Location]_[YYYYMMDD].mp4
- [Interviewee]_[Interviewer]_[YYYYMMDD].mp3
The file names listed here follow the so-called “pothole case” (also “snake case”) writing style. It can be recognised by the underscores used as separators between words. Usually, the letters after each underscore are written in lower case, but capitalising the initial letter of each word is also possible.
In addition to pothole case, there are numerous other writing styles for naming files. One of the most common is camel case, as in this example: “SedimentSampleInstrumentYYYYMMDD.dat”. Each word begins with a capital letter, and the words are not separated by an underscore, as in pothole case, or by any other special character. One disadvantage of this convention concerns the specification of versions (see next paragraph): version 1.0.0 would appear as 1_0_0 in pothole case but as 100 in camel case, where the boundaries between the three parts of the number are lost. Regardless of which style you choose, make sure to use the same one throughout the project.
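The versioning drawback of a separator-free style can be made concrete with a short Python sketch. The two helper functions are hypothetical and only mimic how a version number would be written in each style.

```python
def snake_version(version: str) -> str:
    """Write a dotted version with underscores, as in a pothole-case name."""
    return version.replace(".", "_")


def camel_version(version: str) -> str:
    """Drop the dots entirely, as a separator-free camel-case name would."""
    return version.replace(".", "")


for version in ("1.10.0", "11.0.0"):
    print(version, snake_version(version), camel_version(version))
# prints:
# 1.10.0 1_10_0 1100
# 11.0.0 11_0_0 1100
```

Two different versions collapse to the same string without separators, whereas the underscore form stays unambiguous.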
Changes to the data can be indicated by specifying the version in the file name. A well-known versioning scheme, based on the DDI (Data Documentation Initiative) standard, is Major.Minor.Revision.
Starting from version “1.0.0”, the following digits are incremented:
- the first digit, if cases, variables, waves or samples have been added or deleted
- the second digit, if data have been corrected in a way that affects the analysis
- the third digit, if simple revisions without relevance to meaning have been made
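The three rules above can be sketched as a small Python helper. The names `Change` and `bump` are hypothetical. Resetting the lower-order digits to zero after a major or minor change is an assumption borrowed from common versioning practice; the rules above do not state it.

```python
from enum import Enum


class Change(Enum):
    MAJOR = 0     # cases, variables, waves or samples added or deleted
    MINOR = 1     # corrections that affect the analysis
    REVISION = 2  # simple revisions without relevance to meaning


def bump(version: str, change: Change) -> str:
    """Return the next Major.Minor.Revision number for the given change."""
    numbers = [int(part) for part in version.split(".")]
    if len(numbers) != 3:
        raise ValueError(f"expected Major.Minor.Revision, got {version!r}")
    numbers[change.value] += 1
    # Assumption: digits to the right of the changed one restart at zero.
    for index in range(change.value + 1, 3):
        numbers[index] = 0
    return ".".join(str(number) for number in numbers)


print(bump("1.0.0", Change.MINOR))  # prints: 1.1.0
print(bump("1.1.3", Change.MAJOR))  # prints: 2.0.0
```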
Versioning can also be supported by appropriate software (e.g. Git).