From e11ea205839179da2be9b7a71120068c7e568cd2 Mon Sep 17 00:00:00 2001 From: Quarto GHA Workflow Runner Date: Tue, 10 Sep 2024 15:18:28 +0000 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- data-organisation.html | 64 +++++++++++++++++++++--------------------- search.json | 6 ++-- sitemap.xml | 22 +++++++-------- 4 files changed, 47 insertions(+), 47 deletions(-) diff --git a/.nojekyll b/.nojekyll index 79e1ea0..36fbbe8 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -d317e32d \ No newline at end of file +081d2420 \ No newline at end of file diff --git a/data-organisation.html b/data-organisation.html index 20ff77f..16684c0 100644 --- a/data-organisation.html +++ b/data-organisation.html @@ -503,13 +503,13 @@

Folder structure

-
+
@@ -528,13 +528,13 @@

Folder structure

-
+
@@ -553,13 +553,13 @@

Folder structure

-
+
@@ -582,13 +582,13 @@

Folder structure

-
+
@@ -755,13 +755,13 @@

File and folder name

For further reading: GitHub recommends version names like ‘1.3.2’ for the releases of software products; for details, see Semantic Versioning 2.0.0.

-
+
@@ -788,13 +788,13 @@

File and folder name

-
+
@@ -820,13 +820,13 @@

File and folder name

-
+
@@ -853,13 +853,13 @@

File and folder name

-
+
@@ -997,13 +997,13 @@

File formats

-
+
@@ -1034,13 +1034,13 @@

File formats

-
+
@@ -1071,13 +1071,13 @@

File formats

-
+
@@ -1109,13 +1109,13 @@

File formats

-
+
@@ -1260,13 +1260,13 @@

+
@@ -1293,13 +1293,13 @@

+
@@ -1326,13 +1326,13 @@

+
@@ -1360,13 +1360,13 @@

+
diff --git a/search.json b/search.json index 23f7522..f557a22 100644 --- a/search.json +++ b/search.json @@ -26,7 +26,7 @@ "href": "data-organisation.html#folder-structure", "title": "1) Organization", "section": "Folder structure", - "text": "Folder structure\nAt the start of your research project, you have to decide how to arrange your files and folders. This decision depends on the structure of your data and documentation. Organizational choices may involve trade-offs, such as the number of files per folder versus folder depth, intuitive names versus strict naming conventions, and structuring by processing level, access permissions, file size, or other criteria.\n\n\n\n\n\n\nTask 1.1: (~ 5 minutes)\n\n\n\nLook at the folder structure (but not at the details of the folder or file names yet, which will be the next task).\n\nIs the folder structure intuitive and logical (what is done, how, and why)?\nIs it explicitly described? Where can you find this information (metadata of repository or in a README file)?\nHow many files are stored per folder, and how deeply are they nested?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\nIn case there are no folders, you may discuss whether it would make sense to add folders.\n\n\n\n\n\n\n\n\nAvoid too long file paths\n\n\n\nDepending on the operating system, the total path length has an upper limit, e.g. 255 characters. Exceeding this limit will cause errors. Also note that the path of the copy may be even longer than your original path if you synchronize or backup your data, which can cause your sync or backup job to fail. 
Therefore, try to keep your full path clearly below such upper limits.\n\nBad example: X:/Projects/Microscopy_Project/Microscopy_Projects_2024/October_2024/RawData_October2024/Microscopy_RawData_Image003.tif\nBetter: X:/Projects/Microscopy/2024-10/RawData/Image003.tif\n\n\n\n\n\n\n\n\n\nFurther hints on folder structure\n\n\n\n\nAvoid deeply nested folder structures: SubSubSubSubSubFolders can be pretty inconvenient.\nAvoid too many files or subfolders within one folder:\nIt can be quite inconvenient to look through dozens of heterogeneous file names. In case of clearly structured file names (e.g. numbered files like Image003.tif or Plot01_Part03.tab), a larger number of elements per folder can also be fine. However, for huge amounts of files (several thousand), the performance of the file explorer may decrease.\nIn case different project members should have different access restrictions to files, this could also be considered in your folder structure.\n\n\n\n\n\n\n\n\n\nExamples\n\n\n\n\n\nExample for structuring a dataset: organized by file type1\n+ DatasetA\n + Data\n + Processed\n + Raw\n + Results\n + Figure1.tif\n + Figure2.tif\nExample for structuring a dataset: organized by analysis2\n+ DatasetB\n + Figure1\n + RawData\n + Results\n + Figure1.tif\n + Figure2\n + RawData\n + Results\n + Figure2.tif\nExample for a project folder structure3:\n+ Project_Folder\n + 1_Project_Management\n + Finance\n + Proposals\n + Reports\n + 2_Ethics_and_Governance\n + Consent_Forms\n + Ethical_Approvals\n + 3_Dissemination\n + Presentations\n + Publications\n + Publicity\n + Experiment_01\n + Data\n + Data_Analysis\n + Inputs\n + Outputs\nExample for a project folder structure4:\n\\\\file.mpic.de\\projects\\ExampleProject\\\n + GeneralOverview # General documentation of the project\n + Meetings # Meeting notes, presentations\n + INST # Instruments\n + Instrument1 # One folder per instrument\n + Doc # Documentation for this instrument\n + L_0 # Raw data\n + L_2 # 
Processed/analyzed data on original resolution\n + Product1 # One folder per data product\n + Code # Code used for creating the data of this product\n + Data # Data files of this product\n + Doc # Documentation for this data product\n + L_3 # Gridded data products\n + Product1 # One folder per data product (e.g. hourly averages)\n + Code # Code used for creating the data of this product\n + Data # Data files of this product\n + Doc # Documentation for this data product\n + Labbook # Labbook (photos of paper logbook or exports from ELN)\n\n\n\n\n\n\n\n\n\nSolution: Task 1.1, Group 1\n\n\n\n\n\n\n\nIs the folder structure intuitive and logical (what is done, how, and why)?\nIs it explicitly described? Where can you find this information (metadata of repository or in a README file)?\nHow many files are stored per folder, and how deeply are they nested?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\nIn case there are no folders, you may discuss whether it would make sense to add folders.\n\nThe dataset has 42 files, but no folder structure. Folders are not needed here, because all files (except for the README file) are of same type, just for different months. However, one could make one subfolder per year.\n\n\n\n\n\n\n\n\n\nSolution: Task 1.1, Group 2\n\n\n\n\n\n\n\nIs the folder structure intuitive and logical (what is done, how, and why)?\nIs it explicitly described? Where can you find this information (metadata of repository or in a README file)?\nHow many files are stored per folder, and how deeply are they nested?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\nIn case there are no folders, you may discuss whether it would make sense to add folders.\n\nThe dataset contains 6 files, whithout folder structure. However, 2 of them are of type ‘tar.gz’, which contain compressed ASCII files. The content is described in the README file. 
Also the tar.gz files do not contain many files, thus no further folder structure is needed.\n\n\n\n\n\n\n\n\n\nSolution: Task 1.1, Group 3\n\n\n\n\n\n\n\nIs the folder structure intuitive and logical (what is done, how, and why)?\nIs it explicitly described? Where can you find this information (metadata of repository or in a README file)?\nHow many files are stored per folder, and how deeply are they nested?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\nFollowing notes relate to the content of OSF Storage.\n\nYes, files are grouped into data, code (scripts) etc.\nThe content is described in the README file, but not completely.\nThere are up to 8 files per folder, one folder level.\n\n\n\n\n\n\n\n\n\n\nSolution: Task 1.1, Group 4\n\n\n\n\n\n\n\nIs the folder structure intuitive and logical (what is done, how, and why)?\nIs it explicitly described? Where can you find this information (metadata of repository or in a README file)?\nHow many files are stored per folder, and how deeply are they nested?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\nOn dataset-level, there is only one zip file, no folders. However, within the zip file, there is a folder structure:\n\nYes, intuitive structure: separation between data tables and scripts, …\nExplicitly described in README file.\nAround 3 to 4 folder levels. Folder Data/Group has 40 subfolders with 9 files each.", + "text": "Folder structure\nAt the start of your research project, you have to decide how to arrange your files and folders. This decision depends on the structure of your data and documentation. 
Organizational choices may involve trade-offs, such as the number of files per folder versus folder depth, intuitive names versus strict naming conventions, and structuring by processing level, access permissions, file size, or other criteria.\n\n\n\n\n\n\nTask 1.1: (~ 5 minutes)\n\n\n\nLook at the folder structure (but not at the details of the folder or file names yet, which will be the next task).\n\nIs the folder structure intuitive and logical (what is done, how, and why)?\nIs it explicitly described? Where can you find this information (metadata of repository or in a README file)?\nHow many files are stored per folder, and how deeply are they nested?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\nIn case there are no folders, you may discuss whether it would make sense to add folders.\n\n\n\n\n\n\n\n\nAvoid too long file paths\n\n\n\nDepending on the operating system, the total path length has an upper limit, e.g. 255 characters. Exceeding this limit will cause errors. Also note that the path of the copy may be even longer than your original path if you synchronize or backup your data, which can cause your sync or backup job to fail. Therefore, try to keep your full path clearly below such upper limits.\n\nBad example: X:/Projects/Microscopy_Project/Microscopy_Projects_2024/October_2024/RawData_October2024/Microscopy_RawData_Image003.tif\nBetter: X:/Projects/Microscopy/2024-10/RawData/Image003.tif\n\n\n\n\n\n\n\n\n\nFurther hints on folder structure\n\n\n\n\nAvoid deeply nested folder structures: SubSubSubSubSubFolders can be pretty inconvenient.\nAvoid too many files or subfolders within one folder:\nIt can be quite inconvenient to look through dozens of heterogeneous file names. In case of clearly structured file names (e.g. numbered files like Image003.tif or Plot01_Part03.tab), a larger number of elements per folder can also be fine. 
However, for huge amounts of files (several thousand), the performance of the file explorer may decrease.\nIn case different project members should have different access restrictions to files, this could also be considered in your folder structure.\n\n\n\n\n\n\n\n\n\nExamples\n\n\n\n\n\nExample for structuring a dataset: organized by file type1\n+ DatasetA\n + Data\n + Processed\n + Raw\n + Results\n + Figure1.tif\n + Figure2.tif\nExample for structuring a dataset: organized by analysis2\n+ DatasetB\n + Figure1\n + RawData\n + Results\n + Figure1.tif\n + Figure2\n + RawData\n + Results\n + Figure2.tif\nExample for a project folder structure3:\n+ Project_Folder\n + 1_Project_Management\n + Finance\n + Proposals\n + Reports\n + 2_Ethics_and_Governance\n + Consent_Forms\n + Ethical_Approvals\n + 3_Dissemination\n + Presentations\n + Publications\n + Publicity\n + Experiment_01\n + Data\n + Data_Analysis\n + Inputs\n + Outputs\nExample for a project folder structure4:\n\\\\file.mpic.de\\projects\\ExampleProject\\\n + GeneralOverview # General documentation of the project\n + Meetings # Meeting notes, presentations\n + INST # Instruments\n + Instrument1 # One folder per instrument\n + Doc # Documentation for this instrument\n + L_0 # Raw data\n + L_2 # Processed/analyzed data on original resolution\n + Product1 # One folder per data product\n + Code # Code used for creating the data of this product\n + Data # Data files of this product\n + Doc # Documentation for this data product\n + L_3 # Gridded data products\n + Product1 # One folder per data product (e.g. hourly averages)\n + Code # Code used for creating the data of this product\n + Data # Data files of this product\n + Doc # Documentation for this data product\n + Labbook # Labbook (photos of paper logbook or exports from ELN)\n\n\n\n\n\n\n\n\n\nSolution: Example 1\n\n\n\n\n\n\n\nIs the folder structure intuitive and logical (what is done, how, and why)?\nIs it explicitly described? 
Where can you find this information (metadata of repository or in a README file)?\nHow many files are stored per folder, and how deeply are they nested?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\nIn case there are no folders, you may discuss whether it would make sense to add folders.\n\nThe dataset has 42 files, but no folder structure. Folders are not needed here, because all files (except for the README file) are of the same type, just for different months. However, one could make one subfolder per year.\n\n\n\n\n\n\n\n\n\nSolution: Example 2\n\n\n\n\n\n\n\nIs the folder structure intuitive and logical (what is done, how, and why)?\nIs it explicitly described? Where can you find this information (metadata of repository or in a README file)?\nHow many files are stored per folder, and how deeply are they nested?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\nIn case there are no folders, you may discuss whether it would make sense to add folders.\n\nThe dataset contains 6 files, without folder structure. However, 2 of them are of type ‘tar.gz’, which contain compressed ASCII files. The content is described in the README file. Also, the tar.gz files do not contain many files, thus no further folder structure is needed.\n\n\n\n\n\n\n\n\n\nSolution: Example 3\n\n\n\n\n\n\n\nIs the folder structure intuitive and logical (what is done, how, and why)?\nIs it explicitly described? 
Where can you find this information (metadata of repository or in a README file)?\nHow many files are stored per folder, and how deeply are they nested?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\nThe following notes relate to the content of OSF Storage.\n\nYes, files are grouped into data, code (scripts) etc.\nThe content is described in the README file, but not completely.\nThere are up to 8 files per folder, one folder level.\n\n\n\n\n\n\n\n\n\n\nSolution: Example 4\n\n\n\n\n\n\n\nIs the folder structure intuitive and logical (what is done, how, and why)?\nIs it explicitly described? Where can you find this information (metadata of repository or in a README file)?\nHow many files are stored per folder, and how deeply are they nested?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\nAt the dataset level, there is only one zip file, no folders. However, within the zip file, there is a folder structure:\n\nYes, intuitive structure: separation between data tables and scripts, …\nExplicitly described in the README file.\nAround 3 to 4 folder levels. Folder Data/Group has 40 subfolders with 9 files each.", "crumbs": [ "About", "1) Organization" ] }, { @@ -37,7 +37,7 @@ "href": "data-organisation.html#file-and-folder-names", "title": "1) Organization", "section": "File and folder names", - "text": "File and folder names\nIn the next section, we will explore best practices for file and folder naming to create a clear and organized data structure. File or folder names have the following primary purposes:\n\nAlways: Uniquely identify the file or folder (within a folder),\nOften: Give information about its content, e.g. README.txt, MeetingProtocol.docx, Temperature_RawData.tab,\nSometimes: Enable logical order when sorting alphabetically, e.g. 1_RawData, 2_PreProcessed, 3_Processed, 4_Combined.\n\nGenerally, the same rules apply to the naming of folders and files. 
They shall allow to choose the desired file amongst all the other files of the folder. Therefore, the names should be concise and intuitive (if applicable). For instance, a file named XYZ123 might not be immediately clear, so it’s important to explain its purpose somewhere, typically in a README file. Well-structured folders have clear naming conventions, which are explicitly described.\n\n\n\n\n\n\nTask 1.2: (~ 10 minutes)\n\n\n\n\nWhat naming convention is used in this dataset? Is it intuitive and logical? Is it explicitly described?\nAre the names meaningful? Are there misleading names?\nIn case of multiple files: Do they appear in a logical order when sorted alphabetically?\nAre there problematic characters like spaces, non-ASCII characters, etc.?\nWhat about the length of the names?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\n\n\n\n\n\n\n\nDo not use bad characters.\n\n\n\nDepending on the operating system and application, some characters are forbidden or may lead to problems and, thus, should be avoided.\n\nVery bad: Any non-ASCII character, e.g., öäüßµαδ°±•€→☺É\nBad: Any whitespace character, e.g. File 1.txt. They can cause problems, e.g., in some batch tasks, in particular, if one forgets to surround the name with quotes. Furthermore, double or multiple spaces and spaces at the beginning of the name are not clearly visible.\nForbidden in Windows: \\/:*?\"<>|\nAlso not recommended: ,;()[]{} etc.\n\nTo summarize: You should only use Latin letters A-Z, a-z, digits 0-9, underscore, hyphen and dot, i.e. following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz123456789_-.\nFurthermore, the dot should only be used in file names, and there only once before the file extension, e.g. “Notes.txt”. Some programs use a dot or underscore as the first character for special file types, e.g. 
_quarto.yml or .git and thus should be avoided for regular data files.\n\n\n\n\n\n\n\n\nNo ‘hello.txt’ and ‘Hello.txt’ in same folder.\n\n\n\nEnsure that subfolders and files have unique names within a folder, even in case-insensitive ways. For example, do not put two files named hello.txt and Hello.txt in the same folder.\nThis note is particularly relevant for Linux users, where putting both files in the same folder is possible. However, in Windows, that is not allowed. Thus, sharing such a folder between users of different operating systems would cause problems.\n\n\n\n\n\n\n\n\nExcursion: Ordering and timestamps\n\n\n\nA naming convention can enable a logical order of the file or folder names when sorting them alphabetically. Here, we provide some tips:\n\nWhen names include numbers, leading zeros are often helpful:\n\nOrdering with “0”:\nScan01.csv, Scan02.csv, Scan03.csv, Scan04.csv, Scan05.csv, Scan06.csv,\nScan07.csv, Scan08.csv, Scan09.csv, Scan10.csv, Scan11.csv, Scan12.csv\nOrdering without:\nScan1.csv, Scan10.csv, Scan11.csv, Scan12.csv, Scan2.csv, Scan3.csv,\nScan4.csv, Scan5.csv, Scan6.csv, Scan7.csv, Scan8.csv, Scan9.csv\n\nTimestamps should always be given with a leading zero and ‘from big to small’, i.e. year, month, day of month, hour, minute, second. This recommendation complies with the international format ISO 8601 (e.g. “2024-07-31”, “2024-07-31T2313”).\n\nVery bad: 13Jan2024, 21April2021, 3Dec2025\nAlso bad: 03122025, 13012024, 21042021\nGood: 2021-04-21, 2024-01-13, 2025-12-03\nAlso ok: 20210421, 20240113, 20251203\nIncluding time of day: 20210421T0345, 20240113T1730, 20251203T1900 for 03:45, 17:30, 19:00\n\n\n\n\n\n\n\n\n\n\nFurther good practice for file naming\n\n\n\n\nInclude relevant information in the file name. However, don’t misuse a file name as a way to store all your metadata.\nAvoid overly long names (a maximum of 32 characters is suggested). 
Mind also the previous note about the full path length.\nAvoid moving or renaming folders or files. This is especially relevant when you or others have referred to the file by using its file name or path.\nGenerate a README file explaining file nomenclature (including the meaning of acronyms or abbreviations), file organization and versioning. Store this file on top of the folder structure for easy accessibility.\n\nThere are different possibilities to indicate logical units in a name without using a whitespace:\n\nKebab-case: The-quick-brown-fox-jumps-over-the-lazy-dog.txt\nCamelCase: TheQuickBrownFoxJumpsOverTheLazyDog.txt\nSnake_case: The_quick_brown_fox_jumps_over_the_lazy_dog.txt\n\nCompromises often have to be made, such as including relevant information versus avoiding long names. Note that folder names with a precise and narrow meaning may become outdated when further content is filled in over time. Because of that, persistent identifiers (PID) typically avoid to include semantic information, e.g. doi:10.17617/3.1STIJV.\n\n\n\n\n\n\n\n\nExcursion: Versioning\n\n\n\nDocuments may evolve over time. File versioning allows for reverting to earlier versions if needed and shall allow for keeping track of changes, including documentation on the underlying rationale and people involved.\nVersion control can be done either manually by using naming conventions or by using a version control system like Git. The following hints apply to manual version control, meaning that you store both the current and previous versions in your file system.\n\nVersions should be numbered consecutively, e.g. Handbook_v3.pdf. Major changes (v1, v2, v3, …) can be distinguished from minor ones (v1-1, v1-2, v1-3 or 1a, 1b, 1c). You may use leading zeros if you expect more than nine versions.\nAlternatively, a date or timestamp could indicate the version, e.g. Handbook_v20240725.pdf.\nYou may use qualifiers such as “raw” or “processed” for data or “draft” or “internal” for documents. 
However, note that terms such as “final”, “final2”, “final-revised”, “final-changed_again”, and “final_ready” can be confusing. In other words: Avoid the word “final” in file names.\nDocument your versioning convention, e.g. what you mean with major or minor changes.\nDocument the essential changes you have made between the versions.\n\nFor further reading: GitHub recommends version names like ‘1.3.2’ for the releases of software products, details see Semantic Versioning 2.0.0.\n\n\n\n\n\n\n\n\nSolution: Task 1.2, Group 1\n\n\n\n\n\n\n\nWhat is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?\nAre the names meaningful? Are there misleading names?\nIn case of multiple files: Do they appear in a logical order when sorted alphabetically?\nAre there problematic characters like spaces, non-ASCII characters, etc.?\nWhat about the length of the names?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\n\nFiles consist of prefix amb_hourly_qc_wc4.4_cal6.0_, followed by year and month (e.g. 2014_08), followed by _core-params.csv. Prefix might be intuitive for researchers of that field, but it is not explicitly described.\nProbably the prefix has some meaning, but as it is not explicitly stated in the README, we can only speculate.\nYes, files are sorted according to the month of measurement.\nYes: File name AMB hourly, readme.rtf contains spaces and a comma. The other file names contain several dots (should only be one dot, namely before the file extension csv).\nLength is not problematic, but longer than needed.\nReplace AMB hourly, readme.rtf by README.rtf. The other file names can be shortened, e.g. HourlyCoreParams_2014_08.csv. 
And if the file name prefix amb_hourly_qc_wc4.4_cal6.0_ contains relevant information, this should be explicitly given in the metadata or README file.\n\n\n\n\n\n\n\n\n\n\nSolution: Task 1.2, Group 2\n\n\n\n\n\n\n\nWhat is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?\nAre the names meaningful? Are there misleading names?\nIn case of multiple files: Do they appear in a logical order when sorted alphabetically?\nAre there problematic characters like spaces, non-ASCII characters, etc.?\nWhat about the length of the names?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\n\nThe dataset contains only 6 files, thus there is not really a convention available, and also not needed. The non-intuitive parts wos and fo are explained in the README file (namely “Web of Science data” and “Faculty Opinions data”). The files inside the tar.gz-files seem to follow some convention, and their content is explicitly mentioned in the README file.\nProbably yes. Anyhow, their content is mentioned in the README file.\nOnly few files, does order not important.\nNo problematic characters found.\nLength of the names: OK\n\n\n\n\n\n\n\n\n\n\nSolution: Task 1.2, Group 3\n\n\n\n\n\n\n\nWhat is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?\nAre the names meaningful? Are there misleading names?\nIn case of multiple files: Do they appear in a logical order when sorted alphabetically?\nAre there problematic characters like spaces, non-ASCII characters, etc.?\nWhat about the length of the names?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\nFollowing notes relate to the content of OSF Storage.\n\nNot so clear, but not many files, thus no clear conventions needed. 
However, the files in folder result are lacking an explanation in the README file, and their names are not very intuitive.\nMeaning not clear for all files.\nOnly few files, thus order not important.\nNo problematic characters found in OSF Storage.\nLength of the names: OK\n\n\n\n\n\n\n\n\n\n\nSolution: Task 1.2, Group 4\n\n\n\n\n\n\n\nWhat is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?\nAre the names meaningful? Are there misleading names?\nIn case of multiple files: Do they appear in a logical order when sorted alphabetically?\nAre there problematic characters like spaces, non-ASCII characters, etc.?\nWhat about the length of the names?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\n\nThe subfolders of Data/Group and Data/Solo have names like 01_09_2022__10_13_33, which seem to refer to a date and maye time of day.\nYes, names are meaningful and intuitive.\nSubfolders are not in a chronological order, because the date is given in a disadvantageous format (e.g. 01_09_2022) - better would be 2022_09_01 or 2022-09-01.\nYes: Folder name Stan model code contains spaces.\nLength of the names: OK\nIf folder name 01_09_2022__10_13_33 stands for timestamp 2022-09-01T10:13:33, then it could be renamed to 20220901T101333 or 2022-09-01_101333.", + "text": "File and folder names\nIn the next section, we will explore best practices for file and folder naming to create a clear and organized data structure. File or folder names have the following primary purposes:\n\nAlways: Uniquely identify the file or folder (within a folder),\nOften: Give information about its content, e.g. README.txt, MeetingProtocol.docx, Temperature_RawData.tab,\nSometimes: Enable logical order when sorting alphabetically, e.g. 1_RawData, 2_PreProcessed, 3_Processed, 4_Combined.\n\nGenerally, the same rules apply to the naming of folders and files. 
They should make it possible to pick out the desired file from among all the other files in the folder. Therefore, the names should be concise and intuitive (if applicable). For instance, a file named XYZ123 might not be immediately clear, so it’s important to explain its purpose somewhere, typically in a README file. Well-structured folders have clear naming conventions, which are explicitly described.\n\n\n\n\n\n\nTask 1.2: (~ 10 minutes)\n\n\n\n\nWhat naming convention is used in this dataset? Is it intuitive and logical? Is it explicitly described?\nAre the names meaningful? Are there misleading names?\nIn case of multiple files: Do they appear in a logical order when sorted alphabetically?\nAre there problematic characters like spaces, non-ASCII characters, etc.?\nWhat about the length of the names?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\n\n\n\n\n\n\n\nDo not use bad characters.\n\n\n\nDepending on the operating system and application, some characters are forbidden or may lead to problems and, thus, should be avoided.\n\nVery bad: Any non-ASCII character, e.g., öäüßµαδ°±•€→☺É\nBad: Any whitespace character, e.g. File 1.txt. They can cause problems, e.g., in some batch tasks, in particular, if one forgets to surround the name with quotes. Furthermore, double or multiple spaces and spaces at the beginning of the name are not clearly visible.\nForbidden in Windows: \\/:*?\"<>|\nAlso not recommended: ,;()[]{} etc.\n\nTo summarize: You should only use Latin letters A-Z, a-z, digits 0-9, underscore, hyphen and dot, i.e. the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-.\nFurthermore, the dot should only be used in file names, and there only once before the file extension, e.g. “Notes.txt”. Some programs use a dot or underscore as the first character for special file types, e.g. 
_quarto.yml or .git and thus should be avoided for regular data files.\n\n\n\n\n\n\n\n\nNo ‘hello.txt’ and ‘Hello.txt’ in same folder.\n\n\n\nEnsure that subfolders and files have unique names within a folder, even in case-insensitive ways. For example, do not put two files named hello.txt and Hello.txt in the same folder.\nThis note is particularly relevant for Linux users, where putting both files in the same folder is possible. However, in Windows, that is not allowed. Thus, sharing such a folder between users of different operating systems would cause problems.\n\n\n\n\n\n\n\n\nExcursion: Ordering and timestamps\n\n\n\nA naming convention can enable a logical order of the file or folder names when sorting them alphabetically. Here, we provide some tips:\n\nWhen names include numbers, leading zeros are often helpful:\n\nOrdering with “0”:\nScan01.csv, Scan02.csv, Scan03.csv, Scan04.csv, Scan05.csv, Scan06.csv,\nScan07.csv, Scan08.csv, Scan09.csv, Scan10.csv, Scan11.csv, Scan12.csv\nOrdering without:\nScan1.csv, Scan10.csv, Scan11.csv, Scan12.csv, Scan2.csv, Scan3.csv,\nScan4.csv, Scan5.csv, Scan6.csv, Scan7.csv, Scan8.csv, Scan9.csv\n\nTimestamps should always be given with a leading zero and ‘from big to small’, i.e. year, month, day of month, hour, minute, second. This recommendation complies with the international format ISO 8601 (e.g. “2024-07-31”, “2024-07-31T2313”).\n\nVery bad: 13Jan2024, 21April2021, 3Dec2025\nAlso bad: 03122025, 13012024, 21042021\nGood: 2021-04-21, 2024-01-13, 2025-12-03\nAlso ok: 20210421, 20240113, 20251203\nIncluding time of day: 20210421T0345, 20240113T1730, 20251203T1900 for 03:45, 17:30, 19:00\n\n\n\n\n\n\n\n\n\n\nFurther good practice for file naming\n\n\n\n\nInclude relevant information in the file name. However, don’t misuse a file name as a way to store all your metadata.\nAvoid overly long names (a maximum of 32 characters is suggested). 
Mind also the previous note about the full path length.\nAvoid moving or renaming folders or files. This is especially relevant when you or others have referred to the file by using its file name or path.\nGenerate a README file explaining file nomenclature (including the meaning of acronyms or abbreviations), file organization and versioning. Store this file on top of the folder structure for easy accessibility.\n\nThere are different possibilities to indicate logical units in a name without using whitespace:\n\nKebab-case: The-quick-brown-fox-jumps-over-the-lazy-dog.txt\nCamelCase: TheQuickBrownFoxJumpsOverTheLazyDog.txt\nSnake_case: The_quick_brown_fox_jumps_over_the_lazy_dog.txt\n\nCompromises often have to be made, such as including relevant information versus avoiding long names. Note that folder names with a precise and narrow meaning may become outdated when further content is filled in over time. Because of that, persistent identifiers (PID) typically avoid including semantic information, e.g. doi:10.17617/3.1STIJV.\n\n\n\n\n\n\n\n\nExcursion: Versioning\n\n\n\nDocuments may evolve over time. File versioning allows for reverting to earlier versions if needed and for keeping track of changes, including documentation on the underlying rationale and people involved.\nVersion control can be done either manually by using naming conventions or by using a version control system like Git. The following hints apply to manual version control, meaning that you store both the current and previous versions in your file system.\n\nVersions should be numbered consecutively, e.g. Handbook_v3.pdf. Major changes (v1, v2, v3, …) can be distinguished from minor ones (v1-1, v1-2, v1-3 or 1a, 1b, 1c). You may use leading zeros if you expect more than nine versions.\nAlternatively, a date or timestamp could indicate the version, e.g. Handbook_v20240725.pdf.\nYou may use qualifiers such as “raw” or “processed” for data or “draft” or “internal” for documents. 
However, note that terms such as “final”, “final2”, “final-revised”, “final-changed_again”, and “final_ready” can be confusing. In other words: Avoid the word “final” in file names.\nDocument your versioning convention, e.g. what you mean by major or minor changes.\nDocument the essential changes you have made between the versions.\n\nFor further reading: GitHub recommends version names like ‘1.3.2’ for the releases of software products; for details, see Semantic Versioning 2.0.0.\n\n\n\n\n\n\n\n\nSolution: Example 1\n\n\n\n\n\n\n\nWhat is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?\nAre the names meaningful? Are there misleading names?\nIn case of multiple files: Do they appear in a logical order when sorted alphabetically?\nAre there problematic characters like spaces, non-ASCII characters, etc.?\nWhat about the length of the names?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\n\nFiles consist of prefix amb_hourly_qc_wc4.4_cal6.0_, followed by year and month (e.g. 2014_08), followed by _core-params.csv. The prefix might be intuitive for researchers of that field, but it is not explicitly described.\nProbably the prefix has some meaning, but as it is not explicitly stated in the README, we can only speculate.\nYes, files are sorted according to the month of measurement.\nYes: File name AMB hourly, readme.rtf contains spaces and a comma. The other file names contain several dots (should only be one dot, namely before the file extension csv).\nLength is not problematic, but longer than needed.\nReplace AMB hourly, readme.rtf by README.rtf. The other file names can be shortened, e.g. HourlyCoreParams_2014_08.csv. 
And if the file name prefix amb_hourly_qc_wc4.4_cal6.0_ contains relevant information, this should be explicitly given in the metadata or README file.\n\n\n\n\n\n\n\n\n\n\nSolution: Example 2\n\n\n\n\n\n\n\nWhat is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?\nAre the names meaningful? Are there misleading names?\nIn case of multiple files: Do they appear in a logical order when sorted alphabetically?\nAre there problematic characters like spaces, non-ASCII characters, etc.?\nWhat about the length of the names?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\n\nThe dataset contains only 6 files, so there is not really a naming convention, nor is one needed. The non-intuitive parts wos and fo are explained in the README file (namely “Web of Science data” and “Faculty Opinions data”). The files inside the tar.gz-files seem to follow some convention, and their content is explicitly mentioned in the README file.\nProbably yes. Anyhow, their content is mentioned in the README file.\nOnly few files, thus order not important.\nNo problematic characters found.\nLength of the names: OK\n\n\n\n\n\n\n\n\n\n\nSolution: Example 3\n\n\n\n\n\n\n\nWhat is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?\nAre the names meaningful? Are there misleading names?\nIn case of multiple files: Do they appear in a logical order when sorted alphabetically?\nAre there problematic characters like spaces, non-ASCII characters, etc.?\nWhat about the length of the names?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\nFollowing notes relate to the content of OSF Storage.\n\nNot so clear, but not many files, thus no clear conventions needed. 
However, the files in folder result are lacking an explanation in the README file, and their names are not very intuitive.\nMeaning not clear for all files.\nOnly few files, thus order not important.\nNo problematic characters found in OSF Storage.\nLength of the names: OK\n\n\n\n\n\n\n\n\n\n\nSolution: Example 4\n\n\n\n\n\n\n\nWhat is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?\nAre the names meaningful? Are there misleading names?\nIn case of multiple files: Do they appear in a logical order when sorted alphabetically?\nAre there problematic characters like spaces, non-ASCII characters, etc.?\nWhat about the length of the names?\nDiscuss: What would you leave as it is, what would you change, or what are the alternatives?\n\n\n\nThe subfolders of Data/Group and Data/Solo have names like 01_09_2022__10_13_33, which seem to refer to a date and maye time of day.\nYes, names are meaningful and intuitive.\nSubfolders are not in a chronological order, because the date is given in a disadvantageous format (e.g. 01_09_2022) - better would be 2022_09_01 or 2022-09-01.\nYes: Folder name Stan model code contains spaces.\nLength of the names: OK\nIf folder name 01_09_2022__10_13_33 stands for timestamp 2022-09-01T10:13:33, then it could be renamed to 20220901T101333 or 2022-09-01_101333.", "crumbs": [ "About", "1) Organization" @@ -48,7 +48,7 @@ "href": "data-organisation.html#file-formats", "title": "1) Organization", "section": "File formats", - "text": "File formats\nA file format has to be chosen when storing information in a file. It builds the backbone of your data and is usually specified by the file extension (e.g. .txt). To keep your data interoperable, the format needs a clear structure. This makes your data easy to read with many software products (e.g., out-of-the-box solutions or by writing a small script). Clear documentation of the file format shall be publicly available. 
Considering all these aspects, the chance is high that the file can be read in future, making it suitable for long-term preservation - which is one of our main goals when managing data. Therefore, open file formats are recommended, while proprietary formats should be avoided.\nIdeally, when choosing a suitable format, you’ll consider the following properties:\n\nReadable by humans with a simple editor\nReadable with many programs\nEasy to understand, low complexity\nSmall (storage space)\nQuick to read (performance)\n\nHowever, usually compromises have to be made. For example, binary files are generally more performant than csv files and thus more suitable during the active research process. At the same time, csv is a well-established format for long-term preservation and is easier for humans to read.\n\n\n\n\n\n\nTask 1.3: (~ 10 minutes)\n\n\n\n\nAre the files stored in an open or a proprietary format?\nIs the file format used “future-proof”, e.g., suitable for long-term archiving?\nHow easy is it to open the file (regarding available programs and file size)?\nHow complex are the files? What is their internal structure?\nWhat about performance and file size?\nHow easy is it to understand the file as humans?\nAre they machine-readable and standardized? How easy is it to write a script to read the files?\nWhich alternative formats exist?\n\n\n\n\n\n\n\n\n\nAvoid proprietary formats.\n\n\n\nOften, proprietary formats have intentionally no proper documentation as the company behind the system wants to keep their business information behind closed doors. The companies sometimes even use technical protection mechanisms, making the file format readable only by commercial software. This reduces the interoperability and reusability of the files and, in the worst case, makes them unreadable in the long term. (Imagine the company that provided the software and file format no longer exists.) Furthermore, the files might contain hidden (potentially sensitive) information. 
Thus, such formats should be avoided.\n\n\n\n\n\n\n\n\nExamples of recommended formats\n\n\n\nIn the following list, you’ll find some formats which are widely used, well-documented and readable with several programs.\n\nFor documentation:\n\nPlain text (.txt)\nHTML, XHTML, Markdown\nPDF (PDF/A-1)\nmaybe: Rich Text Format (.rtf), Open Document Text (.odt), docx, …\n\nTabular data:\n\nComma-separated values (.csv)\nTab-delimited (.tab)\nmaybe: Open Document Spreadsheet (.ods), xlsx, …\n\nNested data:\n\nJSON\nXML\n\nFurther formats:\n\nNetCDF, HDF5, …\npng, jpg, …\n\n\nNotes:\n\nPDF: PDF has been developed by Adobe Inc. and thus originally had been a proprietary format, and several versions exist. Nevertheless, the format is widely used today. For archival purposes, a PDF/A version is the best choice. PDF is best suited for fixed documentation. However, editing PDF files or extracting data from them takes a lot of work.\nSpreadsheet files: Spreadsheets may look nice, particularly when formatted in a colourful way. But for the machine-readability, this can cause problems. In particular, we do not recommend that you present relevant information just by formatting content differently. You can take this as a rule of thumb: Spreadsheet files like .xlsx or .ods are not well machine-readable.\n\n\n\n\n\n\n\n\n\nExcursion: Premium format ASCII\n\n\n\n\n\nA gold standard for storing digital information is an ASCII file. 
In an ASCII file, each byte represents one visible character (except for the white spaces and control characters like tab stop and linebreaks).\nTherefore, ASCII files can be read or opened by any text editor or data-processing software, even with programs like Excel, Word, Wordpad or web browsers (only possibly limited regarding the file size).\nCharacters beyond ASCII:\nAn ASCII file can only contain the following visible characters: !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Otherwise, it is not an ASCII file.\nFor some years, the Unicode-based file format “UTF-8” has been available, which can represent many characters beyond the ASCII characters, like “ü”, “€”, and even some smilies ☺. Nowadays, UTF-8 is supported by many editors and browsers. The good thing about UTF-8 is that as long as a UTF-8 file contains only ASCII characters, the UTF-8 file is automatically an ASCII file. In other words, an ASCII file is a super-interoperable UTF-8 file.\n\n\n\n\n\n\n\n\n\nSolution: Task 1.3, Group 1\n\n\n\n\n\n\n\nAre the files stored in an open or a proprietary format?\nIs the file format used “future-proof”, e.g., suitable for long-term archiving?\nHow easy is it to open the file (regarding available programs and file size)?\nHow complex are the files? What is their internal structure?\nWhat about performance and file size?\nHow easy is it to understand the file as humans?\nAre they machine-readable and standardized? How easy is it to write a script to read the files?\nWhich alternative formats exist?\n\n\n\nFiles are ASCII files, thus open.\nYes, ASCII is suitable for long-term archiving.\nEasy to open, e.g. with text editor.\nFiles have tabular shape.\nOK, file sizes are below 1 MB.\nShape: easy to understand, meaning of the columns given in README file.\nMost data analysis programs have import functions for csv. 
The quotes in the first column might be cumbersome for some import routines.\nTab-separated files, spreadsheet files, etc.\n\n\n\n\n\n\n\n\n\n\nSolution: Task 1.3, Group 2\n\n\n\n\n\n\n\nAre the files stored in an open or a proprietary format?\nIs the file format used “future-proof”, e.g., suitable for long-term archiving?\nHow easy is it to open the file (regarding available programs and file size)?\nHow complex are the files? What is their internal structure?\nWhat about performance and file size?\nHow easy is it to understand the file as humans?\nAre they machine-readable and standardized? How easy is it to write a script to read the files?\nWhich alternative formats exist?\n\n\n\nThe small files are ASCII or UTF-8 files, thus open. The tar.gz files are compressed TAR-files, thus also in an open format.\nYes, ASCII is definitely suitable for long-term archiving. Also tar.gz files are widely used and can thus be considered suitable for long-term archiving.\nThe tar.gz files need specific software for extraction, which is freely available, but maybe not installed everywhere, and not all people are familiar with it. Thus it is commendable that the extraction is described in the README file. However, the file size of several GB can be problematic for users having a slow internet connection. And unpacked, the largest file is more than 26 GB, more than the RAM size of many computers.\nThe data files (inside the tar.gz) are not complex, just tables.\nDue to compression, the file size is reduced for storage and download. However, the tables contain many digits, probably more than needed. Reducing them would decrease file size. 
Binary files instead of ASCII files would need less time for loading.\nShape: easy to understand, meaning of column see README file.\nMost data analysis programs have import functions for csv.\nBinary files like HDF, which could enhance performance.\n\n\n\n\n\n\n\n\n\n\nSolution: Task 1.3, Group 3\n\n\n\n\n\n\n\nAre the files stored in an open or a proprietary format?\nIs the file format used “future-proof”, e.g., suitable for long-term archiving?\nHow easy is it to open the file (regarding available programs and file size)?\nHow complex are the files? What is their internal structure?\nWhat about performance and file size?\nHow easy is it to understand the file as humans?\nAre they machine-readable and standardized? How easy is it to write a script to read the files?\nWhich alternative formats exist?\n\n\nFollowing notes relate to the content of “OSF Storage”.\n\nMost files are in an open format: ASCII tables, JSON files, R scripts. But what are “nii.gz” files in folder “results” - maybe zipped NIfTI files?\nYes for ASCII tables and JSON files; maybe yes for nii.gz files.\nASCII tables and JSON files: easy to open with every text editor, special software or libraries needed for nii.gz.\nFiles in folder data are tables (csv) or Codebooks (in JSON format) describing those.\nOK, because the files are not very large.\nASCII tables and JSON files are easy to understand by humans; nii.gz needs suitable software.\nMost data analysis programs have import functions for csv, also JSON import functions are available for several programs.\nFor csv-tables: Tab-separated files, spreadsheet files, etc; for JSON: XML\n\n\n\n\n\n\n\n\n\n\nSolution: Task 1.3, Group 4\n\n\n\n\n\n\n\nAre the files stored in an open or a proprietary format?\nIs the file format used “future-proof”, e.g., suitable for long-term archiving?\nHow easy is it to open the file (regarding available programs and file size)?\nHow complex are the files? 
What is their internal structure?\nWhat about performance and file size?\nHow easy is it to understand the file as humans?\nAre they machine-readable and standardized? How easy is it to write a script to read the files?\nWhich alternative formats exist?\n\n\n\nFiles are stored as ASCII tables or plain text files, which are open formats.\nYes, suitable for long-term archiving.\nEasy, readable with text editor.\nData files are ASCII tables.\nDue to compression, the file size is reduced for storage and download. Binary files instead of ASCII files would need less time for loading.\nThe format is easy to understand by humans, but the columns are not explicitly described.\nMost data analysis programs have import functions for semicolon-separated tables.\nBinary files like HDF could be used (cf note above related to performance).\n\n\n\n\n\nSpecial file types: tabular text file (optional)\nPlease note that the task in this section is optional. You can go through this section if you still have some time left during the workshop or read it afterwards.\nTabular text files store data in a structured format, where each row represents a record and each column represents a field, with data separated by a designated column separator. Even after deciding to store tabular data in text files (e.g. files which can be opened in any editor), there are various ways and conventions to choose from:\n\nColumn separator: typically tab or comma, sometimes space or semicolon\nNumeric values: handling of missing values (e.g. “NA”, ““, etc.)\nRepresentation of timestamps, e.g. “2024-08-01T08:59”\nHeader lines with meta information?\nEncoding: Recommended is ASCII or UTF-8\n\n\n\n\n\n\n\nTask 1.4: (~ 5 minutes)\n\n\n\n\nHow is the file encoded (e.g. ASCII, UTF-8)?\nNumbers: What about their precision (enough or too much)?\nSpecial numbers: Do special numbers like “NA”, ““,”N/A”, “999”, “0” occur? Is their meaning documented?\nTime: Which format is used for the date and time of day? 
Which time zone is used?\nTables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?\nIs the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?\n\n\n\n\n\n\n\n\n\nExample\n\n\n\n\n\nFirst, you will find an example of a very bad file, followed by an improved version.\nFile Measured last month.txt:\ndate, time,sensor,sensor\n03/07/24 12.00 AM,17.3\n03/07/24 1.00 AM,16.9\n03/07/24 2.00 AM,16.7\n03/07/24 3.00 AM,16.4\n03/07/24 4.00 AM,16.2\n03/07/24 5.00 AM,15.9\n03/07/24 6.00 AM\n03/07/24 7.00 AM\n03/07/24 8.00 AM\n03/07/24 9.00 AM,16.5\n03/07/24 10.00 AM,17.0,7.2\n03/07/24 11.00 AM,17.6,4.6\n03/07/24 12.00 PM,18.0\n03/07/24 1.00 PM,18.5\nWe gathered some comments on that file:\n\nFirst, we notice that the file name is bad. It contains spaces, and “last month” is not a meaningful name (which month is meant by “last”?).\nThat file is not a proper csv file because it does not have a proper tabular shape:\n\nThe header line indicates that we have 4 columns. When looking at the data, one can assume that there is one comma too many, as the date and time of day are stored in one column.\nFurther, the data rows contain at most one comma, i.e. at most two columns. This does not match the header. Therefore, we can assume that some values are missing.\n\nThe header line contains the word “sensor” twice. Thus, the column names are not unique.\nThe time column is horrible:\n\nThe date is given in an ambiguous format: you do not know whether it is 03 July 2024 or 07 March 2024 or 24 July 2003 - or whether the year is 1924 or even 24 AD. You should use the international format ISO 8601, here “2024-03-07” for 7 March 2024.\nTime is given according to the 12h clock with “AM” and “PM”. Hint: Never use “AM” or “PM” in a scientific context. Always use the 24h clock!\nThe hour is noted without a leading zero.\nBetween hour and minute, a dot is used. 
It would be better to use a colon, e.g. “00:00”.\n\nImportant information is missing (but might be given in a separate README file or Codebook), e.g.:\n\nTime zone?\nWhat is the file about?\nWhy are values missing?\nUnit?\n\n\nAnd here is a better version: File Temp_Rain_202407.csv:\n# Averaged temperature and precipitation of Ex_Emplum station\n#\n# File created on 2024-04-22 by Schlaubi Schlumpf.\n# This file contains temperature and precipitation measured at the fictitious weather station 'Ex_Emplum' at 55.432 degrees North, 55.678 degrees East.\n# Raw data have been averaged over 1 hour.\n# NA indicates missing values due to measurement interruption or instrument malfunction.\n# Column description:\n# - Time: Start time of the 1-hour interval, given as UTC, in ISO 8601 format 'YYYY-MM-DDThh:mm'.\n# - Temp: Temperature at 2 m above ground level, averaged over the 1-hour interval, in degrees Celsius. The error of the given value is expected to be below 0.3 degrees Celsius.\n# - Rain: Precipitation height accumulated within the 1-hour interval, in mm. The error of the given value is expected to be below 0.5 mm.\n#\nTime,Temp,Rain\n2024-03-07T00:00,17.3,0\n2024-03-07T01:00,16.9,0\n2024-03-07T02:00,16.7,0\n2024-03-07T03:00,16.4,0\n2024-03-07T04:00,16.2,0\n2024-03-07T05:00,15.9,0\n2024-03-07T06:00,NA,NA\n2024-03-07T07:00,NA,NA\n2024-03-07T08:00,NA,NA\n2024-03-07T09:00,16.5,0\n2024-03-07T10:00,17.0,7.2\n2024-03-07T11:00,17.6,4.6\n2024-03-07T12:00,18.0,0\n2024-03-07T13:00,18.5,0\n\n\n\n\n\n\n\n\n\nSolution: Task 1.4, Group 1\n\n\n\n\n\n\n\nHow is the file encoded (e.g. ASCII, UTF-8)?\nNumbers: What about their precision (enough or too much)?\nSpecial numbers: Do special numbers like “NA”, ““,”N/A”, “999”, “0” occur? Is their meaning documented?\nTime: Which format is used for the date and time of day? 
Which time zone is used?\nTables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?\nIs the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?\n\n\n\nASCII\nHas many digits, e.g. 986.223944276841.\nNo information about missing values found in README file. But file amb_hourly_qc_wc4.4_cal6.0_2017_03_core-params.csv contains NA.\nTime conforms to ISO 8601, except that a space is given between date and time of day, e.g. 2017-03-23 09:30:00. The README file mentions: “All times given in GMT”.\nComma as column separator, whitespace only between date and time of day, no missing columns found.\nNot self-explaining but mentioned in README file, also the units.\n\n\n\n\n\n\n\n\n\n\nSolution: Task 1.4, Group 2\n\n\n\n\n\n\n\nHow is the file encoded (e.g. ASCII, UTF-8)?\nNumbers: What about their precision (enough or too much)?\nSpecial numbers: Do special numbers like “NA”, “”, “N/A”, “999”, “0” occur? Is their meaning documented?\nTime: Which format is used for the date and time of day? Which time zone is used?\nTables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?\nIs the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?\n\n\n\nUTF-8, except for the tar.gz. The files inside those tar.gz are even ASCII files.\nProbably more digits than needed, e.g. -13.333333333333336. Considering the file size, shortening them could be worthwhile.\nNo information about missing values found in README file. But NA found in several files.\nTime: There seems to be no time column.\nTables: Comma as column separator, no missing columns or whitespaces found.\nNot self-explaining but columns mentioned in README file.\n\n\n\n\n\n\n\n\n\n\nSolution: Task 1.4, Group 3\n\n\n\n\n\n\n\nHow is the file encoded (e.g. 
ASCII, UTF-8)?\nNumbers: What about their precision (enough or too much)?\nSpecial numbers: Do special numbers like “NA”, “”, “N/A”, “999”, “0” occur? Is their meaning documented?\nTime: Which format is used for the date and time of day? Which time zone is used?\nTables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?\nIs the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?\n\n\nFollowing notes relate to the content of OSF Storage.\n\nEncoding: ASCII files (except for the nii.gz files)\nNumbers: e.g. 0.878519 - looks reasonable\nNo information about missing values found in README file. But NA found in several files.\nTime: There seems to be no time column.\nTables: Comma as column separator, no missing columns or whitespaces found.\nContent of the table is explained in JSON file (Codebook).\n\n\n\n\n\n\n\n\n\n\nSolution: Task 1.4, Group 4\n\n\n\n\n\n\n\nHow is the file encoded (e.g. ASCII, UTF-8)?\nNumbers: What about their precision (enough or too much)?\nSpecial numbers: Do special numbers like “NA”, “”, “N/A”, “999”, “0” occur? Is their meaning documented?\nTime: Which format is used for the date and time of day? Which time zone is used?\nTables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?\nIs the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?\n\n\n\nData tables are ASCII files.\nNumbers: e.g. 73.12958 - looks reasonable\nSpecial numbers or contents: Some columns contain parentheses - what is their meaning?\nTime: Time column with seconds(?) 
since start time?\nTables: Semicolon as column separator, also semicolon after last column.\nThe README file says “Variable names should be quite descriptive, but please get in touch in case anything is unclear”, but not all columns are so clear to understand.", + "text": "File formats\nA file format has to be chosen when storing information in a file. It builds the backbone of your data and is usually specified by the file extension (e.g. .txt). To keep your data interoperable, the format needs a clear structure. This makes your data easy to read with many software products (e.g., out-of-the-box solutions or by writing a small script). Clear documentation of the file format shall be publicly available. Considering all these aspects, the chance is high that the file can be read in future, making it suitable for long-term preservation - which is one of our main goals when managing data. Therefore, open file formats are recommended, while proprietary formats should be avoided.\nIdeally, when choosing a suitable format, you’ll consider the following properties:\n\nReadable by humans with a simple editor\nReadable with many programs\nEasy to understand, low complexity\nSmall (storage space)\nQuick to read (performance)\n\nHowever, usually compromises have to be made. For example, binary files are generally more performant than csv files and thus more suitable during the active research process. At the same time, csv is a well-established format for long-term preservation and is easier for humans to read.\n\n\n\n\n\n\nTask 1.3: (~ 10 minutes)\n\n\n\n\nAre the files stored in an open or a proprietary format?\nIs the file format used “future-proof”, e.g., suitable for long-term archiving?\nHow easy is it to open the file (regarding available programs and file size)?\nHow complex are the files? What is their internal structure?\nWhat about performance and file size?\nHow easy is it to understand the file as humans?\nAre they machine-readable and standardized? 
How easy is it to write a script to read the files?\nWhich alternative formats exist?\n\n\n\n\n\n\n\n\n\nAvoid proprietary formats.\n\n\n\nOften, proprietary formats have intentionally no proper documentation as the company behind the system wants to keep their business information behind closed doors. The companies sometimes even use technical protection mechanisms, making the file format readable only by commercial software. This reduces the interoperability and reusability of the files and, in the worst case, makes them unreadable in the long term. (Imagine the company that provided the software and file format no longer exists.) Furthermore, the files might contain hidden (potentially sensitive) information. Thus, such formats should be avoided.\n\n\n\n\n\n\n\n\nExamples of recommended formats\n\n\n\nIn the following list, you’ll find some formats which are widely used, well-documented and readable with several programs.\n\nFor documentation:\n\nPlain text (.txt)\nHTML, XHTML, Markdown\nPDF (PDF/A-1)\nmaybe: Rich Text Format (.rtf), Open Document Text (.odt), docx, …\n\nTabular data:\n\nComma-separated values (.csv)\nTab-delimited (.tab)\nmaybe: Open Document Spreadsheet (.ods), xlsx, …\n\nNested data:\n\nJSON\nXML\n\nFurther formats:\n\nNetCDF, HDF5, …\npng, jpg, …\n\n\nNotes:\n\nPDF: PDF has been developed by Adobe Inc. and thus originally had been a proprietary format, and several versions exist. Nevertheless, the format is widely used today. For archival purposes, a PDF/A version is the best choice. PDF is best suited for fixed documentation. However, editing PDF files or extracting data from them takes a lot of work.\nSpreadsheet files: Spreadsheets may look nice, particularly when formatted in a colourful way. But for the machine-readability, this can cause problems. In particular, we do not recommend that you present relevant information just by formatting content differently. 
You can take this as a rule of thumb: Spreadsheet files like .xlsx or .ods are not well machine-readable.\n\n\n\n\n\n\n\n\n\nExcursion: Premium format ASCII\n\n\n\n\n\nA gold standard for storing digital information is an ASCII file. In an ASCII file, each byte represents one visible character (except for the white spaces and control characters like tab stop and linebreaks).\nTherefore, ASCII files can be read or opened by any text editor or data-processing software, even with programs like Excel, Word, Wordpad or web browsers (only possibly limited regarding the file size).\nCharacters beyond ASCII:\nAn ASCII file can only contain the following visible characters: !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Otherwise, it is not an ASCII file.\nFor some years, the Unicode-based file format “UTF-8” has been available, which can represent many characters beyond the ASCII characters, like “ü”, “€”, and even some smilies ☺. Nowadays, UTF-8 is supported by many editors and browsers. The good thing about UTF-8 is that as long as a UTF-8 file contains only ASCII characters, the UTF-8 file is automatically an ASCII file. In other words, an ASCII file is a super-interoperable UTF-8 file.\n\n\n\n\n\n\n\n\n\nSolution: Example 1\n\n\n\n\n\n\n\nAre the files stored in an open or a proprietary format?\nIs the file format used “future-proof”, e.g., suitable for long-term archiving?\nHow easy is it to open the file (regarding available programs and file size)?\nHow complex are the files? What is their internal structure?\nWhat about performance and file size?\nHow easy is it to understand the file as humans?\nAre they machine-readable and standardized? How easy is it to write a script to read the files?\nWhich alternative formats exist?\n\n\n\nFiles are ASCII files, thus open.\nYes, ASCII is suitable for long-term archiving.\nEasy to open, e.g. 
with text editor.\nFiles have tabular shape.\nOK, file sizes are below 1 MB.\nShape: easy to understand, meaning of the columns given in README file.\nMost data analysis programs have import functions for csv. The quotes in the first column might be cumbersome for some import routines.\nTab-separated files, spreadsheet files, etc.\n\n\n\n\n\n\n\n\n\n\nSolution: Example 2\n\n\n\n\n\n\n\nAre the files stored in an open or a proprietary format?\nIs the file format used “future-proof”, e.g., suitable for long-term archiving?\nHow easy is it to open the file (regarding available programs and file size)?\nHow complex are the files? What is their internal structure?\nWhat about performance and file size?\nHow easy is it to understand the file as humans?\nAre they machine-readable and standardized? How easy is it to write a script to read the files?\nWhich alternative formats exist?\n\n\n\nThe small files are ASCII or UTF-8 files, thus open. The tar.gz files are compressed TAR-files, thus also in an open format.\nYes, ASCII is definitely suitable for long-term archiving. Also tar.gz files are widely used and can thus be considered suitable for long-term archiving.\nThe tar.gz files need specific software for extraction, which is freely available, but maybe not installed everywhere, and not all people are familiar with it. Thus it is commendable that the extraction is described in the README file. However, the file size of several GB can be problematic for users having a slow internet connection. And unpacked, the largest file is more than 26 GB, more than the RAM size of many computers.\nThe data files (inside the tar.gz) are not complex, just tables.\nDue to compression, the file size is reduced for storage and download. However, the tables contain many digits, probably more than needed. Reducing them would decrease file size. 
Binary files instead of ASCII files would need less time for loading.\nShape: easy to understand, meaning of column see README file.\nMost data analysis programs have import functions for csv.\nBinary files like HDF, which could enhance performance.\n\n\n\n\n\n\n\n\n\n\nSolution: Example 3\n\n\n\n\n\n\n\nAre the files stored in an open or a proprietary format?\nIs the file format used “future-proof”, e.g., suitable for long-term archiving?\nHow easy is it to open the file (regarding available programs and file size)?\nHow complex are the files? What is their internal structure?\nWhat about performance and file size?\nHow easy is it to understand the file as humans?\nAre they machine-readable and standardized? How easy is it to write a script to read the files?\nWhich alternative formats exist?\n\n\nFollowing notes relate to the content of “OSF Storage”.\n\nMost files are in an open format: ASCII tables, JSON files, R scripts. But what are “nii.gz” files in folder “results” - maybe zipped NIfTI files?\nYes for ASCII tables and JSON files; maybe yes for nii.gz files.\nASCII tables and JSON files: easy to open with every text editor, special software or libraries needed for nii.gz.\nFiles in folder data are tables (csv) or Codebooks (in JSON format) describing those.\nOK, because the files are not very large.\nASCII tables and JSON files are easy to understand by humans; nii.gz needs suitable software.\nMost data analysis programs have import functions for csv, also JSON import functions are available for several programs.\nFor csv-tables: Tab-separated files, spreadsheet files, etc; for JSON: XML\n\n\n\n\n\n\n\n\n\n\nSolution: Example 4\n\n\n\n\n\n\n\nAre the files stored in an open or a proprietary format?\nIs the file format used “future-proof”, e.g., suitable for long-term archiving?\nHow easy is it to open the file (regarding available programs and file size)?\nHow complex are the files? 
What is their internal structure?\nWhat about performance and file size?\nHow easy is it to understand the file as humans?\nAre they machine-readable and standardized? How easy is it to write a script to read the files?\nWhich alternative formats exist?\n\n\n\nFiles are stored as ASCII tables or plain text files, which are open formats.\nYes, suitable for long-term archiving.\nEasy, readable with a text editor.\nData files are ASCII tables.\nDue to compression, the file size is reduced for storage and download. Binary files instead of ASCII files would need less time for loading.\nThe format is easy for humans to understand, but the columns are not explicitly described.\nMost data analysis programs have import functions for semicolon-separated tables.\nBinary files like HDF could be used (cf. the note on performance above).\n\n\n\n\n\nSpecial file types: tabular text file (optional)\nPlease note that the task in this section is optional. You can go through this section if you still have some time left during the workshop or read it afterwards.\nTabular text files store data in a structured format, where each row represents a record and each column represents a field, with data separated by a designated column separator. Even after deciding to store tabular data in text files (i.e. files which can be opened in any editor), there are various ways and conventions to choose from:\n\nColumn separator: typically tab or comma, sometimes space or semicolon\nNumeric values: handling of missing values (e.g. “NA”, “”, etc.)\nRepresentation of timestamps, e.g. “2024-08-01T08:59”\nHeader lines with meta information?\nEncoding: ASCII or UTF-8 is recommended\n\n\n\n\n\n\n\nTask 1.4: (~ 5 minutes)\n\n\n\n\nHow is the file encoded (e.g. ASCII, UTF-8)?\nNumbers: What about their precision (enough or too much)?\nSpecial numbers: Do special numbers like “NA”, “”, “N/A”, “999”, “0” occur? Is their meaning documented?\nTime: Which format is used for the date and time of day?
Which time zone is used?\nTables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?\nIs the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?\n\n\n\n\n\n\n\n\n\nExample\n\n\n\n\n\nFirst, you will find an example of a very bad file, followed by an improved version.\nFile Measured last month.txt:\ndate, time,sensor,sensor\n03/07/24 12.00 AM,17.3\n03/07/24 1.00 AM,16.9\n03/07/24 2.00 AM,16.7\n03/07/24 3.00 AM,16.4\n03/07/24 4.00 AM,16.2\n03/07/24 5.00 AM,15.9\n03/07/24 6.00 AM\n03/07/24 7.00 AM\n03/07/24 8.00 AM\n03/07/24 9.00 AM,16.5\n03/07/24 10.00 AM,17.0,7.2\n03/07/24 11.00 AM,17.6,4.6\n03/07/24 12.00 PM,18.0\n03/07/24 1.00 PM,18.5\nWe gathered some comments on that file:\n\nFirst, we notice that the file name is bad. It contains spaces, and “last month” is not a meaningful name (which month is meant?).\nThat file is not a proper csv file because it does not have a proper tabular shape:\n\nThe header line indicates that we have 4 columns. Looking at the data, one can assume that the header contains one comma too many, as the date and time of day are stored in one column.\nFurther, the data rows contain at most two commas, i.e. at most three columns, and some rows contain none at all. This does not match the header. Therefore, we can assume that some values are missing.\n\nThe header line contains the word “sensor” twice. Thus, the column names are not unique.\nThe time column is horrible:\n\nThe date is given in an ambiguous format: you cannot tell whether it means 3 July 2024, 7 March 2024, 24 July 2003, 1924, or even the year 24 AD. You should use the international format ISO 8601, here “2024-03-07” for 7 March 2024.\nTime is given according to the 12-hour clock with “AM” and “PM”. Hint: Never use “AM” or “PM” in a scientific context. Always use the 24-hour clock!\nThe hour is noted without a leading zero.\nBetween hour and minute, a dot is used.
It would be better to use a colon, e.g. “00:00”.\n\nImportant information is missing (but might be given in a separate README file or Codebook), e.g.:\n\nTime zone?\nWhat is the file about?\nWhy are values missing?\nUnit?\n\n\nAnd here is a better version: File Temp_Rain_202407.csv:\n# Averaged temperature and precipitation of Ex_Emplum station\n#\n# File created on 2024-04-22 by Schlaubi Schlumpf.\n# This file contains temperature and precipitation measured at the fictitious weather station 'Ex_Emplum' at 55.432 degrees North, 55.678 degrees East.\n# Raw data have been averaged over 1 hour.\n# NA indicates missing values due to measurement interruption or instrument malfunction.\n# Column description:\n# - Time: Start time of the 1-hour interval, given as UTC, in ISO 8601 format 'YYYY-MM-DDThh:mm'.\n# - Temp: Temperature at 2 m above ground level, averaged over the 1-hour interval, in degrees Celsius. The error of the given value is expected to be below 0.3 degrees Celsius.\n# - Rain: Precipitation height accumulated within the 1-hour interval, in mm. The error of the given value is expected to be below 0.5 mm.\n#\nTime,Temp,Rain\n2024-03-07T00:00,17.3,0\n2024-03-07T01:00,16.9,0\n2024-03-07T02:00,16.7,0\n2024-03-07T03:00,16.4,0\n2024-03-07T04:00,16.2,0\n2024-03-07T05:00,15.9,0\n2024-03-07T06:00,NA,NA\n2024-03-07T07:00,NA,NA\n2024-03-07T08:00,NA,NA\n2024-03-07T09:00,16.5,0\n2024-03-07T10:00,17.0,7.2\n2024-03-07T11:00,17.6,4.6\n2024-03-07T12:00,18.0,0\n2024-03-07T13:00,18.5,0\n\n\n\n\n\n\n\n\n\nSolution: Example 1\n\n\n\n\n\n\n\nHow is the file encoded (e.g. ASCII, UTF-8)?\nNumbers: What about their precision (enough or too much)?\nSpecial numbers: Do special numbers like “NA”, “”, “N/A”, “999”, “0” occur? Is their meaning documented?\nTime: Which format is used for the date and time of day? Which time zone is used?\nTables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?\nIs the content of the table self-explaining?
Is there a column description? What units were used? Is there detailed information elsewhere?\n\n\n\nASCII\nMany digits, e.g. 986.223944276841 - probably more than needed.\nNo information about missing values found in README file. But the file amb_hourly_qc_wc4.4_cal6.0_2017_03_core-params.csv contains NA.\nThe time format conforms to ISO 8601, except that a space is given between date and time of day, e.g. 2017-03-23 09:30:00. The README file mentions: “All times given in GMT”.\nComma as column separator, whitespace only between date and time of day, no missing columns found.\nNot self-explaining, but explained in the README file, including the units.\n\n\n\n\n\n\n\n\n\n\nSolution: Example 2\n\n\n\n\n\n\n\nHow is the file encoded (e.g. ASCII, UTF-8)?\nNumbers: What about their precision (enough or too much)?\nSpecial numbers: Do special numbers like “NA”, “”, “N/A”, “999”, “0” occur? Is their meaning documented?\nTime: Which format is used for the date and time of day? Which time zone is used?\nTables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?\nIs the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?\n\n\n\nUTF-8, except for the tar.gz files. The files inside those tar.gz archives are even ASCII files.\nProbably more digits than needed, e.g. -13.333333333333336. Considering the file size, shortening them could be worthwhile.\nNo information about missing values found in README file. But NA found in several files.\nTime: There seems to be no time column.\nTables: Comma as column separator, no missing columns or whitespaces found.\nNot self-explaining, but the columns are described in the README file.\n\n\n\n\n\n\n\n\n\n\nSolution: Example 3\n\n\n\n\n\n\n\nHow is the file encoded (e.g. ASCII, UTF-8)?\nNumbers: What about their precision (enough or too much)?\nSpecial numbers: Do special numbers like “NA”, “”, “N/A”, “999”, “0” occur?
Is their meaning documented?\nTime: Which format is used for the date and time of day? Which time zone is used?\nTables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?\nIs the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?\n\n\nThe following notes relate to the content of OSF Storage.\n\nEncoding: ASCII files (except for the nii.gz files)\nNumbers: e.g. 0.878519 - looks reasonable\nNo information about missing values found in README file. But NA found in several files.\nTime: There seems to be no time column.\nTables: Comma as column separator, no missing columns or whitespaces found.\nContent of the table is explained in a JSON file (Codebook).\n\n\n\n\n\n\n\n\n\n\nSolution: Example 4\n\n\n\n\n\n\n\nHow is the file encoded (e.g. ASCII, UTF-8)?\nNumbers: What about their precision (enough or too much)?\nSpecial numbers: Do special numbers like “NA”, “”, “N/A”, “999”, “0” occur? Is their meaning documented?\nTime: Which format is used for the date and time of day? Which time zone is used?\nTables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?\nIs the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?\n\n\n\nData tables are ASCII files.\nNumbers: e.g. 73.12958 - looks reasonable\nSpecial numbers or contents: Some columns contain parentheses - what is their meaning?\nTime: The time column appears to contain seconds(?) since the start time, but this is not documented.\nTables: Semicolon as column separator, with an additional semicolon after the last column.\nThe README file says “Variable names should be quite descriptive, but please get in touch in case anything is unclear”, but not all column names are easy to understand.",
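To illustrate why a cleaned-up file like the improved Temp_Rain_202407.csv from the example above is easy to process by machines, here is a minimal sketch using only the Python standard library. It works on a shortened inline copy of that (hypothetical) file: it skips the “#” metadata lines, parses the ISO 8601 timestamps, and maps the documented “NA” marker to missing values.

```python
import csv
from datetime import datetime

# Shortened inline copy of the improved example file Temp_Rain_202407.csv.
DATA = """\
# Averaged temperature and precipitation of Ex_Emplum station
# NA indicates missing values.
Time,Temp,Rain
2024-03-07T00:00,17.3,0
2024-03-07T06:00,NA,NA
2024-03-07T10:00,17.0,7.2
"""

def read_table(text):
    """Parse the table: drop '#' comment lines, read ISO 8601 timestamps,
    and map the documented 'NA' marker to None."""
    lines = [ln for ln in text.splitlines() if not ln.startswith("#")]
    rows = []
    for rec in csv.DictReader(lines):
        rows.append({
            # ISO 8601 'YYYY-MM-DDThh:mm' parses directly with fromisoformat().
            "Time": datetime.fromisoformat(rec["Time"]),
            "Temp": None if rec["Temp"] == "NA" else float(rec["Temp"]),
            "Rain": None if rec["Rain"] == "NA" else float(rec["Rain"]),
        })
    return rows

rows = read_table(DATA)
print(rows[0])  # first record: 17.3 degrees at 00:00
```

With pandas, the same file could be read in a single call, e.g. `pd.read_csv(path, comment="#", parse_dates=["Time"], na_values=["NA"])`. Contrast this with the bad file Measured last month.txt, where the ambiguous dates, 12-hour times, and unmarked missing values would each require guesswork before any parsing could succeed.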