Merge pull request #18 from t4d-gmbh/4-versioning-vs-reproducibility-1
complete rewrite of version vs repro section
Showing 19 changed files with 300 additions and 36 deletions.
@@ -0,0 +1,23 @@
## How <i class="fab fa-git"></i> Can Enhance Reproducibility
{% if slide %}
- **Version Control**: Track and version your analysis scripts with <i class="fab fa-git"></i>.

- **Dependency Management**:
  - Add files like `renv.lock` or `pyproject.toml` to pin dependencies.

- **Documentation**:
  - Use a `README.md` to outline installation, running instructions, and workflow insights.

- **Parameterization**:
  - Include a parameter file (e.g., `.yaml` or `.json`) to document settings and load them in your script.
{% else %}
By using <i class="fab fa-git"></i>, you can track and version the scripts associated with your computational analysis.
You can add a simple text file (such as `renv.lock` or `pyproject.toml`) that pins your direct dependencies to the repository, so that it is tracked alongside the rest of your code.
This establishes a solid foundation for managing **dependencies** directly with <i class="fab fa-git"></i>.

In a similar fashion, a standard `README.md` file lets you document how to install the declared dependencies and how to run the analysis script, and explain how the **workflow** of the analysis is structured and should be executed.

Moreover, you can add a parameter file (e.g., a `.yaml` or `.json` file) to the repository that lists the parameters used in the analysis.
Ideally, you should modify the analysis script to automatically load all necessary parameters from this file.
This approach allows you to document and track the **configuration settings** employed in your analysis, including random number generator seeds, hyperparameters, and more.
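
As a minimal sketch of loading parameters from a tracked file, the snippet below parses JSON text into a frozen dataclass. The field names (`seed`, `learning_rate`, `n_iterations`), their values, and the file name `params.json` are illustrative assumptions and not part of any particular project.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisConfig:
    """Parameters of the analysis (field names are illustrative assumptions)."""

    seed: int
    learning_rate: float
    n_iterations: int


def load_config(text: str) -> AnalysisConfig:
    """Parse JSON text into a config object.

    Unknown or missing keys raise a TypeError, so the parameter file
    and the script cannot silently drift apart.
    """
    return AnalysisConfig(**json.loads(text))


# In a real project this text would live in a tracked file such as
# `params.json` (hypothetical name) and be read with Path.read_text().
PARAMS_JSON = '{"seed": 42, "learning_rate": 0.01, "n_iterations": 100}'

config = load_config(PARAMS_JSON)
print(config)
```

Keeping the parameters in one immutable object makes it harder for a script to change a setting halfway through a run without the change showing up in the tracked file.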
{% endif %}
19 changes: 19 additions & 0 deletions
source/content/versioning_vs_reproducibility/git_helps_even_more.md
@@ -0,0 +1,19 @@
## ... and More!
{% if slide %}
- **Independent Development**:
  - Combining scripts and datasets in a single repository may not be optimal, as they can evolve separately, particularly datasets that may be used in several studies.

- **Utilizing Git Submodules**:
  - Leverage `git submodule` to connect multiple repositories.

- **Version Tracking**:
  - With <i class="fab fa-git"></i> submodules, you can clearly specify the version of the data used for each version of your analysis.
{% else %}
To track both your analysis scripts and the analyzed data (using <i class="fab fa-git"></i> LFS), it's essential to establish a connection between these two repositories.
While it's possible to combine scripts and data into a single repository, this approach may not be ideal since the dataset and the analysis scripts can evolve independently.
This is especially true for datasets, which may be used in several studies.

Fortunately, <i class="fab fa-git"></i> offers `git submodule`, a feature designed specifically for linking <i class="fab fa-git"></i> repositories.

By using <i class="fab fa-git"></i> submodules, you can connect multiple repositories and specify the exact version of the data used in each version of your analysis.
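
As a minimal sketch of how the pinned data version could be recorded alongside a result, the snippet below parses the text printed by `git submodule status`, which lists one line per submodule with a status flag, the checked-out commit, and the path. The function name, the dataclass, and the sample output (including the `data` path and the commit hash) are hypothetical, and paths containing spaces are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubmoduleState:
    path: str
    commit: str
    # " " in sync, "-" not initialized, "+" differs from the recorded
    # commit, "U" merge conflict
    flag: str


def parse_submodule_status(output: str) -> list[SubmoduleState]:
    """Parse the text printed by `git submodule status`."""
    states = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # The first character is the status flag; do not strip the line first.
        flag, rest = line[0], line[1:]
        commit, path = rest.split()[:2]
        states.append(SubmoduleState(path=path, commit=commit, flag=flag))
    return states


# Hypothetical output for a repository with a single data submodule.
SAMPLE_OUTPUT = " 3f2a9c1d5e7b4a6f8091a2b3c4d5e6f708192a3b data (v1.2.0)\n"

for state in parse_submodule_status(SAMPLE_OUTPUT):
    print(f"{state.path} is pinned to commit {state.commit}")
```

In a real repository the input could come from `subprocess.run(["git", "submodule", "status"], capture_output=True, text=True, check=True).stdout`.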
{% endif %}
15 changes: 15 additions & 0 deletions
source/content/versioning_vs_reproducibility/git_helps_more.md
@@ -0,0 +1,15 @@
## <i class="fab fa-git"></i> Can Do More!
{% if slide %}
- **Handling Large Files**:
  - <i class="fab fa-git"></i> is not ideal for tracking large binary files.

- **Git LFS Extension**:
  - Use <i class="fab fa-git"></i> LFS to efficiently track and version larger datasets.

- **Data Availability**:
  - Contribute to the **availability of analyzed data** by publishing the analyzed data along with the exact dataset version used in your analysis.
{% else %}
Tracking large files, and binary files in particular, is not <i class="fab fa-git"></i>'s strong suit, but an extension called <i class="fab fa-git"></i> LFS makes up for this shortcoming.

With <i class="fab fa-git"></i> LFS you can efficiently track and version even large datasets and thus contribute to the **availability of the analyzed data** by publishing the data along with the exact version of the dataset that you used for your analysis.
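
To illustrate why this scales, Git LFS keeps the large file outside the regular history and commits a small text pointer in its place. The sketch below builds such a pointer following the layout described in the Git LFS pointer specification (a version line, a SHA-256 object ID, and the size in bytes). The helper name `lfs_pointer` and the sample bytes are illustrative assumptions; real pointers are written by the `git lfs` client, not by hand.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Return the text of a Git LFS style pointer for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Stand-in for the bytes of a large dataset file.
dataset_bytes = b"abc"
print(lfs_pointer(dataset_bytes), end="")
```

Because the object ID is a hash of the content, the pointer committed with a given version of the analysis identifies exactly one version of the dataset.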
{% endif %}
@@ -0,0 +1,23 @@
### <i class="fab fa-git"></i>

- **Data Availability**
- **Workflow Documentation**
- **Dependencies**
- **Transitive Dependencies**
- **Execution Environment**
- **Configuration Settings**

#### 1. **Data Versioning**
Git is optimized for versioning code, which typically consists of small text files. However, scientific projects often involve large datasets, which Git cannot handle efficiently. Large files or datasets can cause performance issues and aren't inherently tracked in Git.

#### 2. **Environment Consistency**
Even if you version your code, it's crucial to ensure that the software environment (e.g., dependencies, libraries, and versions of tools) used to run that code is the same when the project is re-run. Git does not inherently track environment dependencies, and subtle differences in the software stack can lead to different results.

#### 3. **Execution Order and Workflow**
Git manages code changes but doesn’t track the workflow or the order in which scripts are run. A clear workflow is essential for reproducibility to ensure that others know how to replicate the results step by step.

#### 4. **Data and Code Separation**
Data is often stored separately from the code, either due to size or organizational reasons. While Git can track submodules (other repositories), it doesn't automatically handle the separation and re-aggregation of data repositories and analysis code. This separation can hinder reproducibility if not properly managed.

#### 5. **Automation**
Reproducibility also involves automating the execution of analysis, including triggering code execution when data or code changes. Git does not have native automation capabilities for running workflows.
@@ -1,16 +1,11 @@
### Bringing It All Together: The Path to Reproducibility
### Bringing It All Together: Enhancing Reproducibility

To go from basic version control to full reproducibility, you need:

1. **Data versioning**: Use Git LFS to track large datasets.
2. **Environment tracking**: Define environments and dependencies in CI/CD scripts.
3. **Combining code and data**: Use Git submodules to link data repositories with analysis code.
4. **Workflow automation**: Use GitHub Actions or GitLab CI/CD to automate the execution of workflows.
5. **Documentation**: Provide clear documentation of how to set up, execute, and reproduce the entire analysis.

:::{admonition} 6. Traceability
:class: note
Use issues, merge/pull requests, and news feeds to track the development and reasoning behind code changes.
:::
By leveraging these tools, Git-based repositories, together with remote services like GitHub and GitLab, can support reproducible workflows for scientific analysis, data science projects, and complex research collaborations.

1. **Documentation**: Include documentation in the repository (`README.md` or `docs/`).
1. **Data Availability**: Publish data! Use <i class="fab fa-git"></i> LFS for versioning.
1. **Workflow Documentation**: Use <i class="fab fa-git"></i> submodules and automation scripts to declare the full analysis workflow.
1. **Dependencies**: Specify direct dependencies.
1. **Transitive Dependencies**: Define isolated execution environments.
1. **Environment tracking**: Use isolation tools like Docker or ✨NixOS✨.
1. **Configuration Settings**: Declare and load configurations.
36 changes: 36 additions & 0 deletions
source/content/versioning_vs_reproducibility/remote_services_help.md
@@ -0,0 +1,36 @@
## How Remote Services Can Enhance Reproducibility
{% if slide %}
:::{tip}
:class: margin
Check out [git-and-its-remotes](https://t4d-gmbh.github.io/git-and-its-remotes) and [ci-cd-workflows](https://t4d-gmbh.github.io/ci-cd-workflows) to learn about the benefits of remote services like <i class="fab fa-github"></i> **GitHub** and <i class="fab fa-gitlab"></i> **GitLab**.
:::
- **Improved Accessibility**:
  - Remote services enhance accessibility, allowing researchers to easily share their work with the scientific community.

- **Collaboration Tools**:
  - Features like Issues and Merge/Pull Requests significantly improve documentation.
  - Effective for formalizing [features and tracking their implementation](https://t4d-gmbh.github.io/git-and-its-remotes/content/project_management/index.html#feature-branch-approach-reloaded).

- **Automation**:
  - Enables automated processes triggered by various events within a Repository or Project.
  - Continuous Deployment (CD) is crucial for reproducibility in scientific analyses.

- **Automation Scripts**:
  - Can declare the environment in which a script runs and the corresponding data versions used.
  - Provide comprehensive documentation of analyses, including data, scripts, dependencies, and execution environments.
{% else %}
If you've explored [git-and-its-remotes](https://t4d-gmbh.github.io/git-and-its-remotes) and [ci-cd-workflows](https://t4d-gmbh.github.io/ci-cd-workflows), you may have some insights into how remote services like <i class="fab fa-github"></i> **GitHub** and <i class="fab fa-gitlab"></i> **GitLab** can improve the reproducibility of scientific analyses.

One of the most significant contributions of these platforms is improved accessibility.
Researchers can easily share their work, making it more available to others in the scientific community.

Collaboration tools such as Issues and Merge/Pull Requests play a significant role in enhancing documentation for an analysis.
These tools are particularly effective when used to formalize [features and track their implementation](https://t4d-gmbh.github.io/git-and-its-remotes/content/project_management/index.html#feature-branch-approach-reloaded).

Another important aspect of remote services is automation, which allows processes to be triggered automatically by various events within a Repository or Project.
Continuous Deployment (CD) is especially relevant to the reproducibility of scientific analyses:

Automation scripts can do more than just deploy a website; they can specify which scripts to run and the corresponding versions of the data used.

Since automation scripts define the conditions for their execution and are part of the repository, they can comprehensively document and declaratively specify an analysis, detailing everything from the data utilized to the analysis scripts, their dependencies, and the specific environment in which the scripts were executed.
{% endif %}
16 changes: 16 additions & 0 deletions
source/content/versioning_vs_reproducibility/reproduce/config_settings.md
@@ -0,0 +1,16 @@
{% if slide %}####{%else%}:::{card}{%endif%} Configuration Settings
Involves documenting parameters and settings that guide the analysis, which is essential for reproducing the same results.

:::{note}
This includes **randomness control**, i.e. the usage and specification of seeds whenever possible.
:::

**How?**

- Declare **all** configuration parameters **separately** from your analysis scripts!
- Use simple and readable formats (e.g. `.yaml` or `.json`).
- Track the configuration files in the same repository as your analysis scripts.
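
The randomness control mentioned in the note can be sketched as follows: a seed that would normally come from the tracked configuration file is passed explicitly to a local random number generator. The function name `simulate` and the seed value are illustrative assumptions.

```python
import random


def simulate(seed: int, n_samples: int = 5) -> list[float]:
    """Draw `n_samples` pseudo-random numbers from an explicitly seeded generator."""
    rng = random.Random(seed)  # local generator: no hidden global state
    return [rng.random() for _ in range(n_samples)]


SEED = 2024  # hypothetical value; in practice read from the tracked config file

run_a = simulate(SEED)
run_b = simulate(SEED)
run_c = simulate(SEED + 1)

print(run_a == run_b)  # same seed, same sequence
print(run_a == run_c)  # a different seed gives a different sequence
```

Using a dedicated generator object instead of the module-level functions keeps other code that touches `random` from shifting the sequence your analysis sees.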

{%if page %}:::{%endif%}
14 changes: 14 additions & 0 deletions
source/content/versioning_vs_reproducibility/reproduce/data_availability.md
@@ -0,0 +1,14 @@
{% if slide %}####{%else%}:::{card}{%endif%} Data Availability
Data needs to be accessible and its usage and preprocessing properly documented.

**How?**

- **Version Control**:
  - Use <i class="fab fa-git"></i> Git to track changes in your data.
  - Manage large datasets efficiently with <i class="fab fa-git"></i> Git Large File Storage.

- **Publish Data** _if possible_:
  - Use platforms like Zenodo for data sharing.
  - Share the DOI and links to increase visibility.

{%if page %}:::{%endif%}
10 changes: 10 additions & 0 deletions
source/content/versioning_vs_reproducibility/reproduce/dependencies.md
@@ -0,0 +1,10 @@
{% if slide %}####{%else%}:::{card}{%endif%} Dependencies
Refers to the libraries and frameworks used in the analysis.
Specifying exact versions helps control for changes that could affect results.

**How?**

- Include dependency declarations in your <i class="fab fa-git"></i> repository.
- **Pin** exact dependency versions rather than declaring only minimal requirements.
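
To make the difference between pinned and minimal requirements concrete, the sketch below flags requirement strings that do not fix a single version with `==`. The regular expression is a deliberate simplification of the Python requirement syntax, and the package names and versions in the sample list are purely illustrative.

```python
import re

# Simplified pattern: "<name>==<exact version>", with no wildcards or extra clauses.
PINNED = re.compile(r"^\s*[A-Za-z0-9][A-Za-z0-9._-]*\s*==\s*[0-9][^\s*,;]*\s*$")


def find_unpinned(requirements: list[str]) -> list[str]:
    """Return the requirement strings that do not pin one exact version."""
    return [req for req in requirements if not PINNED.match(req)]


# Illustrative requirement strings, not taken from any real project.
REQUIREMENTS = ["numpy==1.26.4", "pandas>=2.0", "scipy"]

print(find_unpinned(REQUIREMENTS))  # ['pandas>=2.0', 'scipy']
```

A check like this could run before an analysis starts, so that an unpinned dependency is reported instead of silently resolving to whatever version is current.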

{%if page %}:::{%endif%}
9 changes: 9 additions & 0 deletions
source/content/versioning_vs_reproducibility/reproduce/documentation.md
@@ -0,0 +1,9 @@
{% if slide %}####{%else%}:::{card}{%endif%} Documentation
Comprehensive documentation of the entire project, including the rationale behind decisions, methodologies used, and any challenges encountered, to facilitate understanding and reproducibility.

**How?**
- Use the `README.md` file as the central location for documenting the repository's content.
- Make use of Issues and Merge/Pull Requests to declare what you are doing (Issue) and how you are doing it (Merge/Pull Request).


{%if page %}:::{%endif%}
9 changes: 9 additions & 0 deletions
source/content/versioning_vs_reproducibility/reproduce/exec_env.md
@@ -0,0 +1,9 @@
{% if slide %}####{%else%}:::{card}{%endif%} Execution Environment
Covers the operating system, hardware, and any relevant settings that could influence the analysis, ensuring that the environment is replicable.

**How?**

- Report hardware and hardware configuration.
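
One way to produce such a report is to let the analysis collect basic facts about the machine it runs on. The sketch below uses only the Python standard library; the function name `environment_report` and the selection of fields are assumptions, and a real report may need more detail (for example GPU model or BLAS library).

```python
import json
import os
import platform


def environment_report() -> dict:
    """Collect basic facts about the operating system, hardware, and interpreter."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
    }


# Printing JSON makes the report easy to store next to the analysis results.
print(json.dumps(environment_report(), indent=2))
```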

{%if page %}:::{%endif%}
18 changes: 18 additions & 0 deletions
source/content/versioning_vs_reproducibility/reproduce/index.md
@@ -0,0 +1,18 @@
## Where <i class="fab fa-git"></i> and its remote services can help

{% if slide %}
<!-- BUILDING THE SLIDES -->
```{toctree}
:maxdepth: 2
./documentation
./data_availability
./config_settings
./dependencies
./transitive_dependencies
./workflow_doc
./exec_env
```
{% else %}
<!-- Slides are imported in the parent folder! -->
{% endif %}
9 changes: 9 additions & 0 deletions
source/content/versioning_vs_reproducibility/reproduce/missing.md
@@ -0,0 +1,9 @@
### <i class="fab fa-git"></i>

- **Data Availability**
- **Workflow Documentation**
- **Dependencies**
- **Transitive Dependencies**
- **Execution Environment**
- **Configuration Settings**
- **Precision Limitations**
11 changes: 11 additions & 0 deletions
source/content/versioning_vs_reproducibility/reproduce/transitive_dependencies.md
@@ -0,0 +1,11 @@
{% if slide %}####{%else%}:::{card}{%endif%} Transitive Dependencies
Addresses the indirect dependencies required by your primary libraries.
Managing these ensures that all necessary components are accounted for.

**How?**

- Utilize isolated, temporary environments to execute your analysis.
- Prefer declarative systems such as _NixOS_, or at the very least, use Docker.
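
Inside such an isolated environment, you can also record what was actually installed, which covers the indirect dependencies that a list of direct requirements leaves out. The sketch below uses the standard-library `importlib.metadata` module; the function name `installed_packages` is an assumption, and the snapshot only describes Python packages, not system libraries.

```python
from importlib import metadata


def installed_packages() -> list[str]:
    """Return a sorted `name==version` line for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if not name:
            continue  # skip distributions with incomplete metadata
        pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


# The output includes direct and transitive dependencies alike.
for line in installed_packages():
    print(line)
```

Storing this snapshot with the results documents the full set of package versions that a run used, even when only the direct dependencies are declared in the repository.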

{%if page %}:::{%endif%}
10 changes: 10 additions & 0 deletions
source/content/versioning_vs_reproducibility/reproduce/workflow_doc.md
@@ -0,0 +1,10 @@
{% if slide %}####{%else%}:::{card}{%endif%} Workflow Documentation
Clearly documenting the steps taken in the analysis to allow others to understand and replicate the process.

**How?**

- Document the execution workflow in an automation script.
- Clarify how the versions of the analysis scripts and of the dataset are linked to the execution of the analysis.
- Specify how the execution environment is built.
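
A very small automation script might look like the sketch below, which declares the workflow as an ordered list of named steps and runs them in that order. The step functions, their names, and the sample values are hypothetical stand-ins for real loading and analysis code.

```python
from typing import Callable


def load_data(state: dict) -> dict:
    # Stand-in for reading the versioned dataset (values are made up).
    return {**state, "values": [2.0, 4.0, 6.0]}


def compute_mean(state: dict) -> dict:
    values = state["values"]
    return {**state, "mean": sum(values) / len(values)}


# The workflow is declared as data: an ordered list of named steps.
WORKFLOW: list[tuple[str, Callable[[dict], dict]]] = [
    ("load_data", load_data),
    ("compute_mean", compute_mean),
]


def run_workflow(steps, state=None):
    """Run the steps in their declared order and return the final state plus a log."""
    state = dict(state or {})
    log = []
    for name, step in steps:
        state = step(state)
        log.append(name)
    return state, log


final_state, executed = run_workflow(WORKFLOW)
print(executed)
print(final_state["mean"])
```

Because the order of execution lives in one tracked list, the script itself serves as documentation of the workflow.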

{%if page %}:::{%endif%}
32 changes: 32 additions & 0 deletions
source/content/versioning_vs_reproducibility/reproducibility.md
@@ -0,0 +1,32 @@
## Reproducibility
{% if slide %}

- **Reproducibility**: Achieved when an analysis can be repeated using the same data, yielding the same results.
- **Replicability**: Occurs when an analysis with different data on the same study subject leads to the same conclusions.

> We focus on a strict interpretation of reproducibility: the same implementation of a method applied to the exact same data must produce the same result.
- This approach is common in computer science and can be supported by tools like <i class="fab fa-git"></i> and related remote services.
{% else %}

Before we begin, it is important to clarify the definition of the term "reproducibility."

In a scientific context, "reproducibility" and "replicability" are two terms that are often used interchangeably, but they can also refer to different concepts.

In this discussion, we adopt the widely accepted interpretation that reproducibility is achieved when an analysis can be repeated using the same data and yields the same results.

Replicability, on the other hand, refers to the situation where an analysis conducted with different data on the same study subject leads to the same conclusions.

:::{note}
We recommend reading [https://nap.nationalacademies.org/read/25303/chapter/6](https://nap.nationalacademies.org/read/25303/chapter/6) for an in-depth exploration of the subject.

_National Academies of Sciences, et al. Reproducibility and replicability in science. National Academies Press, 2019._
:::

Based on these definitions, our focus here is on reproducibility rather than replicability.
Furthermore, we adopt a stricter interpretation of reproducibility than one might typically expect.
While reproducibility could also imply that applying the same method to the same data leads to the same conclusion, we specifically mean that reproducibility is achieved when the same implementation of a method applied to the exact same data produces the same result.

This strict interpretation of reproducibility is more common in computer science.
We will demonstrate how <i class="fab fa-git"></i> and related remote services can be utilized to enhance the reproducibility of computational studies in this sense.
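
Under this strict interpretation, reproducibility can be checked mechanically: two runs of the same implementation on the same data should yield results with identical fingerprints. The sketch below hashes a JSON serialization of the result; the `analysis` function, the sample data, and the choice of SHA-256 over JSON are illustrative assumptions.

```python
import hashlib
import json


def fingerprint(result) -> str:
    """Return a SHA-256 digest of a JSON-serializable result."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def analysis(data: list[float]) -> dict:
    # Stand-in for a real analysis: count and mean of the data.
    return {"n": len(data), "mean": sum(data) / len(data)}


data = [1.0, 2.0, 3.5]  # made-up sample data

first_run = fingerprint(analysis(data))
second_run = fingerprint(analysis(data))

print(first_run == second_run)  # identical digests mean identical results
```

Comparing digests instead of full outputs makes it cheap to record a result's fingerprint alongside the commit that produced it.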
{% endif %}