Merge pull request #11 from t4d-gmbh/8-ci-cd-for-reproducibility
raw content - ci cd for reprod
j-i-l authored Oct 24, 2024
2 parents be522e1 + 073ca2b commit 04b359a
Showing 4 changed files with 340 additions and 3 deletions.
7 changes: 5 additions & 2 deletions source/content/ci_cd_for_repro/index.md
@@ -4,12 +4,15 @@
```{toctree}
:maxdepth: 2
./slide1
./run_an_analysis
./using_docker
```
{% else %}
<!-- BUILDING THE PAGES -->
<!-- build the page content here -->
```{include} ./slide1.md
```{include} ./run_an_analysis.md
```
```{include} ./using_docker.md
```
{% endif %}
225 changes: 225 additions & 0 deletions source/content/ci_cd_for_repro/run_an_analysis.md
@@ -0,0 +1,225 @@
{% raw %}
## Using GitLab and GitHub CI/CD for Scientific Analysis

Both **GitLab CI/CD** and **GitHub Actions** can run scientific analyses automatically, which helps with reproducibility and collaboration. Their pipelines suit complex, repetitive tasks and can be customized for a wide range of scientific applications.

---

### Why Use CI/CD for Scientific Analysis?

Scientific workflows often include processes that are ideal for automation, such as:

- **Data preprocessing**: Cleaning, normalizing, and structuring data.
- **Simulations**: Running computational models based on data or parameter sets.
- **Reproducibility**: Ensuring results can be reliably reproduced by others.
- **Collaboration**: Allowing collaborators to share and reuse workflows.

Both **GitLab** and **GitHub** pipelines help you:

- Automate repetitive tasks.
- Ensure experiments are performed in consistent environments.
- Track changes to both data and code for transparency.
- Collaborate and share results with ease.

---

### How GitLab CI/CD Can Be Used for Scientific Analysis

#### 1. Running Data Analysis Pipelines

In GitLab CI/CD, runners can execute data analysis scripts written in Python, R, or other languages.

**Example: Data Analysis Pipeline with Python**

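```yaml
stages:
  - data_preprocessing
  - analysis

preprocess_data:
  stage: data_preprocessing
  script:
    - python preprocess_data.py raw_data.csv cleaned_data.csv
  artifacts:
    paths:
      - cleaned_data.csv  # hand the cleaned file to later stages

run_analysis:
  stage: analysis
  script:
    - python analyze_data.py cleaned_data.csv results.csv
  artifacts:
    paths:
      - results.csv  # keep the results downloadable after the job
```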

#### 2. Running Simulations

You can set up GitLab pipelines to run simulations automatically whenever data or configurations change.

**Example: Running a Simulation in GitLab**

```yaml
stages:
  - simulation

run_simulation:
  stage: simulation
  script:
    - python run_simulation.py input_data.csv output_results.csv
```
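
To run the simulation only when its inputs actually change, a job can carry a `rules:changes` condition. The fragment below is a sketch that assumes the input file and the script sit at the repository root under the names used above; adjust the paths to your layout.

```yaml
run_simulation:
  stage: simulation
  script:
    - python run_simulation.py input_data.csv output_results.csv
  rules:
    # Add the job to the pipeline only when one of these files changed.
    - changes:
        - input_data.csv
        - run_simulation.py
```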

#### 3. Using Docker for Reproducibility

With GitLab CI/CD, you can run jobs inside Docker containers to ensure reproducibility and consistent environments for scientific analyses.

**Example: Running a Job in a Docker Container in GitLab**

```yaml
stages:
  - test

run_in_docker:
  stage: test
  image: python:3.9
  script:
    - pip install -r requirements.txt
    - python analyze_data.py cleaned_data.csv
```
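
A tag such as `python:3.9` still moves as patch releases appear, so one option is to run the job in an image you build and version yourself. The sketch below assumes an image named `analysis` with the tag `1.0.0` has already been pushed to the project's container registry and that the runner can pull it; both the name and the tag are hypothetical.

```yaml
run_in_project_image:
  stage: test
  # Hypothetical image name and tag; $CI_REGISTRY_IMAGE is the predefined
  # address of this project's container registry.
  image: $CI_REGISTRY_IMAGE/analysis:1.0.0
  script:
    - python analyze_data.py cleaned_data.csv
```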
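
#### 4. Scheduling Scientific Workflows

GitLab CI/CD can run recurring jobs, such as analyses or simulations at regular intervals. The schedule itself (for example, daily) is defined in the project's pipeline schedule settings, not in `.gitlab-ci.yml`; the configuration below only restricts the job to scheduled pipelines.

**Example: Scheduling a Daily Data Analysis Job in GitLab**

```yaml
stages:
  - analysis

run_daily_analysis:
  stage: analysis
  script:
    - python daily_analysis.py
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"  # run only in scheduled pipelines
```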
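
---

### How GitHub Actions Can Be Used for Scientific Analysis

#### 1. Running Data Analysis Pipelines

GitHub Actions can automate data analysis workflows, triggered by events such as pushes that add new data or change code. Each job runs on a fresh runner, so files produced by one job must be passed to the next as artifacts.

**Example: Automating a Data Analysis Workflow in GitHub**

```yaml
name: Data Analysis

on: [push]

jobs:
  preprocess:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Preprocess Data
        run: python preprocess_data.py raw_data.csv cleaned_data.csv
      - name: Upload cleaned data
        uses: actions/upload-artifact@v4
        with:
          name: cleaned-data
          path: cleaned_data.csv

  analyze:
    runs-on: ubuntu-latest
    needs: preprocess
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Download cleaned data
        uses: actions/download-artifact@v4
        with:
          name: cleaned-data
      - name: Run Analysis
        run: python analyze_data.py cleaned_data.csv results.csv
```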
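
To react to new data or code rather than to every push, the `push` trigger accepts a `paths` filter. The fragment below is a sketch; the `data/` directory is an assumed location for input files, not something the workflows above define.

```yaml
on:
  push:
    paths:
      - "data/**"  # hypothetical directory holding input data
      - "**.py"    # any Python file in the repository
```

With this trigger, a push that touches only documentation would not start the workflow.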

#### 2. Running Simulations

GitHub Actions can run simulations on GitHub-hosted runners or on self-hosted runners, which is useful for automating experiments.

**Example: Running a Simulation with GitHub Actions**

```yaml
name: Simulation Run

on: [push]

jobs:
  run_simulation:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - name: Run Simulation
        run: python run_simulation.py input_data.csv output_results.csv
```
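
For experiments that repeat one simulation over several parameter values, a `matrix` strategy starts one job per value. The sketch below assumes `run_simulation.py` accepts a `--seed` option, which is a hypothetical flag; the seed values are placeholders.

```yaml
name: Simulation Sweep

on: [push]

jobs:
  run_simulation:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        seed: [1, 2, 3]  # placeholder values; one job runs per entry
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - name: Run Simulation
        # --seed is an assumed option of the script, not a documented one.
        run: python run_simulation.py input_data.csv output_results_${{ matrix.seed }}.csv --seed ${{ matrix.seed }}
```

Writing each run to its own output file keeps the results of parallel jobs from overwriting one another.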

#### 3. Using Docker for Reproducibility

Like GitLab, GitHub Actions supports Docker containers, which can help ensure that analyses are performed in a consistent, reproducible environment. GitHub-hosted Ubuntu runners already include Docker, so the workflow only needs to check out the repository before building.

**Example: Running a Dockerized Analysis in GitHub Actions**

```yaml
name: Dockerized Analysis

on: [push]

jobs:
  analysis:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository  # the build needs the Dockerfile and code
        uses: actions/checkout@v4
      - name: Build and run Docker container
        run: |
          docker build -t analysis .
          docker run analysis python analyze_data.py
```
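
When a ready-made image is enough, the job-level `container` key runs every step inside that image without a build step. The sketch below reuses the `python:3.9` image and the script name from the GitLab example; treat both as placeholders.

```yaml
name: Containerized Analysis

on: [push]

jobs:
  analysis:
    runs-on: ubuntu-latest
    container:
      image: python:3.9  # all steps below run inside this image
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run Analysis
        run: python analyze_data.py cleaned_data.csv
```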

#### 4. Scheduling Scientific Jobs

You can use GitHub Actions to schedule jobs that run periodically, such as weekly data analyses or simulations. Cron expressions are evaluated in UTC.

**Example: Scheduling a Weekly Job in GitHub Actions**

```yaml
name: Weekly Data Processing

on:
  schedule:
    - cron: "0 0 * * 0"  # Every Sunday at midnight (UTC)

jobs:
  process_data:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
      - name: Process Data
        run: python process_data.py
```
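
A scheduled workflow can be awkward to test, because it otherwise runs only at the scheduled time. Adding the `workflow_dispatch` trigger next to `schedule` also allows a manual start from the Actions tab; the fragment below is a sketch of the `on` section only.

```yaml
on:
  schedule:
    - cron: "0 0 * * 0"  # Every Sunday at midnight (UTC)
  workflow_dispatch:      # also allow manual runs
```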

---

### Benefits of Using GitLab/GitHub CI/CD for Scientific Analysis

- **Automation**: Eliminate manual execution of repetitive tasks such as data cleaning, analysis, or model training.
- **Reproducibility**: Use Docker containers and version control to ensure that all jobs run in the same environment, making it easier to replicate analyses.
- **Collaboration**: Collaborators can easily replicate, review, and contribute to workflows by accessing the pipelines.
- **Scalability**: Use custom or cloud-based runners to handle large, resource-intensive scientific workflows.

---

### Conclusion

Both **GitLab** and **GitHub CI/CD** are excellent tools for automating scientific analysis workflows, ensuring reproducibility, and improving collaboration. Whether you're running simulations, analyzing data, or automating machine learning workflows, CI/CD pipelines provide a powerful framework to streamline research and make it more robust, transparent, and scalable.
{% endraw %}
1 change: 0 additions & 1 deletion source/content/ci_cd_for_repro/slide1.md

This file was deleted.

110 changes: 110 additions & 0 deletions source/content/ci_cd_for_repro/using_docker.md
@@ -0,0 +1,110 @@
{% raw %}
## Using GitLab/GitHub CI/CD Pipelines to Create and Distribute Docker Images

Both **GitLab** and **GitHub** provide robust CI/CD capabilities that can be leveraged to automate the creation and distribution of Docker images. Below is a high-level overview of how to set up CI/CD pipelines in both platforms for this purpose.

### 1. Prerequisites

#### Docker Installation
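- Ensure that Docker is available where the CI/CD runner executes the jobs. GitHub-hosted Ubuntu runners include Docker; on GitLab, the example below uses the `docker` image with the `docker:dind` service; self-hosted runners need Docker installed.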

#### Docker Registry
- Set up a Docker registry to store your images. You can use:
  - **Docker Hub**: A public registry for sharing images.
  - **GitLab Container Registry**: A built-in private registry for GitLab users.
  - **GitHub Container Registry**: A built-in registry for GitHub users.

### 2. Creating Docker Images

#### GitLab CI/CD

##### Step 1: Define the `.gitlab-ci.yml` File
Create a `.gitlab-ci.yml` file in the root of your repository to define the CI/CD pipeline. Here’s a basic example:

```yaml
stages:
  - build

build_and_push:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    # Build and push in the same job: an image built in one job's Docker
    # daemon is not available to a later job.
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY/mygroup/myapp:latest" .
    - docker push "$CI_REGISTRY/mygroup/myapp:latest"
```

##### Step 2: Configure Variables
- When pushing to the GitLab Container Registry, `CI_REGISTRY`, `CI_REGISTRY_USER`, and `CI_REGISTRY_PASSWORD` are predefined in every job, so no setup is needed. To push to an external registry instead, store its credentials as CI/CD variables in the project settings and use those in `docker login`.

#### GitHub Actions

##### Step 1: Define the Workflow File
Create a workflow file in the `.github/workflows` directory (e.g., `docker-build.yml`). Here’s a basic example:

```yaml
name: Build and Push Docker Image
on:
  push:
    branches:
      - main
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Log in to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}
      - name: Build the Docker image
        run: |
          docker build -t myapp:latest .
      - name: Push the Docker image
        run: |
          docker tag myapp:latest myusername/myapp:latest
          docker push myusername/myapp:latest
```

##### Step 2: Configure Secrets
- In your GitHub repository settings, add secrets for `DOCKER_USERNAME` and `DOCKER_PASSWORD` to authenticate with your Docker registry. For Docker Hub, an access token is a safer value for `DOCKER_PASSWORD` than the account password.
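
If the image should live in the GitHub Container Registry mentioned under Prerequisites, the workflow can log in with the built-in `GITHUB_TOKEN` instead of stored Docker Hub secrets. The following is a sketch, not a drop-in file: `myorg/myapp` is a placeholder for your own lowercase owner and image name.

```yaml
name: Build and Push to GHCR
on:
  push:
    branches:
      - main
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # lets GITHUB_TOKEN push to ghcr.io
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push the Docker image
        run: |
          # myorg/myapp is a placeholder; image names must be lowercase.
          docker build -t ghcr.io/myorg/myapp:latest .
          docker push ghcr.io/myorg/myapp:latest
```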

### 3. Distributing Docker Images

#### Using Docker Registries
Once the Docker images are built and pushed to the registry, they can be easily distributed and pulled by other developers or deployment environments. Here’s how:

- **Pulling Images**: Users can pull the images from the registry using the `docker pull` command:

```bash
docker pull myusername/myapp:latest
```

- **Deployment**: The images can be deployed to various environments (e.g., staging, production) using orchestration tools like Kubernetes or Docker Compose.
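
As a small illustration of the Docker Compose route, a Compose file can reference the pushed image directly. This sketch assumes the image name from the pull command above; the service name `myapp` is arbitrary.

```yaml
# docker-compose.yml
services:
  myapp:
    image: myusername/myapp:latest  # pulled from the registry, not built locally
```

Running `docker compose up -d` in the directory containing this file pulls the image if needed and starts the service.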

### 4. Best Practices

- **Versioning**: Tag your Docker images with version numbers (e.g., `myapp:v1.0.0`) to keep track of changes and ensure reproducibility.
- **Automated Testing**: Include automated tests in your CI/CD pipeline to validate the Docker image before pushing it to the registry.
- **Security Scans**: Use tools to scan your Docker images for vulnerabilities before distribution.
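
The versioning and testing practices can be combined in a single GitLab job, sketched below under a few assumptions: the image is pushed to the project's own container registry, it is tagged with the short commit SHA instead of a hand-written version number, and the image contains a Python interpreter for the smoke test. Replace the `docker run` line with your real test command.

```yaml
build_test_push:
  image: docker:latest
  services:
    - docker:dind
  script:
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    # Tag with the commit SHA so every image maps back to one commit.
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    # Placeholder smoke test; a failing command stops the job before the push.
    - docker run --rm "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" python --version
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```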

### Conclusion

By leveraging the CI/CD capabilities of **GitLab** and **GitHub**, you can automate the process of creating and distributing Docker images. This not only streamlines your development workflow but also ensures that your applications are consistently built and deployed across different environments. 🚀🐳
{% endraw %}
