feat: various changes to make building of catalog working again #30

m-jahn · 2024-12-21T10:27:28Z

catalog has not been building correctly for about 6 months
deployment has stopped for about 3 months
this PR attempts to fix several problems as outlined below
it built correctly now on my fork with the updated GH actions workflow
after merging, workflow catalog should build again properly

Main changes:

updated the micromamba GH workflow, and the environment.yml file
added a pull step before changes are pushed, because it led to a git error if repo was not up-to-date
changed building policy to be more resource-efficient: the cron job is now only triggered once per week (instead of daily), and only repos that have had changes in the past 7 days are parsed for updates -- this reduces build time from >24h to 10h and reduces risk of failing jobs
added function to remove github access tokens from snakemake --lint output which sometimes get exposed
updated the untar comment to fix warnings from python
added full update of all repos (not only the ones with latest changes)
added some documentation to the README.md

Summary by CodeRabbit

New Features
- Expanded the README.md to include new sections on workflow generation, deployment, and contributing guidelines.
- Enhanced the generate-catalog script to filter repositories based on last push date and improved error handling.
Bug Fixes
- Improved error handling in the catalog generation process and refined the logic for managing rate limits.
Documentation
- Added badges and detailed descriptions in the README.md for better visibility of repository status.
Chores
- Updated workflow configurations for scheduling and branch handling.
- Added new entries to the blacklist.txt file.
- Specified exact versions for dependencies in environment.yml.
- Updated deployment actions in workflow configurations.
- Cleaned up unused imports in scripts to streamline the codebase.

…remote

coderabbitai · 2024-12-21T10:27:37Z

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Walkthrough

This pull request introduces updates to the Snakemake Workflow Catalog repository, focusing on workflow configuration, dependency management, and script improvements. The changes span multiple files, including GitHub workflow configurations, README documentation, environment specifications, and core scripts. Key modifications include updating deployment actions, refining repository search logic, enhancing rate limit handling, and specifying exact dependency versions. The updates aim to improve the catalog's reliability, maintainability, and precision in processing and documenting Snakemake workflows.

Changes

File	Change Summary
`.github/workflows/generate.yml`	- Updated workflow schedule to weekly on Mondays at 5 AM - Added `dev` branch to push triggers - Changed deployment action to `mamba-org/setup-micromamba@v2` - Added environment file parameter - Improved commit handling with error checking
`.github/workflows/test-repo.yml`	- Updated deployment action to `mamba-org/setup-micromamba@v2` - Refined repository name extraction from comments - Maintained rate limit and comment handling logic
`README.md`	- Added badges for workflow status - Expanded sections on catalog purpose, workflows, and usage - Included guidelines for contributions
`blacklist.txt`	- Added new repository entries: - `snakemake/snakemake-workflow-catalog` - `tdayris/fair_bowtie2_mapping` - `tdayris/fair_fastqc_multiqc` - `GiulioCentorame/FADS-by-breastfeeding`
`environment.yml`	- Specified exact versions for dependencies: - `jinja2=3.1.4` - `pygithub=2.5.0` - `gitpython=3.1.43` - `snakemake=8.25.5` - `snakefmt=0.10.2` - `python=3.12.8` - `ratelimit=2.2.1`
`scripts/common.py`	- Removed unused imports - Added `rate_limit_wait()` function - Enhanced rate limit handling in `call_rate_limit_aware()`
`scripts/generate-catalog.py`	- Added `latest_commit` variable - Updated repository search logic - Improved error handling and logging - Modified offset calculation - Enhanced repository processing

Sequence Diagram

sequenceDiagram
    participant Workflow as GitHub Workflow
    participant Script as generate-catalog.py
    participant GitHub as GitHub API
    participant Repo as Repository

    Workflow->>Script: Trigger catalog generation
    Script->>GitHub: Search repositories
    GitHub-->>Script: Return repository list
    Script->>Script: Apply date threshold
    Script->>Repo: Validate repository
    alt Repository meets criteria
        Script->>Repo: Process repository
        Script->>Workflow: Generate catalog entry
    else Repository skipped
        Script->>Workflow: Log skip reason
    end

This sequence diagram illustrates the high-level workflow of the catalog generation process, highlighting the key steps of repository search, validation, and processing.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8a51238 and 4e86f3c.

📒 Files selected for processing (1)

scripts/generate-catalog.py (7 hunks)

Finishing Touches

📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

README.md (4)
26-31: Fix minor grammatical issues in the workflow criteria section.

Apply these corrections:
-The workflow is contained in a public Github repository.
+The workflow is contained in a public GitHub repository.
-very large files should be handled via [Git LFS]
+large files should be handled via [Git LFS]
-, so that they can be stripped out during cloning).
+ so that they can be stripped out during cloning).
🧰 Tools

🪛 LanguageTool

[misspelling] ~27-~27: This word is normally spelled with a hyphen.
Context: ...g the words "snakemake" and "workflow" (case insensitive). - The repository contains a workflow ...

(EN_COMPOUNDS_CASE_INSENSITIVE)

[style] ~30-~30: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...ions/understanding-github-actions) job (very large files should be handled via [Git LFS](h...

(EN_WEAK_ADJECTIVE)

[typographical] ~30-~30: The conjunction “so that” does not have a comma in front.
Context: ...working-with-files/managing-large-files), so that they can be stripped out during cloning...

(SO_THAT_UNNECESSARY_COMMA)

35-39: Improve clarity in the standardized usage section.

Apply these corrections:
-In order to additionally appear in the "standardized usage" area, repositories additionally have to:
+To appear in the "standardized usage" area, repositories must:
55-55: Fix grammatical issues in the usage instructions.

Apply these corrections:
-Once included in the standardized usage area you can link
+Once included in the standardized usage area, you can link
-Hence, in order to update your workflow's records here,
+Hence, to update your workflow's records here,
Also applies to: 59-59

🧰 Tools

🪛 LanguageTool

[uncategorized] ~55-~55: Possible missing comma found.
Context: ...Once included in the standardized usage area you can link directly to the usage inst...

(AI_HYDRA_LEO_MISSING_COMMA)

80-80: Format the URL properly.

Convert the bare URL to a proper markdown link:
-For more information and to explore the available workflows, visit https://snakemake.github.io/snakemake-workflow-catalog/.
+For more information and to explore the available workflows, visit [the Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog/).
🧰 Tools

🪛 Markdownlint (0.37.0)

80-80: null
Bare URL used

(MD034, no-bare-urls)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f63cd80 and d521f16.

📒 Files selected for processing (7)

.github/workflows/generate.yml (4 hunks)
.github/workflows/test-repo.yml (1 hunks)
README.md (1 hunks)
blacklist.txt (1 hunks)
environment.yml (1 hunks)
scripts/common.py (1 hunks)
scripts/generate-catalog.py (7 hunks)

✅ Files skipped from review due to trivial changes (1)

blacklist.txt

🧰 Additional context used

📓 Path-based instructions (2)

scripts/common.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

scripts/generate-catalog.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff (0.8.2)

scripts/generate-catalog.py

14-14: Redefinition of unused Environment from line 12

Remove definition: Environment

(F811)

19-19: common.check_repo_exists imported but unused

Remove unused import: common.check_repo_exists

(F401)

275-275: Function definition does not bind loop variable config_readme

(B023)

🪛 LanguageTool

README.md

[misspelling] ~27-~27: This word is normally spelled with a hyphen.
Context: ...g the words "snakemake" and "workflow" (case insensitive). - The repository contains a workflow ...

(EN_COMPOUNDS_CASE_INSENSITIVE)

[style] ~30-~30: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...ions/understanding-github-actions) job (very large files should be handled via [Git LFS](h...

(EN_WEAK_ADJECTIVE)

[typographical] ~30-~30: The conjunction “so that” does not have a comma in front.
Context: ...working-with-files/managing-large-files), so that they can be stripped out during cloning...

(SO_THAT_UNNECESSARY_COMMA)

[style] ~34-~34: Consider a shorter alternative to avoid wordiness.
Context: ...ere. ### Standardized usage workflows In order to additionally appear in the "standardize...

(IN_ORDER_TO_PREMIUM)

[uncategorized] ~55-~55: Possible missing comma found.
Context: ...Once included in the standardized usage area you can link directly to the usage inst...

(AI_HYDRA_LEO_MISSING_COMMA)

[style] ~59-~59: Consider a shorter alternative to avoid wordiness.
Context: ... the latest non-preview release. Hence, in order to update your workflow's records here, yo...

(IN_ORDER_TO_PREMIUM)

🪛 Markdownlint (0.37.0)

README.md

80-80: null
Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (15)

scripts/generate-catalog.py (6)

9-10: Imports for regex and datetime calculations look good.
These imports are necessary for filtering tokens and implementing the date threshold.

111-112: Date threshold-based repository search is a solid improvement.
Using “pushed:>=” in the search query helps filter stale repositories.

Also applies to: 114-115

252-252: Redacting GitHub tokens is a good security measure.
Ensure the regex is comprehensive enough to catch all possible token formats.

272-272: Topic retrieval logic is appropriate.
This step ensures relevant repository metadata is captured.

278-287: Repo object instantiation is consistent with the updated fields.
The logic for setting the usage flags, release information, readme content, etc., is clear and cohesive.

32-33: Verify the logic behind 'latest_commit' and scaled 'offset'.
• "latest_commit" is set from an environment variable to determine how far back you look for updates. Ensure that all related calls or references are aware of this new usage.
• Changing "offset" from its old calculation to "offset * 10" can drastically expand your repository search. Double-check that this multiplication aligns with your desired iteration logic.

✅ Verification successful

Let me search for references to LATEST_COMMIT to complete the verification.

Based on the analysis of both offset and LATEST_COMMIT usage in the codebase, I can now provide the final response:

The offset multiplication and latest_commit usage are correctly scoped
• The offset * 10 scaling is used consistently within generate-catalog.py for pagination, with proper bounds checking via min(offset + 100, total_count).
• LATEST_COMMIT is only used once to determine the search boundary and doesn't have any dependent calls that need updating.
• The offset value is properly initialized from environment variables in common.py and used appropriately across the cleanup-catalog.py script.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Searching all references to 'offset' in the codebase to ensure the usage is correct
rg 'offset'

Length of output: 66324

Script:

#!/bin/bash
# Search for LATEST_COMMIT references
rg "LATEST_COMMIT"

Length of output: 99

environment.yml (1)

6-12: Pinning specific package versions increases reproducibility.
This helps avoid unexpected breakages from upstream updates.

.github/workflows/test-repo.yml (1)

57-57: Updated micromamba action version helps maintain consistency.
Switching to “mamba-org/setup-micromamba@v2” aligns with the other workflow changes.

.github/workflows/generate.yml (5)

5-5: Weekly schedule and additional branch trigger are logical.
Reducing the cron frequency to once a week can help conserve computing resources, and including the “dev” branch broadens testing.

Also applies to: 9-9

20-25: Migration to 'mamba-org/setup-micromamba@v2' and pulling latest changes.
This ensures deployment uses the new environment file and rebase prevents push conflicts.

Also applies to: 26-28

35-35: 'LATEST_COMMIT' environment variable.
Exposing it here matches the usage in 'generate-catalog.py' for filtering repositories by date.

43-43: Graceful handling of 'No changes to commit'.
Nicely avoids exit failure when there are no new modifications.

60-64: Consistent cleanup job improvements.
Applying the same micromamba upgrade and pull logic to the cleanup step keeps the workflows in sync.

Also applies to: 66-68, 82-82, 88-88

scripts/common.py (1)

Line range hint 45-54: LGTM! Well-implemented rate limit handling.

The new rate_limit_wait function is a solid implementation with:

Correct timestamp handling
Buffer time to ensure reset completion
Proper logging for debugging
Protection against negative wait times

README.md (1)

3-6: LGTM! Informative status badges added.

The addition of status badges provides valuable at-a-glance information about the project's health and activity.

scripts/generate-catalog.py

m-jahn · 2025-01-03T07:37:46Z

Dear admins, can somebody review this PR and merge if everything is OK?
I'm mentioning @johanneskoester but maybe other snakemake admins could take over too?

m-jahn · 2025-01-09T06:45:36Z

Any comments or reviews on this PR? It would be nice to fix this timely so that the catalog can build again.
I would also be interested in adding new features to the catalog in the future, it's worth it to be actively maintained!

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d521f16 and 8a51238.

📒 Files selected for processing (1)

scripts/generate-catalog.py (7 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

scripts/generate-catalog.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff (0.8.2)

scripts/generate-catalog.py

14-14: Redefinition of unused Environment from line 12

Remove definition: Environment

(F811)

274-274: Function definition does not bind loop variable config_readme

(B023)

🔇 Additional comments (8)

scripts/generate-catalog.py (8)

9-10: LGTM: New imports support enhanced functionality.

The added imports for re and datetime modules are necessary for the new date-based filtering and output sanitization features.

17-26: LGTM: Cleaned up imports from common module.

Unused imports have been removed, improving code maintainability.

110-114: LGTM: Improved repository filtering with date threshold.

The addition of date-based filtering (pushed:>={date_threshold}) effectively implements the PR objective of processing only recently updated repositories, which should reduce build times and resource usage.

212-213: LGTM: Added proper rule file extension validation.

The check ensures that rule modules use the .smk extension, promoting consistency across workflows.

251-251: LGTM: Added security measure to prevent token exposure.

The regex pattern effectively sanitizes GitHub tokens from the linting output, preventing accidental exposure of sensitive information.

277-286: LGTM: Improved code readability.

The Repo instantiation has been reformatted for better readability while maintaining functionality.

296-298: LGTM: Improved efficiency of old repository handling.

The use of list comprehension provides a more efficient and readable way to extend the repository list.

31-33: Verify the impact of offset calculation change.

The offset calculation has been changed from division to multiplication (offset * 10). This significant change could affect repository pagination and processing.

Run this script to analyze the impact:

✅ Verification successful

Offset calculation change is correct ✅

The offset multiplication by 10 in generate-catalog.py is intentional for pagination purposes and works correctly with the offset value imported from common.py. The different offset calculations in cleanup-catalog.py serve a separate batch processing purpose.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check the usage of offset in the codebase
# and its historical values

# Check offset usage in code
rg -B 5 -A 5 'offset.*=.*' 

# Check historical offset values from git history
git log -p | rg -B 5 -A 5 'offset.*=.*'
Length of output: 2479

scripts/generate-catalog.py

m-jahn and others added 30 commits December 16, 2024 15:13

fix: attempt to update GH actions workflow for building catalog

a3648d8

feat: added some basic instructions based on catalog landing page

72d09f0

fix: try to fix env availability

b1609a0

Add changes

a140548

fix: added pull step in attempt to fix failureto push to non-updated …

65a52d9

…remote

Add changes

59ad127

fix: added pull step attempt 2

590c823

fix: added pull step attempt 3

469a84c

Add changes

3023cb3

Add changes

5b81db1

fix: restored original runner chunks (10 x 100)

1d26cc2

fix: minor improvements to doc, added badges

5c13f22

fix: issue with autostashing intermittent changes

4b105b2

Add changes

76a76c7

Add changes

e8232a2

fix: added problematic repos flagged by GH push protection

d5593ac

Add changes

b2d2d54

Add changes

7eedc80

Add changes

480c3a2

Add changes

959ed8d

Add changes

5604c32

Add changes

fe2404d

Add changes

bb05460

Add changes

8166763

fix: add filter argument to untar as suggested by warning

82a9443

fix: prevent possible exposure of tokens from snakemake --lint

816fc70

fix: added parameter to search only for repos that had recent updates

d058ab5

fix: full update of all repos

3a69a9e

Add changes

95da129

Add changes

6451871

github-actions bot added 3 commits December 20, 2024 11:19

Add changes

900ec02

Add changes

922633f

Add changes

d521f16

coderabbitai bot reviewed Dec 21, 2024

View reviewed changes

scripts/generate-catalog.py Outdated Show resolved Hide resolved

scripts/generate-catalog.py Show resolved Hide resolved

m-jahn mentioned this pull request Dec 22, 2024

workflow catalog not being build for past months #29

Open

Update scripts/generate-catalog.py

8a51238

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai bot reviewed Jan 16, 2025

View reviewed changes

scripts/generate-catalog.py Show resolved Hide resolved

remove duplicate import

4e86f3c

johanneskoester merged commit 459e834 into snakemake:main Jan 16, 2025
1 check was pending

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: various changes to make building of catalog working again #30

feat: various changes to make building of catalog working again #30

m-jahn commented Dec 21, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 21, 2024 •

edited

Loading

Review failed

Walkthrough

Changes

Sequence Diagram

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

m-jahn commented Jan 3, 2025

m-jahn commented Jan 9, 2025

coderabbitai bot left a comment

feat: various changes to make building of catalog working again #30

feat: various changes to make building of catalog working again #30

Conversation

m-jahn commented Dec 21, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Dec 21, 2024 • edited Loading

Review failed

Walkthrough

Changes

Sequence Diagram

Finishing Touches

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

m-jahn commented Jan 3, 2025

m-jahn commented Jan 9, 2025

coderabbitai bot left a comment

Choose a reason for hiding this comment

m-jahn commented Dec 21, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 21, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)