Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: various changes to make building of catalog working again #30

Merged
merged 35 commits into from
Jan 16, 2025

Conversation

m-jahn
Copy link
Collaborator

@m-jahn m-jahn commented Dec 21, 2024

  • catalog has not been building correctly for about 6 months
  • deployment has stopped for about 3 months
  • this PR attempts to fix several problems as outlined below
  • it built correctly now on my fork with the updated GH actions workflow
  • after merging, workflow catalog should build again properly

Main changes:

  • updated the micromamba GH workflow, and the environment.yml file
  • added a pull step before changes are pushed, because it led to a git error if repo was not up-to-date
  • changed building policy to be more resource-efficient: the cron job is now only triggered once per week (instead of daily), and only repos that have had changes in the past 7 days are parsed for updates -- this reduces build time from >24h to 10h and reduces risk of failing jobs
  • added function to remove github access tokens from snakemake --lint output which sometimes get exposed
  • updated the untar comment to fix warnings from python
  • added full update of all repos (not only the ones with latest changes)
  • added some documentation to the README.md

Summary by CodeRabbit

  • New Features

    • Expanded the README.md to include new sections on workflow generation, deployment, and contributing guidelines.
    • Enhanced the generate-catalog script to filter repositories based on last push date and improved error handling.
  • Bug Fixes

    • Improved error handling in the catalog generation process and refined the logic for managing rate limits.
  • Documentation

    • Added badges and detailed descriptions in the README.md for better visibility of repository status.
  • Chores

    • Updated workflow configurations for scheduling and branch handling.
    • Added new entries to the blacklist.txt file.
    • Specified exact versions for dependencies in environment.yml.
    • Updated deployment actions in workflow configurations.
    • Cleaned up unused imports in scripts to streamline the codebase.

Copy link
Contributor

coderabbitai bot commented Dec 21, 2024

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Walkthrough

This pull request introduces updates to the Snakemake Workflow Catalog repository, focusing on workflow configuration, dependency management, and script improvements. The changes span multiple files, including GitHub workflow configurations, README documentation, environment specifications, and core scripts. Key modifications include updating deployment actions, refining repository search logic, enhancing rate limit handling, and specifying exact dependency versions. The updates aim to improve the catalog's reliability, maintainability, and precision in processing and documenting Snakemake workflows.

Changes

File Change Summary
.github/workflows/generate.yml - Updated workflow schedule to weekly on Mondays at 5 AM
- Added dev branch to push triggers
- Changed deployment action to mamba-org/setup-micromamba@v2
- Added environment file parameter
- Improved commit handling with error checking
.github/workflows/test-repo.yml - Updated deployment action to mamba-org/setup-micromamba@v2
- Refined repository name extraction from comments
- Maintained rate limit and comment handling logic
README.md - Added badges for workflow status
- Expanded sections on catalog purpose, workflows, and usage
- Included guidelines for contributions
blacklist.txt - Added new repository entries:
- snakemake/snakemake-workflow-catalog
- tdayris/fair_bowtie2_mapping
- tdayris/fair_fastqc_multiqc
- GiulioCentorame/FADS-by-breastfeeding
environment.yml - Specified exact versions for dependencies:
- jinja2=3.1.4
- pygithub=2.5.0
- gitpython=3.1.43
- snakemake=8.25.5
- snakefmt=0.10.2
- python=3.12.8
- ratelimit=2.2.1
scripts/common.py - Removed unused imports
- Added rate_limit_wait() function
- Enhanced rate limit handling in call_rate_limit_aware()
scripts/generate-catalog.py - Added latest_commit variable
- Updated repository search logic
- Improved error handling and logging
- Modified offset calculation
- Enhanced repository processing

Sequence Diagram

sequenceDiagram
    participant Workflow as GitHub Workflow
    participant Script as generate-catalog.py
    participant GitHub as GitHub API
    participant Repo as Repository

    Workflow->>Script: Trigger catalog generation
    Script->>GitHub: Search repositories
    GitHub-->>Script: Return repository list
    Script->>Script: Apply date threshold
    Script->>Repo: Validate repository
    alt Repository meets criteria
        Script->>Repo: Process repository
        Script->>Workflow: Generate catalog entry
    else Repository skipped
        Script->>Workflow: Log skip reason
    end
Loading

This sequence diagram illustrates the high-level workflow of the catalog generation process, highlighting the key steps of repository search, validation, and processing.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8a51238 and 4e86f3c.

📒 Files selected for processing (1)
  • scripts/generate-catalog.py (7 hunks)

Finishing Touches

  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
README.md (4)

26-31: Fix minor grammatical issues in the workflow criteria section.

Apply these corrections:

-The workflow is contained in a public Github repository.
+The workflow is contained in a public GitHub repository.
-very large files should be handled via [Git LFS]
+large files should be handled via [Git LFS]
-, so that they can be stripped out during cloning).
+ so that they can be stripped out during cloning).
🧰 Tools
🪛 LanguageTool

[misspelling] ~27-~27: This word is normally spelled with a hyphen.
Context: ...g the words "snakemake" and "workflow" (case insensitive). - The repository contains a workflow ...

(EN_COMPOUNDS_CASE_INSENSITIVE)


[style] ~30-~30: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...ions/understanding-github-actions) job (very large files should be handled via [Git LFS](h...

(EN_WEAK_ADJECTIVE)


[typographical] ~30-~30: The conjunction “so that” does not have a comma in front.
Context: ...working-with-files/managing-large-files), so that they can be stripped out during cloning...

(SO_THAT_UNNECESSARY_COMMA)


35-39: Improve clarity in the standardized usage section.

Apply these corrections:

-In order to additionally appear in the "standardized usage" area, repositories additionally have to:
+To appear in the "standardized usage" area, repositories must:

55-55: Fix grammatical issues in the usage instructions.

Apply these corrections:

-Once included in the standardized usage area you can link
+Once included in the standardized usage area, you can link
-Hence, in order to update your workflow's records here,
+Hence, to update your workflow's records here,

Also applies to: 59-59

🧰 Tools
🪛 LanguageTool

[uncategorized] ~55-~55: Possible missing comma found.
Context: ...Once included in the standardized usage area you can link directly to the usage inst...

(AI_HYDRA_LEO_MISSING_COMMA)


80-80: Format the URL properly.

Convert the bare URL to a proper markdown link:

-For more information and to explore the available workflows, visit https://snakemake.github.io/snakemake-workflow-catalog/.
+For more information and to explore the available workflows, visit [the Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog/).
🧰 Tools
🪛 Markdownlint (0.37.0)

80-80: null
Bare URL used

(MD034, no-bare-urls)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f63cd80 and d521f16.

📒 Files selected for processing (7)
  • .github/workflows/generate.yml (4 hunks)
  • .github/workflows/test-repo.yml (1 hunks)
  • README.md (1 hunks)
  • blacklist.txt (1 hunks)
  • environment.yml (1 hunks)
  • scripts/common.py (1 hunks)
  • scripts/generate-catalog.py (7 hunks)
✅ Files skipped from review due to trivial changes (1)
  • blacklist.txt
🧰 Additional context used
📓 Path-based instructions (2)
scripts/common.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

scripts/generate-catalog.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff (0.8.2)
scripts/generate-catalog.py

14-14: Redefinition of unused Environment from line 12

Remove definition: Environment

(F811)


19-19: common.check_repo_exists imported but unused

Remove unused import: common.check_repo_exists

(F401)


275-275: Function definition does not bind loop variable config_readme

(B023)

🪛 LanguageTool
README.md

[misspelling] ~27-~27: This word is normally spelled with a hyphen.
Context: ...g the words "snakemake" and "workflow" (case insensitive). - The repository contains a workflow ...

(EN_COMPOUNDS_CASE_INSENSITIVE)


[style] ~30-~30: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...ions/understanding-github-actions) job (very large files should be handled via [Git LFS](h...

(EN_WEAK_ADJECTIVE)


[typographical] ~30-~30: The conjunction “so that” does not have a comma in front.
Context: ...working-with-files/managing-large-files), so that they can be stripped out during cloning...

(SO_THAT_UNNECESSARY_COMMA)


[style] ~34-~34: Consider a shorter alternative to avoid wordiness.
Context: ...ere. ### Standardized usage workflows In order to additionally appear in the "standardize...

(IN_ORDER_TO_PREMIUM)


[uncategorized] ~55-~55: Possible missing comma found.
Context: ...Once included in the standardized usage area you can link directly to the usage inst...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~59-~59: Consider a shorter alternative to avoid wordiness.
Context: ... the latest non-preview release. Hence, in order to update your workflow's records here, yo...

(IN_ORDER_TO_PREMIUM)

🪛 Markdownlint (0.37.0)
README.md

80-80: null
Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (15)
scripts/generate-catalog.py (6)

9-10: Imports for regex and datetime calculations look good.
These imports are necessary for filtering tokens and implementing the date threshold.


111-112: Date threshold-based repository search is a solid improvement.
Using “pushed:>=” in the search query helps filter stale repositories.

Also applies to: 114-115


252-252: Redacting GitHub tokens is a good security measure.
Ensure the regex is comprehensive enough to catch all possible token formats.


272-272: Topic retrieval logic is appropriate.
This step ensures relevant repository metadata is captured.


278-287: Repo object instantiation is consistent with the updated fields.
The logic for setting the usage flags, release information, readme content, etc., is clear and cohesive.


32-33: Verify the logic behind 'latest_commit' and scaled 'offset'.
• "latest_commit" is set from an environment variable to determine how far back you look for updates. Ensure that all related calls or references are aware of this new usage.
• Changing "offset" from its old calculation to "offset * 10" can drastically expand your repository search. Double-check that this multiplication aligns with your desired iteration logic.

✅ Verification successful

Let me search for references to LATEST_COMMIT to complete the verification.


Based on the analysis of both offset and LATEST_COMMIT usage in the codebase, I can now provide the final response:

The offset multiplication and latest_commit usage are correctly scoped
• The offset * 10 scaling is used consistently within generate-catalog.py for pagination, with proper bounds checking via min(offset + 100, total_count).
LATEST_COMMIT is only used once to determine the search boundary and doesn't have any dependent calls that need updating.
• The offset value is properly initialized from environment variables in common.py and used appropriately across the cleanup-catalog.py script.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Searching all references to 'offset' in the codebase to ensure the usage is correct
rg 'offset'

Length of output: 66324


Script:

#!/bin/bash
# Search for LATEST_COMMIT references
rg "LATEST_COMMIT"

Length of output: 99

environment.yml (1)

6-12: Pinning specific package versions increases reproducibility.
This helps avoid unexpected breakages from upstream updates.

.github/workflows/test-repo.yml (1)

57-57: Updated micromamba action version helps maintain consistency.
Switching to “mamba-org/setup-micromamba@v2” aligns with the other workflow changes.

.github/workflows/generate.yml (5)

5-5: Weekly schedule and additional branch trigger are logical.
Reducing the cron frequency to once a week can help conserve computing resources, and including the “dev” branch broadens testing.

Also applies to: 9-9


20-25: Migration to 'mamba-org/setup-micromamba@v2' and pulling latest changes.
This ensures deployment uses the new environment file and rebase prevents push conflicts.

Also applies to: 26-28


35-35: 'LATEST_COMMIT' environment variable.
Exposing it here matches the usage in 'generate-catalog.py' for filtering repositories by date.


43-43: Graceful handling of 'No changes to commit'.
Nicely avoids exit failure when there are no new modifications.


60-64: Consistent cleanup job improvements.
Applying the same micromamba upgrade and pull logic to the cleanup step keeps the workflows in sync.

Also applies to: 66-68, 82-82, 88-88

scripts/common.py (1)

Line range hint 45-54: LGTM! Well-implemented rate limit handling.

The new rate_limit_wait function is a solid implementation with:

  • Correct timestamp handling
  • Buffer time to ensure reset completion
  • Proper logging for debugging
  • Protection against negative wait times
README.md (1)

3-6: LGTM! Informative status badges added.

The addition of status badges provides valuable at-a-glance information about the project's health and activity.

scripts/generate-catalog.py Outdated Show resolved Hide resolved
scripts/generate-catalog.py Show resolved Hide resolved
@m-jahn
Copy link
Collaborator Author

m-jahn commented Jan 3, 2025

Dear admins, can somebody review this PR and merge if everything is OK?
I'm mentioning @johanneskoester but maybe other snakemake admins could take over too?

@m-jahn
Copy link
Collaborator Author

m-jahn commented Jan 9, 2025

Any comments or reviews on this PR? It would be nice to fix this timely so that the catalog can build again.
I would also be interested in adding new features to the catalog in the future, it's worth it to be actively maintained!

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d521f16 and 8a51238.

📒 Files selected for processing (1)
  • scripts/generate-catalog.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
scripts/generate-catalog.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff (0.8.2)
scripts/generate-catalog.py

14-14: Redefinition of unused Environment from line 12

Remove definition: Environment

(F811)


274-274: Function definition does not bind loop variable config_readme

(B023)

🔇 Additional comments (8)
scripts/generate-catalog.py (8)

9-10: LGTM: New imports support enhanced functionality.

The added imports for re and datetime modules are necessary for the new date-based filtering and output sanitization features.


17-26: LGTM: Cleaned up imports from common module.

Unused imports have been removed, improving code maintainability.


110-114: LGTM: Improved repository filtering with date threshold.

The addition of date-based filtering (pushed:>={date_threshold}) effectively implements the PR objective of processing only recently updated repositories, which should reduce build times and resource usage.


212-213: LGTM: Added proper rule file extension validation.

The check ensures that rule modules use the .smk extension, promoting consistency across workflows.


251-251: LGTM: Added security measure to prevent token exposure.

The regex pattern effectively sanitizes GitHub tokens from the linting output, preventing accidental exposure of sensitive information.


277-286: LGTM: Improved code readability.

The Repo instantiation has been reformatted for better readability while maintaining functionality.


296-298: LGTM: Improved efficiency of old repository handling.

The use of list comprehension provides a more efficient and readable way to extend the repository list.


31-33: Verify the impact of offset calculation change.

The offset calculation has been changed from division to multiplication (offset * 10). This significant change could affect repository pagination and processing.

Run this script to analyze the impact:

✅ Verification successful

Offset calculation change is correct

The offset multiplication by 10 in generate-catalog.py is intentional for pagination purposes and works correctly with the offset value imported from common.py. The different offset calculations in cleanup-catalog.py serve a separate batch processing purpose.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the usage of offset in the codebase
# and its historical values

# Check offset usage in code
rg -B 5 -A 5 'offset.*=.*' 

# Check historical offset values from git history
git log -p | rg -B 5 -A 5 'offset.*=.*'

Length of output: 2479

scripts/generate-catalog.py Show resolved Hide resolved
@johanneskoester johanneskoester merged commit 459e834 into snakemake:main Jan 16, 2025
1 check was pending
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants