Changed the logic of fetching leaders' data #990

Open · wants to merge 1 commit into base: main

Conversation

samyak003 (Collaborator)

Fixes: #972

@samyak003 samyak003 requested a review from arkid15r as a code owner March 3, 2025 20:34

coderabbitai bot commented Mar 3, 2025

Summary by CodeRabbit

  • New Features

    • Enhanced leader data presentation across chapters, committees, and projects for more accurate information.
  • Refactor

    • Streamlined the internal process for obtaining leader information using additional context to improve data accuracy and error handling.
  • Tests

    • Updated tests to verify the revised leader information retrieval, ensuring consistent and reliable results.

Walkthrough

The changes update the retrieval of leader information across several OWASP modules. Instead of invoking get_leaders() with no arguments, the code now retrieves a repository from the respective objects and passes it to get_leaders(repository). Consequently, the implementation of get_leaders in the scraper module has been updated to fetch leader data from a GitHub-hosted Markdown file, including error handling and logging modifications. The test suite has also been adjusted to supply a mock repository context.

Changes

File(s) | Summary of Changes
--- | ---
`backend/apps/owasp/management/commands/owasp_scrape_(chapters|committees|projects).py` | Updated to retrieve the entity's OWASP repository and pass it to `get_leaders(repository)` instead of calling `get_leaders()` with no arguments.
`backend/apps/owasp/scraper.py` | Modified `get_leaders` to accept a repository parameter, fetch leader data from a `leaders.md` file, and include error handling and logging; added required imports.
`backend/tests/owasp/scraper_test.py` | Updated tests to create a mock repository object and pass it as an argument to `get_leaders`, aligning with the new method signature.
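
For context, the caller-side change follows the same shape in all three scrape commands. A minimal sketch (the variable names around the call are assumptions; only `owasp_repository` and `get_leaders(repository)` are confirmed by this PR):

```python
# Before: leader data was scraped with no repository context.
# leaders = scraper.get_leaders()

# After: the entity (chapter/committee/project) supplies its OWASP repository.
repository = chapter.owasp_repository  # property verified on the models in the review below
leaders = scraper.get_leaders(repository)
```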


sonarqubecloud bot commented Mar 3, 2025


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
backend/apps/owasp/scraper.py (2)

70-73: Consider more robust regex pattern and result handling.

The current regex pattern works for simple cases but may miss leaders with special characters or alternate Markdown link formats.

Consider a more robust implementation:

        for line in lines:
            logger.debug("Processing line: %s", line)
-           match = re.findall(r"\* \[([^\]]+)\]", line)
+           # Match both standard Markdown list items with links and variations
+           match = re.findall(r"\*\s*\[([^\]]+)\](?:\([^)]*\))?", line)
            leaders.extend(match)

This pattern also handles optional spaces after the asterisk and optional link URLs in parentheses after the name.
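
To see the difference concretely, here is a small self-contained check (the sample lines are invented for illustration):

```python
import re

lines = [
    "* [Jane Doe](https://github.com/janedoe)",  # standard list item with link
    "*[No Space](https://example.com)",          # no space after the asterisk
    "* [Plain Name]",                            # bracketed name without a URL
]

original = r"\* \[([^\]]+)\]"
suggested = r"\*\s*\[([^\]]+)\](?:\([^)]*\))?"

for line in lines:
    print(re.findall(original, line), re.findall(suggested, line))
# The original pattern misses the second line; the suggested one
# captures the name in all three cases.
```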


69-69: Consider limiting debug logging for large content.

Logging the entire content may flood logs if the file is large.

Consider limiting the logged content:

-           logger.debug("Content: %s", content)
+           logger.debug("Content length: %d characters", len(content))
+           if len(content) < 500:  # Only log full content if it's reasonably small
+               logger.debug("Content: %s", content)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c55e7ee and 198f9e3.

📒 Files selected for processing (5)
  • backend/apps/owasp/management/commands/owasp_scrape_chapters.py (1 hunks)
  • backend/apps/owasp/management/commands/owasp_scrape_committees.py (1 hunks)
  • backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
  • backend/apps/owasp/scraper.py (2 hunks)
  • backend/tests/owasp/scraper_test.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: CodeQL (python)
  • GitHub Check: Run frontend unit tests
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run backend tests
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (7)
backend/apps/owasp/management/commands/owasp_scrape_chapters.py (1)

66-67:

✅ Verification successful

Updated leader fetching to use repository context - good approach.

The change correctly implements the new pattern of passing the repository context to the get_leaders method. This is consistent with the changes in other scraper commands and ensures that leader data is retrieved within the proper repository context.


🏁 Script executed:

#!/bin/bash
# Check if all scraper commands consistently use the repository parameter

# Check if other files implement the same pattern for fetching leaders
echo "Checking scraper command implementations for consistency:"
rg -l "\.owasp_repository" --type py backend/apps/owasp/management/commands/
rg "get_leaders\(.*repository" --type py backend/apps/owasp/management/commands/

Length of output: 941


Consistency Verified: Repository Parameter Use in Leader Fetching

The updated implementation in owasp_scrape_chapters.py is consistent with similar patterns in other scraper commands. All files now correctly pass the repository context to get_leaders, ensuring uniform behavior.

  • File: backend/apps/owasp/management/commands/owasp_scrape_chapters.py (Lines 66-67)
  • Verified in: owasp_scrape_committees.py & owasp_scrape_projects.py
backend/tests/owasp/scraper_test.py (1)

117-122:

✅ Verification successful

Test updated correctly with mock repository.

The test has been properly updated to accommodate the signature change in get_leaders(). The mock repository includes all the essential properties (key, default_branch, and name) that the updated method likely depends on. This ensures that the test accurately validates the new implementation.


🏁 Script executed:

#!/bin/bash
# Check if the get_leaders implementation relies on these repository properties

# Look at the implementation of get_leaders to verify if it uses these properties
rg -A 10 "def get_leaders" --type py backend/apps/owasp/

Length of output: 865


Action: Approved – Test correctly reflects new implementation expectations.

The updated test in backend/tests/owasp/scraper_test.py uses a mock repository with properties key, default_branch, and name. Verification confirms that get_leaders() (located in backend/apps/owasp/scraper.py) relies on repository.key and repository.default_branch to construct the URL for fetching the leaders file, and references repository.name in its error-logging path. The test accurately validates the method's behavior by asserting that an empty list is returned.
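
Based on the verified details above, the updated test plausibly looks something like the following (a hypothetical reconstruction; the fixture name and the patch target are assumptions, and the real test in backend/tests/owasp/scraper_test.py may differ):

```python
from unittest.mock import MagicMock, patch


def test_get_leaders_returns_empty_list(scraper):  # `scraper` fixture assumed
    repository = MagicMock()
    repository.key = "www-chapter-example"   # invented value
    repository.default_branch = "main"       # invented value
    repository.name = "www-chapter-example"  # referenced in the error-logging path

    # Simulate a missing leaders.md: the fetch returns None, None.split()
    # raises AttributeError, and get_leaders catches it and returns [].
    with patch("apps.owasp.scraper.get_repository_file_content", return_value=None):
        assert scraper.get_leaders(repository) == []
```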

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)

78-79:

✅ Verification successful

Leader fetching updated to use repository context - ensures consistency.

The change aligns with the updated approach of fetching leaders with repository context, consistent with the implementations in other scraper commands. This provides a more contextual way to retrieve leader information.


🏁 Script executed:

#!/bin/bash
# Verify that owasp_repository is properly implemented in the Project model

# Check if the Project model has owasp_repository property/method implemented
echo "Checking Project model implementation:"
rg -A 5 "owasp_repository" --type py backend/apps/owasp/models/project.py

# Also check for any error handling around this property access
rg -C 3 "owasp_repository" --type py backend/apps/owasp/models/

Length of output: 6676


Confirmed: Leader fetching now utilizes the repository context consistently.

The change in backend/apps/owasp/management/commands/owasp_scrape_projects.py (lines 78-79) correctly fetches project.owasp_repository before retrieving the leaders. Verification shows that the owasp_repository property is properly implemented in the Project model—and similar patterns are applied in other models—ensuring a uniform approach across the codebase.

backend/apps/owasp/management/commands/owasp_scrape_committees.py (1)

66-67:

✅ Verification successful

Leader fetching updated to use repository context - maintains consistent approach.

The code now correctly retrieves the repository from the committee object before passing it to get_leaders(). This change is consistent with the implementation in other scraper commands and ensures a uniform approach to fetching leader data across different entity types.


🏁 Script executed:

#!/bin/bash
# Check if the Committee model properly implements owasp_repository
# and look for any implementation details of get_leaders in the scraper

# Check Committee model implementation
echo "Checking Committee model implementation:"
rg -A 5 "owasp_repository" --type py backend/apps/owasp/models/committee.py

# Look at how get_leaders uses the repository parameter
echo -e "\nChecking how get_leaders method uses the repository:"
rg -A 15 "def get_leaders" --type py backend/apps/owasp/scraper.py

Length of output: 1327


Consistent Repository Context for Leader Fetching

Verified that the committee’s owasp_repository is correctly set (see backend/apps/owasp/models/committee.py) and that scraper.get_leaders() properly uses the repository context (see backend/apps/owasp/scraper.py). The update in backend/apps/owasp/management/commands/owasp_scrape_committees.py (lines 66-67) now aligns perfectly with similar implementations across the codebase. No further action is needed.

backend/apps/owasp/scraper.py (3)

4-4: Appropriate imports for the new implementation.

Good job adding the necessary imports for the new leader fetching logic. The regex module for pattern matching and the GitHub utility function are well-placed.

Also applies to: 12-13


61-65: Method signature change with clear implementation.

Good refactoring to change from HTML parsing to a more structured approach using the GitHub-hosted Markdown file. The method signature change is appropriate as it now requires repository context.


63-65:

❓ Verification inconclusive

Verify GitHub raw content URL format.

The URL construction assumes a specific GitHub raw content URL format. Make sure this follows GitHub's stable URL scheme for accessing raw file content.


🏁 Script executed:

#!/bin/bash
# Check if the code already handles repositories with spaces or special characters in names
cd $(git rev-parse --show-toplevel)
url_pattern="https://raw.githubusercontent.com/OWASP/.*/.*/.*/leaders.md"
rg -l "https://raw.githubusercontent.com/OWASP/" | xargs rg -l "leaders.md" | xargs cat | rg "$url_pattern" -I --color=always

Length of output: 303


Review the GitHub Raw URL Construction

The URL built in the code:

content = get_repository_file_content(
    f"https://raw.githubusercontent.com/OWASP/{repository.key}/{repository.default_branch}/leaders.md"
)

correctly follows GitHub’s stable raw content URL scheme. However, the verification shell script’s regex pattern expects one more dynamic segment than actually exists (it uses three ".*" groups after /OWASP/ instead of the two provided by {repository.key} and {repository.default_branch}). This mismatch likely explains why the script produced no output.

Action Items:

  • Confirm that the URL format remains consistent with GitHub’s spec (i.e., https://raw.githubusercontent.com/OWASP/<repository_key>/<default_branch>/leaders.md).
  • If you intend to validate this URL using a regex, update the pattern (for example, to https://raw.githubusercontent.com/OWASP/[^/]+/[^/]+/leaders.md) so it accurately reflects the URL’s structure.
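
A quick sanity check of the corrected pattern (the repository key and branch below are invented; note the dots are escaped here, which the inline pattern above omits):

```python
import re

URL_RE = re.compile(r"https://raw\.githubusercontent\.com/OWASP/[^/]+/[^/]+/leaders\.md")

url = "https://raw.githubusercontent.com/OWASP/www-project-example/main/leaders.md"
assert URL_RE.fullmatch(url)  # exactly two dynamic segments: <repository_key>/<default_branch>
```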

Comment on lines +61 to +78
    def get_leaders(self, repository):
        """Get leaders from leaders.md file on GitHub."""
        content = get_repository_file_content(
            f"https://raw.githubusercontent.com/OWASP/{repository.key}/{repository.default_branch}/leaders.md"
        )
        leaders = []
        try:
            lines = content.split("\n")
            logger.debug("Content: %s", content)
            for line in lines:
                logger.debug("Processing line: %s", line)
                match = re.findall(r"\* \[([^\]]+)\]", line)
                leaders.extend(match)
        except AttributeError:
            logger.exception(
                "Unable to parse leaders.md content", extra={"repository": repository.name}
            )
        return leaders

🛠️ Refactor suggestion

Additional error handling needed for network/GitHub API failures.

The current error handling only catches AttributeError from content parsing, but there's no handling for potential network issues or GitHub API failures that could occur in get_repository_file_content().

Consider adding error handling for network-related exceptions:

    def get_leaders(self, repository):
        """Get leaders from leaders.md file on GitHub."""
-       content = get_repository_file_content(
-           f"https://raw.githubusercontent.com/OWASP/{repository.key}/{repository.default_branch}/leaders.md"
-       )
+       try:
+           content = get_repository_file_content(
+               f"https://raw.githubusercontent.com/OWASP/{repository.key}/{repository.default_branch}/leaders.md"
+           )
+       except (requests.exceptions.RequestException, ValueError) as e:
+           logger.exception(
+               "Failed to fetch leaders.md file", 
+               extra={"repository": repository.name, "error": str(e)}
+           )
+           return []
        leaders = []
        try:
            lines = content.split("\n")
            logger.debug("Content: %s", content)

Comment on lines +67 to +70
        try:
            lines = content.split("\n")
            logger.debug("Content: %s", content)
            for line in lines:

🛠️ Refactor suggestion

Check for empty content before processing.

There's no check for empty content before attempting to split and process it, which could lead to errors if the file is empty or not found.

Add a check for empty content:

        leaders = []
        try:
+           if not content:
+               logger.warning("Empty leaders.md content", extra={"repository": repository.name})
+               return leaders
            lines = content.split("\n")
            logger.debug("Content: %s", content)
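
For reference, the two failure shapes this check guards against behave differently, which a quick snippet makes clear:

```python
content = ""
print(content.split("\n"))  # [''] -- iterates once over an empty line, no error raised

content = None
# content.split("\n")       # would raise AttributeError, caught by the except block above
```

The empty-string case silently yields no leaders, so the explicit warning makes that outcome visible in the logs rather than indistinguishable from a file with no matching lines.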

@arkid15r arkid15r (Collaborator) left a comment

Could you move the logic to RepositoryBasedEntityModel::get_leaders?

upd: also some tests would be great 👍
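
One possible shape of the requested move (a sketch only: the class name comes from the reviewer's comment; everything else, including the stand-in fetch helper, is an assumption rather than code from this PR):

```python
import logging
import re

import requests  # used only by this sketch's stand-in fetch helper

logger = logging.getLogger(__name__)


def get_repository_file_content(url):
    """Stand-in for the project's GitHub utility of the same name (assumption)."""
    response = requests.get(url, timeout=10)
    return response.text if response.ok else None


class RepositoryBasedEntityModel:
    """Sketch of the shared base model the reviewer refers to."""

    def get_leaders(self):
        """Parse leader names from this entity's leaders.md on GitHub."""
        repository = self.owasp_repository  # provided by the concrete models
        content = get_repository_file_content(
            f"https://raw.githubusercontent.com/OWASP/{repository.key}/"
            f"{repository.default_branch}/leaders.md"
        )
        leaders = []
        try:
            for line in content.split("\n"):
                leaders.extend(re.findall(r"\* \[([^\]]+)\]", line))
        except AttributeError:
            logger.exception(
                "Unable to parse leaders.md content",
                extra={"repository": repository.name},
            )
        return leaders
```

With this on the base model, each scrape command could call entity.get_leaders() directly instead of threading the repository through the scraper.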

Development

Successfully merging this pull request may close these issues:

  • Move leaders data fetch logic from scraping to github files parsing