Changed the logic of fetching leader's data #990
base: main
Conversation
Summary by CodeRabbit

Walkthrough

The changes update the retrieval of leader information across several OWASP modules. Instead of parsing leaders out of scraped HTML, each module now passes its repository context to get_leaders, which reads the repository's leaders.md file on GitHub.
Actionable comments posted: 2
🧹 Nitpick comments (2)
backend/apps/owasp/scraper.py (2)
70-73: Consider a more robust regex pattern and result handling.

The current regex pattern works for simple cases but may miss leaders with special characters or alternate Markdown link formats.

Consider a more robust implementation:

```diff
 for line in lines:
     logger.debug("Processing line: %s", line)
-    match = re.findall(r"\* \[([^\]]+)\]", line)
+    # Match both standard Markdown list items with links and variations
+    match = re.findall(r"\*\s*\[([^\]]+)\](?:\([^)]*\))?", line)
     leaders.extend(match)
```

This pattern also handles optional spaces after the asterisk and optional link URLs in parentheses after the name.
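As a quick sanity check, the two patterns can be compared on a couple of hypothetical leader lines (the names below are made up, not from any real leaders.md):

```python
import re

# Two hypothetical leaders.md lines; neither name is from the real file.
lines = [
    "* [Jane Doe](https://github.com/janedoe)",  # link with URL
    "*  [John Smith]",                           # extra space, no URL
]

old = r"\* \[([^\]]+)\]"
new = r"\*\s*\[([^\]]+)\](?:\([^)]*\))?"

# The original pattern misses the second line because of the double space.
print([m for line in lines for m in re.findall(old, line)])  # ['Jane Doe']
print([m for line in lines for m in re.findall(new, line)])  # ['Jane Doe', 'John Smith']
```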
69-69: Consider limiting debug logging for large content.

Logging the entire content may flood logs if the file is large.

Consider limiting the logged content:

```diff
-logger.debug("Content: %s", content)
+logger.debug("Content length: %d characters", len(content))
+if len(content) < 500:  # Only log full content if it's reasonably small
+    logger.debug("Content: %s", content)
```
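A minimal standalone sketch of the gated logging this suggests; the helper name `log_content` is hypothetical and not part of the scraper:

```python
import logging

logger = logging.getLogger("owasp.scraper")

def log_content(content, limit=500):
    """Always log the size; log the full body only when it is small."""
    logger.debug("Content length: %d characters", len(content))
    if len(content) < limit:
        logger.debug("Content: %s", content)

log_content("* [Jane Doe]")   # small: length and body are both logged
log_content("x" * 10_000)     # large: only the length is logged
```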
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- backend/apps/owasp/management/commands/owasp_scrape_chapters.py (1 hunks)
- backend/apps/owasp/management/commands/owasp_scrape_committees.py (1 hunks)
- backend/apps/owasp/management/commands/owasp_scrape_projects.py (1 hunks)
- backend/apps/owasp/scraper.py (2 hunks)
- backend/tests/owasp/scraper_test.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: CodeQL (python)
- GitHub Check: Run frontend unit tests
- GitHub Check: Run frontend e2e tests
- GitHub Check: Run backend tests
- GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (7)
backend/apps/owasp/management/commands/owasp_scrape_chapters.py (1)
66-67: ✅ Verification successful

Updated leader fetching to use repository context - good approach.

The change correctly implements the new pattern of passing the repository context to the get_leaders method. This is consistent with the changes in other scraper commands and ensures that leader data is retrieved within the proper repository context.

🏁 Script executed:

```shell
#!/bin/bash
# Check if all scraper commands consistently use the repository parameter
# Check if other files implement the same pattern for fetching leaders
echo "Checking scraper command implementations for consistency:"
rg -l "\.owasp_repository" --type py backend/apps/owasp/management/commands/
rg "get_leaders\(.*repository" --type py backend/apps/owasp/management/commands/
```

Length of output: 941
Consistency Verified: Repository Parameter Use in Leader Fetching

The updated implementation in owasp_scrape_chapters.py is consistent with similar patterns in other scraper commands. All files now correctly pass the repository context to get_leaders, ensuring uniform behavior.

- File: backend/apps/owasp/management/commands/owasp_scrape_chapters.py (Lines 66-67)
- Verified in: owasp_scrape_committees.py & owasp_scrape_projects.py
backend/tests/owasp/scraper_test.py (1)
117-122: ✅ Verification successful

Test updated correctly with mock repository.

The test has been properly updated to accommodate the signature change in get_leaders(). The mock repository includes all the essential properties (key, default_branch, and name) that the updated method likely depends on. This ensures that the test accurately validates the new implementation.

🏁 Script executed:

```shell
#!/bin/bash
# Check if the get_leaders implementation relies on these repository properties
# Look at the implementation of get_leaders to verify if it uses these properties
rg -A 10 "def get_leaders" --type py backend/apps/owasp/
```

Length of output: 865
Action: Approved - Test correctly reflects new implementation expectations.

The updated test in backend/tests/owasp/scraper_test.py uses a mock repository with properties key, default_branch, and name. Verification confirms that while get_leaders() (located in backend/apps/owasp/scraper.py) relies on repository.key and repository.default_branch (to construct the URL for fetching the leaders file), the inclusion of repository.name is extraneous but harmless. The test accurately validates the method's behavior by asserting that an empty list is returned.

backend/apps/owasp/management/commands/owasp_scrape_projects.py (1)
78-79: ✅ Verification successful

Leader fetching updated to use repository context - ensures consistency.

The change aligns with the updated approach of fetching leaders with repository context, consistent with the implementations in other scraper commands. This provides a more contextual way to retrieve leader information.

🏁 Script executed:

```shell
#!/bin/bash
# Verify that owasp_repository is properly implemented in the Project model
# Check if the Project model has owasp_repository property/method implemented
echo "Checking Project model implementation:"
rg -A 5 "owasp_repository" --type py backend/apps/owasp/models/project.py
# Also check for any error handling around this property access
rg -C 3 "owasp_repository" --type py backend/apps/owasp/models/
```

Length of output: 6676
Confirmed: Leader fetching now utilizes the repository context consistently.

The change in backend/apps/owasp/management/commands/owasp_scrape_projects.py (lines 78-79) correctly fetches project.owasp_repository before retrieving the leaders. Verification shows that the owasp_repository property is properly implemented in the Project model, and similar patterns are applied in other models, ensuring a uniform approach across the codebase.

backend/apps/owasp/management/commands/owasp_scrape_committees.py (1)
66-67: ✅ Verification successful

Leader fetching updated to use repository context - maintains consistent approach.

The code now correctly retrieves the repository from the committee object before passing it to get_leaders(). This change is consistent with the implementation in other scraper commands and ensures a uniform approach to fetching leader data across different entity types.

🏁 Script executed:

```shell
#!/bin/bash
# Check if the Committee model properly implements owasp_repository
# and look for any implementation details of get_leaders in the scraper

# Check Committee model implementation
echo "Checking Committee model implementation:"
rg -A 5 "owasp_repository" --type py backend/apps/owasp/models/committee.py

# Look at how get_leaders uses the repository parameter
echo -e "\nChecking how get_leaders method uses the repository:"
rg -A 15 "def get_leaders" --type py backend/apps/owasp/scraper.py
```

Length of output: 1327
Consistent Repository Context for Leader Fetching

Verified that the committee's owasp_repository is correctly set (see backend/apps/owasp/models/committee.py) and that scraper.get_leaders() properly uses the repository context (see backend/apps/owasp/scraper.py). The update in backend/apps/owasp/management/commands/owasp_scrape_committees.py (lines 66-67) now aligns with similar implementations across the codebase. No further action is needed.

backend/apps/owasp/scraper.py (3)
4-4: Appropriate imports for the new implementation.

Good job adding the necessary imports for the new leader fetching logic. The regex module for pattern matching and the GitHub utility function are well-placed.

Also applies to: 12-13
61-65: Method signature change with clear implementation.

Good refactoring to change from HTML parsing to a more structured approach using the GitHub-hosted Markdown file. The method signature change is appropriate as it now requires repository context.
63-65: ❓ Verification inconclusive

Verify GitHub raw content URL format.

The URL construction assumes a specific GitHub raw content URL format. Make sure this follows GitHub's stable URL scheme for accessing raw file content.

🏁 Script executed:

```shell
#!/bin/bash
# Check if the code already handles repositories with spaces or special characters in names
cd $(git rev-parse --show-toplevel)
url_pattern="https://raw.githubusercontent.com/OWASP/.*/.*/.*/leaders.md"
rg -l "https://raw.githubusercontent.com/OWASP/" | xargs rg -l "leaders.md" | xargs cat | rg "$url_pattern" -I --color=always
```

Length of output: 303
Review the GitHub Raw URL Construction

The URL built in the code:

```python
content = get_repository_file_content(
    f"https://raw.githubusercontent.com/OWASP/{repository.key}/{repository.default_branch}/leaders.md"
)
```

correctly follows GitHub's stable raw content URL scheme. However, the verification shell script's regex pattern expects one more dynamic segment than actually exists (it uses three ".*" groups after /OWASP/ instead of the two provided by {repository.key} and {repository.default_branch}). This mismatch likely explains why the script produced no output.

Action Items:

- Confirm that the URL format remains consistent with GitHub's spec (i.e., https://raw.githubusercontent.com/OWASP/<repository_key>/<default_branch>/leaders.md).
- If you intend to validate this URL using a regex, update the pattern (for example, to https://raw.githubusercontent.com/OWASP/[^/]+/[^/]+/leaders.md) so it accurately reflects the URL's structure.
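If that regex route is taken, a small sketch of how the corrected two-segment pattern behaves; the repository key and branch values below are hypothetical:

```python
import re

# Two dynamic segments after /OWASP/: repository key and default branch.
URL_RE = re.compile(
    r"https://raw\.githubusercontent\.com/OWASP/[^/]+/[^/]+/leaders\.md"
)

good = "https://raw.githubusercontent.com/OWASP/www-chapter-london/main/leaders.md"
bad = "https://raw.githubusercontent.com/OWASP/leaders.md"  # missing branch segment

print(bool(URL_RE.fullmatch(good)))  # True
print(bool(URL_RE.fullmatch(bad)))   # False
```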
```python
def get_leaders(self, repository):
    """Get leaders from leaders.md file on GitHub."""
    content = get_repository_file_content(
        f"https://raw.githubusercontent.com/OWASP/{repository.key}/{repository.default_branch}/leaders.md"
    )
    leaders = []
    try:
        lines = content.split("\n")
        logger.debug("Content: %s", content)
        for line in lines:
            logger.debug("Processing line: %s", line)
            match = re.findall(r"\* \[([^\]]+)\]", line)
            leaders.extend(match)
    except AttributeError:
        logger.exception(
            "Unable to parse leaders.md content", extra={"repository": repository.name}
        )
    return leaders
```
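For illustration, here is a standalone re-creation of the parsing path above, with the HTTP fetch injected as a parameter so it runs without the network; the `fetch` parameter and the sample repository values are assumptions for the demo, not the project's real API:

```python
import re
from types import SimpleNamespace

def get_leaders(repository, fetch):
    """Mirror of the snippet's logic with an injectable fetcher."""
    content = fetch(
        f"https://raw.githubusercontent.com/OWASP/{repository.key}/"
        f"{repository.default_branch}/leaders.md"
    )
    leaders = []
    try:
        for line in content.split("\n"):
            leaders.extend(re.findall(r"\* \[([^\]]+)\]", line))
    except AttributeError:  # fetch returned None (file missing)
        pass
    return leaders

repo = SimpleNamespace(key="www-project-example", default_branch="main")
md = "### Leaders\n* [Jane Doe](mailto:jane@example.com)\n* [John Smith]"
print(get_leaders(repo, lambda url: md))    # ['Jane Doe', 'John Smith']
print(get_leaders(repo, lambda url: None))  # []
```

This is roughly the shape a unit test for the method could take, which is also what the review comments below ask for.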
🛠️ Refactor suggestion
Additional error handling needed for network/GitHub API failures.
The current error handling only catches AttributeError from content parsing, but there's no handling for potential network issues or GitHub API failures that could occur in get_repository_file_content().
Consider adding error handling for network-related exceptions:
```diff
 def get_leaders(self, repository):
     """Get leaders from leaders.md file on GitHub."""
-    content = get_repository_file_content(
-        f"https://raw.githubusercontent.com/OWASP/{repository.key}/{repository.default_branch}/leaders.md"
-    )
+    try:
+        content = get_repository_file_content(
+            f"https://raw.githubusercontent.com/OWASP/{repository.key}/{repository.default_branch}/leaders.md"
+        )
+    except (requests.exceptions.RequestException, ValueError) as e:
+        logger.exception(
+            "Failed to fetch leaders.md file",
+            extra={"repository": repository.name, "error": str(e)}
+        )
+        return []
     leaders = []
     try:
         lines = content.split("\n")
         logger.debug("Content: %s", content)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def get_leaders(self, repository):
    """Get leaders from leaders.md file on GitHub."""
    try:
        content = get_repository_file_content(
            f"https://raw.githubusercontent.com/OWASP/{repository.key}/{repository.default_branch}/leaders.md"
        )
    except (requests.exceptions.RequestException, ValueError) as e:
        logger.exception(
            "Failed to fetch leaders.md file",
            extra={"repository": repository.name, "error": str(e)},
        )
        return []
    leaders = []
    try:
        lines = content.split("\n")
        logger.debug("Content: %s", content)
        for line in lines:
            logger.debug("Processing line: %s", line)
            match = re.findall(r"\* \[([^\]]+)\]", line)
            leaders.extend(match)
    except AttributeError:
        logger.exception(
            "Unable to parse leaders.md content", extra={"repository": repository.name}
        )
    return leaders
```
```python
    try:
        lines = content.split("\n")
        logger.debug("Content: %s", content)
        for line in lines:
```
🛠️ Refactor suggestion
Check for empty content before processing.
There's no check for empty content before attempting to split and process it, which could lead to errors if the file is empty or not found.
Add a check for empty content:
```diff
 leaders = []
 try:
+    if not content:
+        logger.warning("Empty leaders.md content", extra={"repository": repository.name})
+        return leaders
     lines = content.split("\n")
     logger.debug("Content: %s", content)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    leaders = []
    try:
        if not content:
            logger.warning("Empty leaders.md content", extra={"repository": repository.name})
            return leaders
        lines = content.split("\n")
        logger.debug("Content: %s", content)
        for line in lines:
```
Could you move the logic to RepositoryBasedEntityMode::get_leaders?
upd: also some tests would be great 👍
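A rough sketch of what the requested move might look like, assuming a shared base class named along the lines of the comment above; every identifier here is hypothetical rather than the project's actual code:

```python
import re
from types import SimpleNamespace

class RepositoryBasedEntityModel:
    """Hypothetical base class for chapter/committee/project models."""

    owasp_repository = None  # set by the concrete models

    def get_leaders(self, fetch_file_content):
        """Parse leader names from this entity repository's leaders.md."""
        repo = self.owasp_repository
        content = fetch_file_content(
            f"https://raw.githubusercontent.com/OWASP/{repo.key}/"
            f"{repo.default_branch}/leaders.md"
        )
        if not content:
            return []
        return [
            name
            for line in content.split("\n")
            for name in re.findall(r"\*\s*\[([^\]]+)\]", line)
        ]
```

Each scrape command could then call the method on the entity itself instead of passing the repository to the scraper, keeping the three commands uniform.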
Fixes: #972