Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: locale folder reading + disambiguation #86

Merged
merged 7 commits into from
Jan 4, 2025
Merged

fix: locale folder reading + disambiguation #86

merged 7 commits into from
Jan 4, 2025

Conversation

JarbasAl
Copy link
Member

@JarbasAl JarbasAl commented Dec 31, 2024

companion to OpenVoiceOS/OVOS-workshop#315

Summary by CodeRabbit

  • New Features

    • Enhanced Wikipedia search functionality with improved concurrent data retrieval and error handling.
    • Added robust keyword extraction and matching mechanisms.
  • Refactor

    • Updated skill inheritance and method signatures for clarity and maintainability.
    • Improved data retrieval and result processing logic.
  • Bug Fixes

    • Enhanced handling for potential exceptions during data retrieval.
    • Implemented effective error management for missing language directories.
  • Chores

    • Updated dependency version constraints for ovos_workshop to a more recent version.

Copy link

coderabbitai bot commented Dec 31, 2024

Walkthrough

The pull request introduces significant enhancements to the Wikipedia-related functionality in the __init__.py file. The changes focus on improving the WikipediaSolver and WikipediaSkill classes by refactoring their implementation, adding type hints, and enhancing error handling. Modifications include updating method signatures, introducing new methods for concurrent data retrieval, changing class inheritance, and streamlining query matching and result processing mechanisms. Additionally, the requirements.txt file has been updated to reflect a newer version of a dependency.

Changes

File Changes
__init__.py - Added imports for concurrency and template expansion
- Updated rm_parentheses method signature
- Replaced get_data with get_page_data and introduced new get_data for concurrent fetching
- Refactored WikipediaSkill to inherit from OVOSSkill
- Renamed CQS_match_query_phrase to match_common_query
- Enhanced error handling and logging throughout
- Removed deprecated code and adjusted the main execution block
requirements.txt - Updated ovos_workshop version from >=0.0.15,<4.0.0 to >=3.3.0,<4.0.0

Poem

🐰 A Rabbit's Ode to Wiki Wisdom 🌐

In lines of code, a solver grows,
With type hints dancing on its toes,
Parentheses now neatly trimmed,
Our knowledge quest no longer dimmed!

Hop, hop, through data's grand terrain! 🚀


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
__init__.py (3)

107-124: Efficient summarization logic
Flattening the tokenized lines at [line 107] fits well with returning user-friendly summaries. Both the single-result flow ([lines 111–116]) and the multi-result flow ([lines 117–123]) seem logically consistent. Keep in mind potential edge cases where an article has a very short extract with few sentences. Otherwise, this is a clean approach.


Line range hint 261-282: match_common_query effectively delegates to ask_the_wiki
The revised method name (match_common_query) and signature properly signals the method’s purpose while returning consistent data. Associating a 0.6 confidence score at [line 282] is a strategic choice for the Common Query framework. Ensure that the approach for partial matches remains balanced if future expansions of the skill yield multiple potential answers.


Line range hint 350-373: Retain or remove debug prints as appropriate
These lines provide helpful testing and debug output. If they remain for demonstration purposes, consider toggling them with a debug flag or docstring references for clarity once deployed. Otherwise, removing or commenting them out keeps production logs clean.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da95ed4 and 0ccf788.

📒 Files selected for processing (1)
  • __init__.py (11 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
__init__.py

183-183: Ambiguous variable name: l

(E741)

🔇 Additional comments (8)
__init__.py (8)

14-27: Imports look good overall, but consider consolidating or removing any unused ones
The newly introduced imports (Optional, Tuple, List from typing, closest_supported_match, classproperty, flatten_list, etc.) appear purposeful and relevant to the refactor. If any imports end up being unused after the final iteration, consider removing them to keep the module clean.


32-32: Explicit return type for rm_parentheses
Defining the return type as str clarifies the function’s intent and ensures that the caller knows exactly what to expect. Good change!


38-38: Type alias WikiMatches improves code readability
Providing a descriptive alias for a list of tuples is a neat improvement, making it clearer that each tuple refers to (title, list_of_sentences, img_url).


87-92: Clear debugging and disambiguation logic
Logging the number of matched Wikipedia pages at [line 87] helps trace results. Skipping "disambiguation" pages at [line 91] is correct to avoid returning them as valid items. Good job ensuring the solver avoids hitting these placeholders.


98-98: Safe optional image handling
Declaring img: Optional[str] = None clarifies that an image may be absent. This approach is consistent with the puzzle's partial returns.


160-160: Updated class inheritance
Refactoring WikipediaSkill to inherit from OVOSSkill is consistent with the direction indicated in the PR summary. This ensures alignment with the new skill structure in OVOS.


250-260: cq_callback gracefully updates GUI context
The callback sets the WikiKnows context, displays search results, and logs any missing session in [lines 257–258]. This centralized approach effectively handles the UI flow and error logging.


439-440: Multilingual examples confirm robust language handling
Testing with Portuguese queries demonstrates the solver’s adaptability. Thorough language checks like these are pivotal when shipping cross-lingual features.

__init__.py Show resolved Hide resolved
@JarbasAl JarbasAl changed the title refactor: new common query decorators fix: locale folder reading + disambiguation Dec 31, 2024
@JarbasAl JarbasAl added the fix label Dec 31, 2024
@github-actions github-actions bot added fix and removed fix labels Dec 31, 2024
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
__init__.py (2)

66-85: LGTM! Well-structured error handling for Wikipedia API calls.

The method properly handles exceptions and includes informative error logging.

Consider making the timeout value configurable through a class attribute or parameter:

- response = requests.get(url, timeout=5).json()
+ response = requests.get(url, timeout=self.timeout).json()

Line range hint 370-460: Consider moving test cases to a dedicated test file.

The main execution block contains valuable test cases that should be moved to a proper test file for better organization and maintainability.

Consider creating a new file test_wikipedia.py and moving these test cases there using a proper testing framework like pytest.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ccf788 and d16b325.

📒 Files selected for processing (1)
  • __init__.py (11 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
__init__.py

202-202: Ambiguous variable name: l

(E741)

🔇 Additional comments (5)
__init__.py (5)

Line range hint 33-37: LGTM! Type hints improve code clarity.

The addition of type hints to rm_parentheses function makes the code more maintainable and helps with static type checking.


87-142: Excellent implementation of concurrent Wikipedia page fetching!

The concurrent implementation using ThreadPoolExecutor is a significant performance improvement. The code:

  • Maintains result order using index mapping
  • Properly handles failed requests
  • Includes fallback logic for keyword extraction

269-279: LGTM! Well-structured callback with proper error handling.

The callback method properly handles GUI display and context setting with appropriate error logging.


Line range hint 280-301: LGTM! Type hints and return value structure are clear.

The method signature and implementation are well-documented with type hints, making the code more maintainable.


201-205: 🛠️ Refactor suggestion

Rename variable 'l' to improve readability.

The variable name 'l' is ambiguous and could be confused with the number '1'.

Apply this diff to improve readability:

- for l in f.read().split("\n"):
-     if not l.strip() or l.startswith("#"):
-         continue
-     samples += expand_template(l)
+ for line in f.read().split("\n"):
+     if not line.strip() or line.startswith("#"):
+         continue
+     samples += expand_template(line)

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

202-202: Ambiguous variable name: l

(E741)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
__init__.py (3)

87-142: Consider enhancing concurrent execution with timeout and specific error handling.

The concurrent implementation is good, but could be improved for robustness.

Consider these enhancements:

 def get_data(self, query: str, lang: Optional[str] = None, units: Optional[str] = None):
     """Fetch Wikipedia search results and detailed data concurrently."""
     LOG.debug(f"WikiSolver query: {query}")
     lang = (lang or self.default_lang).split("-")[0]
     search_url = (
         f"https://{lang}.wikipedia.org/w/api.php?action=query&list=search&"
         f"srsearch={query}&format=json"
     )

     try:
-        search_results = requests.get(search_url, timeout=5).json().get("query", {}).get("search", [])
+        response = requests.get(search_url, timeout=5)
+        response.raise_for_status()  # Raise exception for bad status codes
+        search_results = response.json().get("query", {}).get("search", [])
-    except Exception as e:
+    except requests.RequestException as e:
         LOG.error(f"Error fetching search results: {e}")
         search_results = []

     if not search_results:
         fallback_query = self.extract_keyword(query, lang)
         if fallback_query and fallback_query != query:
             LOG.debug(f"WikiSolver Fallback, new query: {fallback_query}")
             return self.get_data(fallback_query, lang=lang, units=units)
         return {}

     LOG.debug(f"Matched {len(search_results)} Wikipedia pages")

     # Prepare for parallel fetch and maintain original order
     summaries = [None] * len(search_results)  # List to hold results in original order
-    with concurrent.futures.ThreadPoolExecutor() as executor:
+    with concurrent.futures.ThreadPoolExecutor() as executor:
+        # Add timeout to prevent hanging on slow requests
+        future_to_idx = {
+            executor.submit(self.get_page_data, str(r["pageid"]), lang): idx
+            for idx, r in enumerate(search_results)
+            if "(disambiguation)" not in r["title"]
+        }
+        
+        try:
+            for future in concurrent.futures.as_completed(future_to_idx, timeout=30):
                 idx = future_to_idx[future]
                 title, ans, img = future.result()
                 if title and ans:
                     summaries[idx] = (title, ans, img)
+        except concurrent.futures.TimeoutError:
+            LOG.error("Timeout while fetching Wikipedia pages")
+            # Continue with partial results

Line range hint 179-205: Improve variable naming in file reading loop.

The code uses 'l' as a loop variable which is flagged by static analysis as ambiguous.

Apply this change:

 with open(filename) as f:
-    for l in f.read().split("\n"):
-        if not l.strip() or l.startswith("#"):
+    for line in f.read().split("\n"):
+        if not line.strip() or line.startswith("#"):
             continue
-        samples += expand_template(l)
+        samples += expand_template(line)
🧰 Tools
🪛 Ruff (0.8.2)

202-202: Ambiguous variable name: l

(E741)


Line range hint 369-459: Consider moving test cases to a dedicated test file.

The main execution block contains valuable test cases, but they would be better organized in a dedicated test file. This would:

  • Make the main module cleaner
  • Allow for proper test discovery and execution
  • Enable CI/CD integration

Would you like me to help create a separate test file with these test cases?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d16b325 and 599f452.

📒 Files selected for processing (1)
  • __init__.py (11 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
__init__.py

202-202: Ambiguous variable name: l

(E741)

🔇 Additional comments (3)
__init__.py (3)

12-28: LGTM! Good improvements to imports and type annotations.

The changes enhance type safety and add concurrent execution capabilities. The rm_parentheses function's type annotation is a good practice.

Also applies to: 33-37


66-85: LGTM! Well-structured page data retrieval with proper error handling.

The implementation correctly handles API responses and includes error handling.


188-198: LGTM! Good improvement in language support.

The use of closest_supported_match is a good addition for handling language directories robustly.

@JarbasAl JarbasAl marked this pull request as draft December 31, 2024 04:30
@JarbasAl JarbasAl marked this pull request as ready for review December 31, 2024 04:51
@github-actions github-actions bot added fix and removed fix labels Dec 31, 2024
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
__init__.py (1)

Line range hint 374-465: Move test cases to a dedicated test file.

The main execution block contains test cases that should be moved to a proper test file for better organization and maintainability.

Consider creating a new file test_wikipedia.py and moving all test cases there using a proper testing framework like pytest.

🧹 Nitpick comments (1)
__init__.py (1)

90-145: Consider adding timeout configuration for concurrent requests.

While the concurrent implementation is good, consider making the timeout value configurable and adding a limit to the ThreadPoolExecutor to prevent resource exhaustion.

 def get_data(self, query: str, lang: Optional[str] = None, units: Optional[str] = None):
+    TIMEOUT = self.config.get('request_timeout', 5)
+    MAX_WORKERS = self.config.get('max_workers', 5)
     LOG.debug(f"WikiSolver query: {query}")
     lang = (lang or self.default_lang).split("-")[0]
     search_url = (
         f"https://{lang}.wikipedia.org/w/api.php?action=query&list=search&"
         f"srsearch={query}&format=json"
     )

     try:
-        search_results = requests.get(search_url, timeout=5).json().get("query", {}).get("search", [])
+        search_results = requests.get(search_url, timeout=TIMEOUT).json().get("query", {}).get("search", [])
     except Exception as e:
         LOG.error(f"Error fetching search results: {e}")
         search_results = []

     if not search_results:
         fallback_query = self.extract_keyword(query, lang)
         if fallback_query and fallback_query != query:
             LOG.debug(f"WikiSolver Fallback, new query: {fallback_query}")
             return self.get_data(fallback_query, lang=lang, units=units)
         return {}

     LOG.debug(f"Matched {len(search_results)} Wikipedia pages")

     # Prepare for parallel fetch and maintain original order
     summaries = [None] * len(search_results)  # List to hold results in original order
-    with concurrent.futures.ThreadPoolExecutor() as executor:
+    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 599f452 and 628ddd5.

📒 Files selected for processing (2)
  • __init__.py (10 hunks)
  • requirements.txt (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
__init__.py

205-205: Ambiguous variable name: l

(E741)

🔇 Additional comments (4)
requirements.txt (1)

2-2: Verify compatibility with the updated ovos_workshop version.

The significant version bump from >=0.0.15 to >=3.3.0 might introduce breaking changes. Please ensure compatibility with other dependencies.

✅ Verification successful

Let me gather more information about the dependencies to verify compatibility.


Let me check the specific usage of ovos_workshop components to ensure compatibility with the new version.


Based on the analysis of the code and the version changes, I can now provide a final response:

The version upgrade to ovos_workshop>=3.3.0 is compatible with the codebase

The code review shows that the skill uses stable APIs from ovos_workshop:

  • intent_handler and common_query decorators
  • IntentBuilder for intent construction
  • OVOSSkill and CommonQuerySkill base classes

The latest release (3.3.0) only adds new features (common_query_decorator) and fixes (voc_match ignore case) without breaking changes. The skill's usage of ovos_workshop APIs remains compatible with the new version.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for potential version conflicts with other dependencies
gh api graphql -f query='
{
  repository(owner: "OpenVoiceOS", name: "OVOS-workshop") {
    releases(first: 1, orderBy: {field: CREATED_AT, direction: DESC}) {
      nodes {
        tagName
        description
        createdAt
      }
    }
  }
}'

Length of output: 871


Script:

#!/bin/bash
# Check for dependencies that might use ovos_workshop
rg -l "ovos_workshop" --type-add 'req:*requirements.txt' -treq

# Check imports and usage patterns
rg "from ovos_workshop|import ovos_workshop" -A 2

Length of output: 882


Script:

#!/bin/bash
# Check the content of __init__.py to see how the imported components are used
cat __init__.py

# Check the test files to understand the testing dependencies
cat test/unittests/test_skill.py
cat test/unittests/test_skill_loading.py

Length of output: 33496

__init__.py (3)

Line range hint 33-37: LGTM! Type hints improve code clarity.

The addition of type hints to rm_parentheses enhances code maintainability and IDE support.


66-88: LGTM! Robust error handling in page data fetching.

The implementation includes proper error handling and disambiguation page detection.


204-208: 🛠️ Refactor suggestion

Replace ambiguous variable name.

The variable name l is ambiguous and flagged by static analysis. Consider using a more descriptive name.

-                for l in f.read().split("\n"):
-                    if not l.strip() or l.startswith("#"):
+                for line in f.read().split("\n"):
+                    if not line.strip() or line.startswith("#"):
                         continue
-                    samples += expand_template(l)
+                    samples += expand_template(line)

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

205-205: Ambiguous variable name: l

(E741)

@JarbasAl JarbasAl merged commit e8cc9f5 into dev Jan 4, 2025
3 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant