Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(visual-prompt): Use transformer GroudingDino to replace third party code #29

Merged
merged 11 commits into from
Sep 11, 2024

Conversation

dandansamax
Copy link
Collaborator

@dandansamax dandansamax commented Sep 1, 2024

transformers v4.44 provides stable GroundingDino implementation.

This PR removes all third party code and use transformers instead.

close #14

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Introduced a new module for object detection and OCR, enhancing image processing capabilities.
    • Added functions for detecting objects and text within images, providing more advanced visual recognition.
    • Implemented logging functionality with customizable log levels for better monitoring.
  • Bug Fixes

    • Improved code readability by updating conditional checks to follow best practices.
  • Chores

    • Updated project dependencies to newer versions, optimizing performance and compatibility.
    • Removed unnecessary dependencies to streamline the project.
  • Tests

    • Added unit tests for new visual prompt actions, ensuring functionality and reliability of image processing features.

@dandansamax dandansamax requested a review from WHALEEYE September 1, 2024 16:19
Copy link

coderabbitai bot commented Sep 1, 2024

Walkthrough

The changes involve updates across several Python files, including enhancements to linting directives, modifications to import statements, and the introduction of a new module for visual prompt actions. Key functions for object detection and optical character recognition (OCR) have been added, along with corresponding unit tests to ensure their functionality. Additionally, the pyproject.toml file has been revised to include new dependencies and updated versions, while performance monitoring capabilities have been integrated into various methods.

Changes

Files Change Summary
crab-benchmark-v0/dataset/android_subtasks.py, crab-benchmark-v0/dataset/handmade_tasks.py Updated conditional check for mail_node from == None to is None for improved readability; modified linting directives for ruff and exempted wildcard imports from linting rules.
crab-benchmark-v0/main.py Changed import statements to source functions from crab.actions.visual_prompt_actions instead of a local module; added logging functionality with a new command-line argument.
crab/actions/visual_prompt_actions.py Introduced new module with functions for object detection and OCR, including get_groundingdino_boxes and groundingdino_easyocr, along with several utility functions.
pyproject.toml Added setuptools dependency, updated versions for transformers and torch, and removed several outdated dependencies related to groundingdino.
test/actions/test_visual_prompt_actions.py Created new test file with unit tests for get_groundingdino_boxes and groundingdino_easyocr functions, ensuring functionality and validating outputs.
crab/agents/policies/single_agent.py, crab/core/benchmark.py, crab/core/environment.py Added @timed decorator to methods for performance monitoring, enhancing execution time tracking.
crab/utils/measure.py Introduced a new logging decorator timed to measure and log the execution time of functions.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant VisualPromptActions
    participant ImageProcessing
    participant OCR

    User->>VisualPromptActions: Request object detection and OCR
    VisualPromptActions->>ImageProcessing: Call get_groundingdino_boxes
    ImageProcessing-->>VisualPromptActions: Return detected boxes
    VisualPromptActions->>OCR: Call get_easyocr_boxes
    OCR-->>VisualPromptActions: Return recognized text
    VisualPromptActions-->>User: Provide annotated image and text results
Loading

🐰 In the code, I hop and play,
With bounding boxes on display.
From images, I find the light,
With OCR, my vision's bright!
New functions bloom, oh what a sight!
Let's celebrate this coding night! 🌟

Assessment against linked issues

Objective Addressed Explanation
Change GroundingDino action to transformers version (#14)
Delete the third-party code from CRAB (#14)
Introduce transformers GroundingDino Model (#14)

Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between faa56c0 and fc9d351.

Files selected for processing (1)
  • crab/actions/visual_prompt_actions.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • crab/actions/visual_prompt_actions.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Outside diff range, codebase verification and nitpick comments (3)
test/actions/test_visual_prompt_actions.py (2)

14-25: Improve the test by using a local test image and asserting the content of the result.

Consider the following improvements:

  1. Use a local test image instead of fetching it from a URL. This will make the test more reliable and faster.
  2. Assert the content of the result, not just the structure. This will make the test more comprehensive.

Apply this diff to improve the test:

 def test_get_groundingdino_boxes_single_image():
-    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-    image = Image.open(requests.get(url, stream=True).raw)
+    test_dir = Path(__file__).parent.parent
+    image_path = test_dir / "_assets" / "cat.jpg"
+    image = Image.open(image_path)
     text = "a cat."
 
     box_threshold = 0.4
     text_threshold = 0.3
     result = get_groundingdino_boxes(image, text, box_threshold, text_threshold)
     assert len(result) == 1
-    assert len(result[0]) > 0
-    assert len(result[0][0]) == 2
+    assert result[0] == [(100, 200), (300, 400)]

27-43: Improve the test by using local test images and asserting the content of the result.

Consider the following improvements:

  1. Use local test images instead of fetching them from URLs. This will make the test more reliable and faster.
  2. Assert the content of the result, not just the structure. This will make the test more comprehensive.

Apply this diff to improve the test:

 def test_get_groundingdino_boxes_multi_image():
-    url1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
-    url2 = "https://farm5.staticflickr.com/4005/4666183752_c5b79faa17_z.jpg"
-    image1 = Image.open(requests.get(url1, stream=True).raw)
-    image2 = Image.open(requests.get(url2, stream=True).raw)
+    test_dir = Path(__file__).parent.parent
+    image1_path = test_dir / "_assets" / "cat.jpg"
+    image2_path = test_dir / "_assets" / "car.jpg"
+    image1 = Image.open(image1_path)
+    image2 = Image.open(image2_path)
     text = "a cat. a car."
 
     box_threshold = 0.4
     text_threshold = 0.3
     result = get_groundingdino_boxes(
         [image1, image2], text, box_threshold, text_threshold
     )
     assert len(result) == 2
-    assert len(result[0]) > 0
-    assert len(result[1]) > 0
-    assert len(result[0][0]) == 2
+    assert result[0] == [(100, 200), (300, 400)]
+    assert result[1] == [(500, 600), (700, 800)]
pyproject.toml (1)

63-64: Potential Impact of Removing numpy Dependency

The numpy library is still being used in the following files, which suggests that its removal from the dependencies could affect the project's functionality:

  • crab-benchmark-v0/dataset/ubuntu_subtasks.py
  • crab/actions/visual_prompt_actions.py

Please verify whether these usages are necessary and consider re-adding numpy if required. No references to torchvision, addict, yapf, matplotlib, or pycocotools were found, indicating their removal might not impact the project.

Analysis chain

Removal of several dependencies.

Dependencies such as torchvision, numpy, addict, yapf, matplotlib, and pycocotools have been removed. This suggests a shift away from these libraries, likely due to the adoption of GroundingDino which might not require these. Confirm that the removal of these dependencies does not impact other parts of the project that might still rely on them.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify if removed dependencies are still referenced in the project.

# Test: Search for imports of the removed libraries in Python files.
rg --type python '(import|from) (torchvision|numpy|addict|yapf|matplotlib|pycocotools)'

Length of output: 120


Script:

#!/bin/bash
# Description: Verify if removed dependencies are still referenced in the project.

# Test: Search for imports of the removed libraries in Python files by specifying the file extension.
rg '(import|from) (torchvision|numpy|addict|yapf|matplotlib|pycocotools)' --glob '*.py' -A 2

Length of output: 628

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 69f48b6 and 90ba6b9.

Files ignored due to path filters (3)
  • poetry.lock is excluded by !**/*.lock
  • test/_assets/android_screenshot.png is excluded by !**/*.png
  • test/_assets/ubuntu_screenshot.png is excluded by !**/*.png
Files selected for processing (6)
  • crab-benchmark-v0/dataset/android_subtasks.py (1 hunks)
  • crab-benchmark-v0/dataset/handmade_tasks.py (2 hunks)
  • crab-benchmark-v0/main.py (2 hunks)
  • crab/actions/visual_prompt_actions.py (1 hunks)
  • pyproject.toml (2 hunks)
  • test/actions/test_visual_prompt_actions.py (1 hunks)
Files skipped from review due to trivial changes (3)
  • crab-benchmark-v0/dataset/android_subtasks.py
  • crab-benchmark-v0/dataset/handmade_tasks.py
  • crab-benchmark-v0/main.py
Additional comments not posted (8)
test/actions/test_visual_prompt_actions.py (1)

45-61: LGTM!

The code changes are approved.

pyproject.toml (2)

30-30: Addition of setuptools dependency.

The addition of setuptools with version ^73.0.1 is noted. This is a common dependency for managing packages and dependencies in Python, often required for building and installing modules. Ensure that this addition is necessary for the project's build process, especially since poetry is already being used as the build system and dependency manager.


63-64: Update of transformers and torch dependencies.

The versions of transformers and torch have been updated to 4.44.1 and ^2.4.0 respectively. This aligns with the PR's objective to use the latest stable versions for improved functionality and support. Ensure that these updates are compatible with other project dependencies and that no breaking changes affect the project's functionality.

crab/actions/visual_prompt_actions.py (5)

1-13: Review the license and copyright header.

The copyright and licensing information is clearly stated and follows standard practices. Ensure that this header is consistent with the project's licensing policies.


33-49: Review utility functions for geometric calculations.

The functions calculate_iou, calculate_size, and calculate_center are implemented correctly and efficiently. These functions are essential for handling bounding boxes in image processing tasks.


136-150: Review the get_grounding_dino_model function.

The function is well-documented and uses caching to avoid redundant model loading, which is a good practice for performance. The implementation correctly handles the model loading with a dynamic type selection. Ensure that the device variable is globally available as it is used here.


271-282: Review the get_elements_prompt action function.

The function constructs a user prompt based on detected elements. The implementation is straightforward and correctly formats the output. However, consider adding error handling for cases where the input might not be as expected.


14-22: Review the import statements.

The import statements are generally well-organized. However, the repeated import of torch in lines 16 and 26 is redundant and should be removed to clean up the code.

Suggest removing the duplicate import:

- import torch

Likely invalid or redundant comment.

Comment on lines 76 to 96
def filter_boxes_by_center(boxes_with_label, center_dis_thresh):
boxes = [box[0] for box in boxes_with_label]
boxes_to_remove = set()
for i in range(len(boxes)):
if i in boxes_to_remove:
continue
center_i = calculate_center(boxes[i])
for j in range(i + 1, len(boxes)):
center_j = calculate_center(boxes[j])
# fmt: off
center_close = ((center_i[0] - center_j[0]) ** 2 +
(center_i[1] - center_j[1]) ** 2 <
center_dis_thresh**2)
# fmt: on
if center_close:
boxes_to_remove.add(j)

boxes_filt = [
box for idx, box in enumerate(boxes_with_label) if idx not in boxes_to_remove
]
return boxes_filt
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimize and refactor filter_boxes_by_center.

The function uses nested loops to compare box centers, which can be inefficient for a large number of boxes. Consider using a more efficient algorithm or data structure, such as a k-d tree, to handle spatial queries more efficiently.

Consider refactoring to use spatial data structures for efficiency:

from scipy.spatial import KDTree

def filter_boxes_by_center(boxes_with_label, center_dis_thresh):
    centers = [calculate_center(box[0]) for box in boxes_with_label]
    tree = KDTree(centers)
    boxes_to_remove = set()
    for i, center in enumerate(centers):
        if i in boxes_to_remove:
            continue
        # Query the tree for neighbors within the distance threshold
        indices = tree.query_ball_point(center, center_dis_thresh)
        for j in indices:
            if j > i:
                boxes_to_remove.add(j)
    return [box for idx, box in enumerate(boxes_with_label) if idx not in boxes_to_remove]

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 90ba6b9 and f369114.

Files selected for processing (1)
  • crab/actions/visual_prompt_actions.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • crab/actions/visual_prompt_actions.py

box1Area = (box1[2] - box1[0]) * (box1[3] - box1[1])
box2Area = (box2[2] - box2[0]) * (box2[3] - box2[1])
unionArea = box1Area + box2Area - interArea
iou = interArea / unionArea

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unionArea could be 0 when one of the box area is zero or there is no intersection area between two boxes
if unionArea == 0: return 0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The union area is the whole occupied area of of box1 and box2. So I think UnionArea==0 only happens when box1 and box2 are both 0, which may be impossible in our situation. But checking can be useful some time in the future. Thanks

Copy link
Contributor

@WHALEEYE WHALEEYE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM. Some type hints on functions (private, I know) would improve the readability tho



@action(local=True)
def get_elements_prompt(input: tuple[str, list[tuple[BoxType, str]]], env):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this, but is the env param here necessary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The env param is refer to the Environment object where the action is executed. It's like self in a class method.

@dandansamax
Copy link
Collaborator Author

Running new version on a CUDA machine will highly increse the inference speed. Here is a comparision:

GPU: RTX4500.

Old:

========================================
Start agent step 0:
2024-09-10 17:35:12,157 DEBUG -- Environment.observe ran in 0.18s with name ubuntu
2024-09-10 17:35:31,640 DEBUG -- Environment.observe_with_prompt ran in 19.66s with name ubuntu
2024-09-10 17:35:32,244 DEBUG -- Environment.observe ran in 0.6s with name android
2024-09-10 17:35:43,427 DEBUG -- Environment.observe_with_prompt ran in 11.79s with name android
2024-09-10 17:35:43,427 DEBUG -- Environment.observe ran in 0.0s with name root
2024-09-10 17:35:43,427 DEBUG -- Environment.observe_with_prompt ran in 0.0s with name root
2024-09-10 17:35:43,427 DEBUG -- Benchmark.observe_with_prompt ran in 31.45s with name ubuntu_android_benchmark
2024-09-10 17:35:54,046 DEBUG -- SingleAgentPolicy.chat ran in 9.61s
So agent take action: [ActionOutput(name='write_text', arguments={'text': 'restaurant around kaust'}, env='ubuntu')]
2024-09-10 17:35:55,911 DEBUG -- Benchmark.step ran in 1.86s with name ubuntu_android_benchmark
Action "write_text" in env "ubuntu" success. current evaluation results: {'total_nodes': 3, 'complete_nodes': 0, 'completeness': 0.0, 'completeness_per_action': 0.0, 'step_to_complete': 2, 'longest_unfinished_path_length': 2}

New:

========================================
Start agent step 0:
2024-09-10 18:25:29,325 DEBUG -- Environment.observe ran in 0.37s with name ubuntu
2024-09-10 18:26:10,451 DEBUG -- Environment.observe_with_prompt ran in 41.49s with name ubuntu
2024-09-10 18:26:11,126 DEBUG -- Environment.observe ran in 0.67s with name android
2024-09-10 18:26:12,115 DEBUG -- Environment.observe_with_prompt ran in 1.66s with name android
2024-09-10 18:26:12,115 DEBUG -- Environment.observe ran in 0.0s with name root
2024-09-10 18:26:12,115 DEBUG -- Environment.observe_with_prompt ran in 0.0s with name root
2024-09-10 18:26:12,115 DEBUG -- Benchmark.observe_with_prompt ran in 43.16s with name ubuntu_android_benchmark
2024-09-10 18:26:23,468 DEBUG -- SingleAgentPolicy.chat ran in 9.19s
So agent take action: [ActionOutput(name='search_application', arguments={'name': 'firefox'}, env='ubuntu')]
2024-09-10 18:26:26,936 DEBUG -- Benchmark.step ran in 3.47s with name ubuntu_android_benchmark
Action "search_application" in env "ubuntu" success. current evaluation results: {'total_nodes': 3, 'complete_nodes': 0, 'completeness': 0.0, 'completeness_per_action': 0.0, 'step_to_complete': 2, 'longest_unfinished_path_length': 2}

========================================
Start agent step 1:
2024-09-10 18:26:29,049 DEBUG -- Environment.observe ran in 0.11s with name ubuntu
2024-09-10 18:26:29,630 DEBUG -- Environment.observe_with_prompt ran in 0.69s with name ubuntu
2024-09-10 18:26:30,226 DEBUG -- Environment.observe ran in 0.6s with name android
2024-09-10 18:26:31,082 DEBUG -- Environment.observe_with_prompt ran in 1.45s with name android
2024-09-10 18:26:31,083 DEBUG -- Environment.observe ran in 0.0s with name root
2024-09-10 18:26:31,083 DEBUG -- Environment.observe_with_prompt ran in 0.0s with name root
2024-09-10 18:26:31,083 DEBUG -- Benchmark.observe_with_prompt ran in 2.14s with name ubuntu_android_benchmark
2024-09-10 18:26:40,518 DEBUG -- SingleAgentPolicy.chat ran in 8.48s
So agent take action: [ActionOutput(name='click', arguments={'element': 20}, env='ubuntu')]
2024-09-10 18:26:42,503 DEBUG -- Benchmark.step ran in 1.98s with name ubuntu_android_benchmark
Action "click" in env "ubuntu" success. current evaluation results: {'total_nodes': 3, 'complete_nodes': 0, 'completeness': 0.0, 'completeness_per_action': 0.0, 'step_to_complete': 2, 'longest_unfinished_path_length': 2}

========================================
Start agent step 2:
2024-09-10 18:26:44,688 DEBUG -- Environment.observe ran in 0.18s with name ubuntu
2024-09-10 18:26:45,589 DEBUG -- Environment.observe_with_prompt ran in 1.08s with name ubuntu
2024-09-10 18:26:46,327 DEBUG -- Environment.observe ran in 0.74s with name android
2024-09-10 18:26:47,234 DEBUG -- Environment.observe_with_prompt ran in 1.64s with name android
2024-09-10 18:26:47,234 DEBUG -- Environment.observe ran in 0.0s with name root
2024-09-10 18:26:47,234 DEBUG -- Environment.observe_with_prompt ran in 0.0s with name root
2024-09-10 18:26:47,234 DEBUG -- Benchmark.observe_with_prompt ran in 2.73s with name ubuntu_android_benchmark
2024-09-10 18:26:55,694 DEBUG -- SingleAgentPolicy.chat ran in 7.46s
So agent take action: [ActionOutput(name='click', arguments={'element': 16}, env='ubuntu')]
2024-09-10 18:26:57,675 DEBUG -- Benchmark.step ran in 1.98s with name ubuntu_android_benchmark
Action "click" in env "ubuntu" success. current evaluation results: {'total_nodes': 3, 'complete_nodes': 0, 'completeness': 0.0, 'completeness_per_action': 0.0, 'step_to_complete': 2, 'longest_unfinished_path_length': 2}

Copy link
Contributor

@WHALEEYE WHALEEYE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thx

@dandansamax dandansamax merged commit 3b36289 into main Sep 11, 2024
1 check passed
@dandansamax dandansamax deleted the refactor/remove-third-party-code branch September 11, 2024 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Change GroundingDino action to transformers version
3 participants