Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce (experimental) DALI proxy #5726

Merged
merged 24 commits into from
Jan 14, 2025
Merged

Conversation

jantonguirao
Copy link
Contributor

@jantonguirao jantonguirao commented Nov 27, 2024

Category:

New feature

Description:

  • DALI proxy is a new way to integrate DALI pipelines with existing torch data loading pipelines
  • The idea is that torch data processes send data to the main process, where it is processed by DALI before handling it over to the training loop.
  • The solution allows for mixing data loading from Pytorch with partial processing on DALI

Co-author: @mdabek-nvidia

Additional information:

Affected modules and functionalities:

  • Torch plugin

Key points relevant for the review:

dali/python/nvidia/dali/plugin/pytorch/proxy/__init__.py

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • [?] RST
    • [?] Jupyter
    • Other
  • N/A
    TODO

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [20865442]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [20865442]: BUILD PASSED

@jantonguirao jantonguirao changed the title Dali proxy2 Introduce DALI proxy Nov 28, 2024
@jantonguirao jantonguirao marked this pull request as ready for review November 28, 2024 09:21
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
@szkarpinski
Copy link
Collaborator

I see there are no tests except for the resnet50 example. I believe we should have normal TL0 tests as well.

@szkarpinski
Copy link
Collaborator

Did you test error propagation between the processes? Multiprocessing doesn't automagically propagate exceptions afaik. Maybe we should have tests to check how are the errors in particular processes reported?

@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 2 times, most recently from caf2d0b to 0472490 Compare November 29, 2024 10:28
Copy link
Contributor

@klecki klecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posting few comments as the code started moving.
I also feel like we should limit the scope of the API and hide most of the implementation.

dali/python/setup.py.in Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 7 times, most recently from c4a7f74 to 1a50983 Compare December 3, 2024 17:52
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21050826]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21050826]: BUILD FAILED

@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 2 times, most recently from 3bdfd37 to 014032d Compare December 3, 2024 18:32
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21052236]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21052236]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [22445006]: BUILD PASSED

@jantonguirao jantonguirao requested a review from klecki January 14, 2025 10:59
Signed-off-by: Joaquin Anton Guirao <[email protected]>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [22484757]: BUILD STARTED

Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Comment on lines 616 to 619
if not DALIServer._needs_conversion(obj):
result = obj
# Named tuple: Reconstruct using `_replace`
elif hasattr(obj, "_replace") and hasattr(obj, "_fields"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not DALIServer._needs_conversion(obj):
result = obj
# Named tuple: Reconstruct using `_replace`
elif hasattr(obj, "_replace") and hasattr(obj, "_fields"):
# Named tuple: Reconstruct using `_replace`
if hasattr(obj, "_replace") and hasattr(obj, "_fields"):

self._cache_outputs[curr_batch_id] = curr_processed_outputs
return req_outputs

def _need_conversion(obj, need_conversion_cache):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _need_conversion(obj, need_conversion_cache):
@staticmethod
def _need_conversion(obj, need_conversion_cache):

Signed-off-by: Joaquin Anton Guirao <[email protected]>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [22487905]: BUILD STARTED

klecki
klecki previously approved these changes Jan 14, 2025
@klecki klecki dismissed their stale review January 14, 2025 14:44

Comments were addressed but I did only partial review.

@jantonguirao
Copy link
Contributor Author

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [22491477]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [22491477]: BUILD PASSED

@jantonguirao jantonguirao merged commit bc85d25 into NVIDIA:main Jan 14, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants