OSError: [Errno 22] Invalid argument #526

Open
BladeTransformerLLC opened this issue Dec 30, 2024 · 5 comments

@BladeTransformerLLC

Hi there,

Following the article below, I've been trying to run gsplat with the MCMC strategy and the bilateral grid (bilagrid):
https://medium.com/@heyulei/align-cameras-in-realitycapture-and-integrate-with-gaussian-splatting-gsplat-4203aa080aa0

However, when I run the command

python simple_trainer.py mcmc --data_factor 1 --use_bilateral_grid --data_dir [data] --result_dir [out]

I get the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python312\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\multiprocessing\spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError

Traceback (most recent call last):
  File "D:\Data\Programming\NeRFs\gsplat\examples\simple_trainer.py", line 1094, in <module>
    cli(main, cfg, verbose=True)
  File "C:\Python312\Lib\site-packages\gsplat-1.4.0-py3.12-win-amd64.egg\gsplat\distributed.py", line 360, in cli
    return _distributed_worker(0, 1, fn=fn, args=args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\site-packages\gsplat-1.4.0-py3.12-win-amd64.egg\gsplat\distributed.py", line 295, in _distributed_worker
    fn(local_rank, world_rank, world_size, args)
  File "D:\Data\Programming\NeRFs\gsplat\examples\simple_trainer.py", line 1039, in main
    runner.train()
  File "D:\Data\Programming\NeRFs\gsplat\examples\simple_trainer.py", line 551, in train
    trainloader_iter = iter(trainloader)
                       ^^^^^^^^^^^^^^^^^
  File "C:\Users\satyoshi\AppData\Roaming\Python\Python312\site-packages\torch\utils\data\dataloader.py", line 435, in __iter__
    self._iterator = self._get_iterator()
                     ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\satyoshi\AppData\Roaming\Python\Python312\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\satyoshi\AppData\Roaming\Python\Python312\site-packages\torch\utils\data\dataloader.py", line 1038, in __init__
    w.start()
  File "C:\Python312\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\multiprocessing\context.py", line 337, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\multiprocessing\popen_spawn_win32.py", line 95, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python312\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
OSError: [Errno 22] Invalid argument
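
The parent-side traceback ends in ForkingPickler(file, protocol).dump(obj), i.e. while pickling the worker Process object to hand it to the child. On Windows, multiprocessing only supports the spawn start method, so each DataLoader worker receives a pickled copy of its arguments, which includes the dataset. A rough way to check whether that payload is unusually large is to pickle the object the same way and measure it. This is a diagnostic sketch: pickled_size_mb is a hypothetical helper (not part of gsplat or PyTorch), and you would have to pass it the training dataset yourself.

```python
import multiprocessing
from multiprocessing.reduction import ForkingPickler


def pickled_size_mb(obj) -> float:
    """Hypothetical helper: size of obj when pickled the way spawn does it.

    The spawn start method (the only one available on Windows) sends each
    child a ForkingPickler-pickled copy of the Process object, which for a
    DataLoader worker includes the dataset.
    """
    payload = ForkingPickler.dumps(obj)  # memoryview over the pickled bytes
    return payload.nbytes / (1024 * 1024)


if __name__ == "__main__":
    # Stand-in object (about 1 MiB of raw bytes); in a real check you would
    # pass the training dataset instead, e.g. pickled_size_mb(trainset).
    example = {"images": [bytes(1024) for _ in range(1024)]}
    print("start method:", multiprocessing.get_start_method())
    print(f"pickled size: {pickled_size_mb(example):.1f} MB")
```

A very large result would be consistent with the MemoryError in the child and the failed write in the parent, but treat it as a hint rather than proof of the cause.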

My environment:
Windows 10
Python 3.12.6
PyTorch 2.4.1+cu118

@maturk
Collaborator

maturk commented Dec 30, 2024

This looks like a Python error. Can you try an older Python version, such as 3.10? I suggest using a conda environment, where you can control all installed dependencies and packages for reproducibility. Something like the following should work well:

# create conda environment
conda create --name gsplat -y python=3.10
conda activate gsplat
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit

# install gsplat from source
git clone --recurse-submodules https://github.com/nerfstudio-project/gsplat
cd gsplat/
pip install -e .
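
After these steps, a quick way to confirm that the environment actually picked up Python 3.10 and the cu118 PyTorch build is to print the relevant versions. describe_env is a hypothetical helper; besides the standard library it only reads torch.__version__, torch.version.cuda and torch.cuda.is_available(), and it skips the PyTorch fields when torch is not installed.

```python
import importlib.util
import multiprocessing
import platform


def describe_env() -> dict:
    """Hypothetical helper: collect details worth pasting into an issue."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        # "spawn" on Windows: worker processes get pickled copies of their args.
        "mp_start_method": multiprocessing.get_start_method(),
    }
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # e.g. "11.8" for a cu118 build
        info["cuda_available"] = torch.cuda.is_available()
    else:
        info["torch"] = None
    return info


if __name__ == "__main__":
    for key, value in describe_env().items():
        print(f"{key}: {value}")
```

Run it with the environment activated (after conda activate gsplat) so that it reports the interpreter and packages gsplat will actually use.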

@BladeTransformerLLC
Author

@maturk Thanks for the instructions. I reinstalled gsplat in a conda environment following them, but I still get the same error.

Traceback (most recent call last):
  File "D:\Data\Programming\NeRFs\gsplat\examples\simple_trainer.py", line 1094, in <module>
    cli(main, cfg, verbose=True)
  File "d:\data\programming\nerfs\gsplat\gsplat\distributed.py", line 360, in cli
    return _distributed_worker(0, 1, fn=fn, args=args)
  File "d:\data\programming\nerfs\gsplat\gsplat\distributed.py", line 295, in _distributed_worker
    fn(local_rank, world_rank, world_size, args)
  File "D:\Data\Programming\NeRFs\gsplat\examples\simple_trainer.py", line 1039, in main
    runner.train()
  File "D:\Data\Programming\NeRFs\gsplat\examples\simple_trainer.py", line 551, in train
    trainloader_iter = iter(trainloader)
  File "C:\Users\satyoshi\miniconda3\envs\gsplat\lib\site-packages\torch\utils\data\dataloader.py", line 433, in __iter__
    self._iterator = self._get_iterator()
  File "C:\Users\satyoshi\miniconda3\envs\gsplat\lib\site-packages\torch\utils\data\dataloader.py", line 386, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\satyoshi\miniconda3\envs\gsplat\lib\site-packages\torch\utils\data\dataloader.py", line 1039, in __init__
    w.start()
  File "C:\Users\satyoshi\miniconda3\envs\gsplat\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\satyoshi\miniconda3\envs\gsplat\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\satyoshi\miniconda3\envs\gsplat\lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
  File "C:\Users\satyoshi\miniconda3\envs\gsplat\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\satyoshi\miniconda3\envs\gsplat\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
OSError: [Errno 22] Invalid argument

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\satyoshi\miniconda3\envs\gsplat\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\satyoshi\miniconda3\envs\gsplat\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
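
This run shows both sides of the same failure: the parent raises OSError while writing the pickled worker, and the child reports _pickle.UnpicklingError: pickle data was truncated. One plausible reading, which the tracebacks alone do not prove, is that the child only received part of the stream before the parent's write failed. The sketch below reproduces that child-side error in isolation by loading a deliberately cut-off pickle; load_truncated is a hypothetical helper for illustration only.

```python
import pickle


def load_truncated(obj, keep_fraction: float = 0.5):
    """Hypothetical helper: pickle obj, cut the stream, then try to load it.

    Returns (exception name, message) if loading fails, or None if the
    kept bytes were enough to rebuild the object.
    """
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    partial = data[: int(len(data) * keep_fraction)]
    try:
        pickle.loads(partial)
    except (pickle.UnpicklingError, EOFError) as exc:
        # Cutting mid-record typically gives UnpicklingError("pickle data was
        # truncated"); cutting exactly between opcodes can give EOFError.
        return type(exc).__name__, str(exc)
    return None


if __name__ == "__main__":
    print(load_truncated(list(range(100_000))))
```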

Also, FYI: I had to apply the pycolmap fix from #275 before running simple_trainer.py.

Perhaps I should try installing gsplat in WSL2.

@maturk
Collaborator

maturk commented Dec 31, 2024

It is hard to debug, since I believe these are Windows-side issues (for both pycolmap and this error, it seems). Stack Exchange suggests that this is a bug in PyTorch on Windows, and that a fix is to set num_workers=1 instead of a value greater than 1 for PyTorch data loading (which uses Python's native multiprocessing library under the hood). I cannot replicate this error myself, since I use a Linux PC.

As a quick test, can you set this variable to 1 instead of 4:

num_workers=4,
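
Per the PyTorch DataLoader documentation, num_workers=0 loads data in the main process, whereas any value of 1 or more (including 1) still starts worker subprocesses that, under spawn, each receive a pickled copy of the dataset. A sketch of a platform-dependent switch follows; pick_num_workers and make_trainloader are hypothetical names, and the DataLoader arguments are illustrative rather than copied from simple_trainer.py.

```python
import sys


def pick_num_workers(requested: int = 4) -> int:
    """Hypothetical helper: choose a DataLoader worker count per platform.

    num_workers=0 keeps loading in the main process, so the dataset is never
    pickled and sent to a child process. Any value >= 1 spawns workers.
    """
    if sys.platform == "win32":
        return 0
    return requested


def make_trainloader(trainset, batch_size: int = 1):
    """Sketch of how the helper could be wired in (requires PyTorch)."""
    from torch.utils.data import DataLoader

    workers = pick_num_workers(4)
    return DataLoader(
        trainset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=workers,
        # persistent_workers is only accepted when num_workers > 0.
        persistent_workers=workers > 0,
    )
```

Deriving persistent_workers from the worker count matters because DataLoader raises a ValueError when persistent_workers=True is combined with num_workers=0.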

@BladeTransformerLLC
Author

@maturk Changing the variable to make it a single process worked (training started), until it crashed midway through training with the exact same error. I could use an EC2 instance with a GPU instead; the only reason I've been using Windows for this is that I need RealityCapture.

@maturk
Collaborator

maturk commented Jan 1, 2025

@BladeTransformerLLC, cool, thanks for testing. Yes, this confirms my belief that the problem stems from an external issue (PyTorch/Python on Windows) and not from anything specific to the gsplat code.
