FBGEMM version mismatch on ARM #304
Hi @ayanchak1508. You can just remove the version requirement in this file locally; it should be inside `$HOME/repos/mlcommons@cm4mlops/script/`. We never had success using a higher version of fbgemm with the available inference implementation. If you can share the exact versions which worked, we can test them.
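A minimal sketch of that local edit could look like the following; the script directory and file name are placeholders, since the exact file is not shown in the comment above:

```bash
# Hypothetical example: relax the fbgemm-gpu pin in the local CM script's
# requirements file so pip can pick an ARM-compatible release.
# <dlrm-script-dir> is a placeholder; the actual directory under
# $HOME/repos/mlcommons@cm4mlops/script/ is not named in this thread.
sed -i 's/fbgemm-gpu==0.3.2/fbgemm-gpu>=0.5.0/' \
  "$HOME/repos/mlcommons@cm4mlops/script/<dlrm-script-dir>/requirements.txt"
```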
Thanks for the quick reply! These are the versions (changed from the defaults) that work for me; I have attached the full requirements.txt file in case it is needed. I sometimes run into a bus error.
Thanks a lot @ayanchak1508. Let me check that. This issue might help with the bus error.
Yes, with PyTorch 2.4 we could use fbgemm_gpu==0.8.0 and it worked fine. We have removed the version dependency in the CM script now. You can just do
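The command referred to here was not captured in the thread. Separately, a minimal sketch of installing the reported working combination with pip, assuming torchrec 0.8.0 is the release that pairs with torch 2.4 and fbgemm-gpu 0.8.0, could be:

```bash
# Sketch: CPU-only install of the combination reported to work above.
# torch 2.4 + fbgemm-gpu 0.8.0 come from the comment; torchrec==0.8.0 is an assumption.
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cpu
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cpu
pip install torchrec==0.8.0
# On ARM, the CPU-only variant may be needed instead (as noted in the issue body):
# pip install fbgemm-gpu-cpu==0.8.0
```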
Just to add
Thanks a lot for the quick updates! I did a fresh, clean setup to see the effects. I have two observations:
I'm not sure if I'm doing anything wrong, but if I create a new virtual environment and use the requirements file I posted earlier, the benchmark runs without problems. Maybe this is an ARM-specific problem? Regarding the other observation, I guess one possible solution could be to edit the conf file manually, but is there a better way?
For 1, maybe the problem is with the .whl file? "but if I create a new virtual environment and use the requirements file I posted earlier, the benchmark runs without problems." Is it on the same ARM machine? If so, you can try the venv for the CM flow also, as follows:
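The exact command was not captured here; assuming the intent is the standard CM python-venv script from the MLPerf automation docs, it would look roughly like:

```bash
# Assumed CM command for creating a dedicated virtual environment that the
# CM flow then uses for subsequent runs (tags and --name are assumptions).
cm run script --tags=install,python-venv --name=mlperf
```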
For the bus error - what's the available RAM on the system?
Sorry, I should have been more specific. The runs are on a clean, empty Docker container (ubuntu:22.04) on an ARM server. I created two Python venvs (in the same container), one for installing packages through the CM-based flow and one for installing packages from the requirements file. I didn't use the command you mentioned; I simply created a normal Python venv as described here: https://docs.mlcommons.org/inference/install/ and ran the CM commands for the benchmark there. Does the command you mentioned do something more? For the bus error, the RAM is not huge, about ~250GB (the Docker container has no resource constraints). I remember facing a similar problem before when I processed the dataset myself some time back, and I had to move to a different machine with 512 GB RAM. So I understand it may not be big enough to run the entire dataset, but it should be fine at least for the debug dataset?
Thank you. Yes, the commands are a bit different. Coming to 256GB, that should be good enough. We have run DLRMv2 in full comfortably on 192GB. It worked even on 64GB, but we had to use a lot of swap space. I believe your problem could be the shm size, since Docker is used. Are you explicitly setting the shm size during docker run? We typically set a 32GB shm size for DLRM.
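A sketch of what that would look like; the image name and the inner command are placeholders, not the actual ones used in this thread:

```bash
# Start the container with a 32 GB shared-memory segment instead of
# Docker's 64 MB default; image and command are placeholders.
docker run --shm-size=32g -it ubuntu:22.04 /bin/bash
```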
Thank you very much for the clarification! I did not set the shm size, and the default seems to be 64MB, much smaller than the 32GB you mentioned. Thanks once again for all the quick help.
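For reference, a quick way to confirm the effective shared-memory size from inside a running container:

```bash
# /dev/shm reflects the effective --shm-size inside the container
df -h /dev/shm
```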
Sure @ayanchak1508. Just a correction to what I said earlier: the 64GB system where we had run DLRMv2 was on GPUs, not CPUs. On CPUs we could only do a test run on 192GB for 10 inputs.
Update:
These are the packages it installs in the
I am not sure why I had to reinstall the same version of
Sorry @ayanchak1508, I missed replying to you. We now have a GitHub Action for the DLRMv2 CPU run and you can see the logs. The recent ones are failures due to a change in the inference code, which is fixed now. Please let us know if your problems are solved.
I was trying to run the DLRMv2 benchmark of MLPerf Inference on an ARM server using the instructions here.
I run into the issue when the tool tries to install `torchrec==0.3.2`. `torchrec==0.3.2` requires `fbgemm-gpu==0.3.2`, but `fbgemm-gpu` only introduced support for ARM starting from v0.5.0: https://download.pytorch.org/whl/cpu/fbgemm-gpu/

I tried two alternate approaches, including manually installing a newer `fbgemm-gpu` (v0.5.0 or above), but the `cm` tool remains inflexible and keeps trying to search for v0.3.2. Previously, I did run the benchmark without any problems on ARM (without using the `cm` tool) using newer versions of `fbgemm-gpu`. (Note that I did need to use `fbgemm-gpu-cpu` too.)

Command to reproduce the issue:
Error message:
The repro folder and the logfile are present in the attached tarball.
cm-repro.tar.gz
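The exact command and error output were not captured above (they are in the attached tarball). For context, a representative CM invocation for a DLRMv2 reference test run on CPU, with tags and flags assumed from the MLPerf inference automation docs rather than taken from this repro, would look roughly like:

```bash
# Representative (assumed) CM command for a DLRMv2 reference test run on CPU;
# the actual command used in this issue is in the attached cm-repro tarball.
cm run script --tags=run-mlperf,inference \
    --model=dlrm-v2-99 --implementation=reference --framework=pytorch \
    --device=cpu --scenario=Offline --execution_mode=test --quiet
```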