FBGEMM version mismatch on ARM #304

Open
ayanchak1508 opened this issue Sep 27, 2024 · 13 comments
@ayanchak1508

I was trying to run the DLRMv2 benchmark of MLPerf Inference on an ARM server using the instructions here.

I run into an issue when the tool tries to install torchrec==0.3.2.
torchrec==0.3.2 requires fbgemm-gpu==0.3.2, but fbgemm-gpu only introduced ARM support starting from v0.5.0: https://download.pytorch.org/whl/cpu/fbgemm-gpu/

I tried two alternate approaches:

  1. Build fbgemm-gpu v0.3.2 from source. This does not work because it requires AVX-512 support, which is absent on ARM.
  2. Use a newer version of fbgemm-gpu (v0.5.0 or above), but the cm tool remains inflexible and keeps searching for v0.3.2.

Previously, I did run the benchmark without any problems on ARM (without using the cm tool) using newer versions of fbgemm-gpu. (Note that I did need to use fbgemm-gpu-cpu too)

Command to reproduce the issue:

cm run script --tags=run-mlperf,inference,_r4.1-dev --model=dlrm-v2-99.9 --implementation=reference --framework=pytorch --category=datacenter --scenario=Server --server_target_qps=10 --execution_mode=valid --device=cpu --quiet --repro

Error message:

ERROR: Could not find a version that satisfies the requirement fbgemm-gpu==0.3.2 (from versions: none)
ERROR: No matching distribution found for fbgemm-gpu==0.3.2

The repro folder and the log file are present in the attached tarball.
cm-repro.tar.gz

@arjunsuresh
Contributor

Hi @ayanchak1508, you can just remove the version requirement locally in this file, which should be under $HOME/repos/mlcommons@cm4mlops/script/:

https://github.com/GATEOverflow/cm4mlops/blob/mlperf-inference/script/app-mlperf-inference-mlcommons-python/_cm.yaml#L1129
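For example, a quick way to locate the pin in your local copy (assuming the 0.3.2 version string appears literally in that file):

grep -n "0.3.2" $HOME/repos/mlcommons@cm4mlops/script/app-mlperf-inference-mlcommons-python/_cm.yaml

and then delete or relax the version line it points to.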

We have never had success using a higher version of fbgemm with the available inference implementation. If you can share the exact versions that worked, we can test them.

@ayanchak1508
Author

Thanks for the quick reply!
Yes, indeed, after changing the version it seems to be working.

These are the versions (that changed from the default) that work for me:
fbgemm_gpu==0.8.0+cpu
fbgemm_gpu-cpu==0.8.0
torch==2.4.0
torchrec==0.8.0

I have attached the full requirements.txt file in case it's needed:
requirements.txt
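Roughly, the install sequence that worked for me looks like this (fbgemm-gpu comes from the PyTorch CPU wheel index; exact flags may vary on other setups):

pip install torch==2.4.0 torchrec==0.8.0
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cpu/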

I sometimes run into a bus error (core dumped) afterward, but that seems to be a memory-capacity issue unrelated to the toolchain/benchmark?

@arjunsuresh
Contributor

Thanks a lot, @ayanchak1508. Let me check that. This issue might help with the bus error.

@arjunsuresh
Contributor

Yes, with PyTorch 2.4 we could use fbgemm_gpu==0.8.0 and it worked fine. We have now removed the version dependency in the CM script. You can just do cm pull repo and it should be visible.
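For example (using the repo name from the local path mentioned above; adjust if your registered repo alias differs):

cm pull repo mlcommons@cm4mlops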

@arjunsuresh
Contributor

Just to add: ulimit=9999 was not enough to run 1000 inputs. I think it will be incredibly hard to do a full run of 204800 inputs using the current reference implementation on CPUs.

@ayanchak1508
Author


Thanks a lot for the quick updates!

I did a fresh, clean setup to see the effects. I have two observations:

  1. pip doesn't automatically know where to find fbgemm-gpu for ARM; it needs to be installed via pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cpu/
  2. I actually ran into more dependency conflicts this time, and the benchmark started complaining about modules and functions it couldn't find (such as ModuleNotFoundError: No module named 'fbgemm_gpu.split_embedding_configs').

I'm not sure if I'm doing anything wrong, but if I create a new virtual environment and use the requirements file I posted earlier, the benchmark runs without problems. Maybe this is an ARM-specific problem?
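Concretely, the clean setup that works for me is roughly the following (the venv name is arbitrary, and the extra index is what I use to get the +cpu fbgemm-gpu wheel on ARM):

python3 -m venv dlrm-arm
source dlrm-arm/bin/activate
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu/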

Regarding the bus error problem, thank you again for the references. Is there any way to use the debug dataset or limit the maximum number of inputs, i.e., deviate from the official submission rules? (Of course, I understand it wouldn't count as a valid submission; I'm just interested in the model performance.)

I guess one possible solution could be to edit the conf file manually, something like the snippet at the end of this comment, but is there a better way?
(Sorry for bringing the bus error into this issue; we can move it to a separate issue if needed.)
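For illustration, the kind of manual edit I have in mind is roughly this user.conf snippet (the key names and values are my assumption based on the LoadGen conf format, and this would obviously not be a valid-submission configuration):

# cap the Server run at a small query count / short duration for a quick performance check
dlrm-v2.Server.min_query_count = 100
dlrm-v2.Server.min_duration = 60000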

@arjunsuresh
Contributor

For 1, maybe the problem is with the .whl file?

"but if I create a new virtual environment and use the requirements file I posted earlier, the benchmark runs without problems."

Is it on the same ARM machine? If so, you can try using a venv for the CM flow as well, as follows:

cm run script --tags=install,python-venv --name=mlperf
export CM_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"

For the bus error - what's the available RAM on the system?

@ayanchak1508
Author

Sorry, I should have been more specific. The runs are in a clean, empty Docker container (ubuntu:22.04) on an ARM server.

I created two Python venvs (in the same container): one for installing packages through the CM-based flow and one for installing packages from the requirements file. I didn't use the command you mentioned; I simply created a normal Python venv as described here: https://docs.mlcommons.org/inference/install/ and ran the CM commands for the benchmark there. Does the command you mentioned do something more?

For the bus error, the RAM is not huge, about 250 GB (the Docker container has no resource constraints). I faced a similar problem some time back when I processed the dataset myself and had to move to a different machine with 512 GB RAM. So I understand it may not be big enough to run the entire dataset, but it should be fine at least for the debug dataset?

@arjunsuresh
Contributor

Thank you.

Yes, the commands are a bit different. CM is a Python package, and when you use a venv for CM, it gets installed in that venv. But when you run any workflow using CM, the flow can pick up any available Python on the system unless we force one using "cm run script --tags=get,python" and making the appropriate selection. The command I shared is the safer option as long as the name used is new.

Coming to 256 GB, it should be good enough. We have comfortably run a full DLRMv2 on 192 GB. It even worked on 64 GB, but we had to use a lot of swap space.

I believe your problem could be the shm size, since Docker is used. Are you explicitly setting the shm size during docker run? We typically set a 32 GB shm size for DLRM.
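For example, the only relevant addition is the --shm-size flag; the rest of the docker run invocation is whatever you already use to start the container:

docker run --shm-size=32gb -it ubuntu:22.04 bash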

@ayanchak1508
Author

Thank you very much for the clarification!

I did not set the shm size, and the default seems to be 64MB, much smaller than the 32GB you mentioned.
I will try it out (both using the command you mentioned and increasing the shm size), and get back to you.

Thanks once again for all the quick help.

@arjunsuresh
Contributor

Sure, @ayanchak1508. Just a correction to what I said earlier: the 64 GB system where we ran DLRMv2 used GPUs, not CPUs. On CPUs we could only do a test run of 10 inputs on 192 GB.

@ayanchak1508
Author

Update:

  1. Increasing the shm size to 32G fixes the bus error problem, thank you! I can now run the benchmark, albeit with a very low QPS.
  2. Using the CM venv flow as you mentioned before doesn't help; it runs into the same problems:
ImportError: cannot import name 'DLRM_DCN' from 'torchrec.models.dlrm' (/root/CM/repos/local/cache/b1d060ef5c0c4217/mlperf/lib/python3.10/site-packages/torchrec/models/dlrm.py)
ModuleNotFoundError: No module named 'fbgemm_gpu.split_embedding_configs'

These are the packages it installs in the mlperf venv: current.txt
Doing a diff against the requirements file I posted before and then manually installing the correct package versions in the mlperf venv solves the problem:

pip install torch==2.4.0 torchrec==0.8.0
pip uninstall fbgemm-gpu
pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cpu/

I am not sure why I had to reinstall the same version of fbgemm-gpu, but otherwise it runs into the ModuleNotFoundError.

@arjunsuresh
Contributor

Sorry, @ayanchak1508, I missed replying to you. We now have a GitHub Action for the DLRMv2 CPU run, and you can see the logs. The recent ones are failures due to a change in the inference code, which is now fixed. Please let us know if your problems are solved.
