Allows CPU-based execution #235
Conversation
I'm not sure why I got this error: `jaxlib.xla_extension.XlaRuntimeError: UNIMPLEMENTED: unsupported operand type BF16 in op dot`. I'm using a Xeon 5320 + 1 TB RAM.
I assume you included my changes in run.py too, and changed `USE_CPU_ONLY = False` to `USE_CPU_ONLY = True`? Hopefully this repository isn't abandoned, but it doesn't seem like anyone is maintaining it anymore. You might be better off running grok-1 in llama.cpp if JAX is crashing for you.
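Regarding the BF16 error above: some XLA CPU builds lack a bfloat16 `dot` kernel, so one possible workaround (not part of this PR; a sketch under that assumption) is to upcast bfloat16 weights to float32 before running, at the cost of roughly double the memory for those tensors:

```python
import jax
import jax.numpy as jnp

def upcast_bf16(params):
    """Cast any bfloat16 leaves of a parameter pytree to float32.

    Roughly doubles the memory for those tensors, but sidesteps a
    missing bf16 'dot' kernel on CPU backends.
    """
    def cast(x):
        if hasattr(x, "dtype") and x.dtype == jnp.bfloat16:
            return x.astype(jnp.float32)
        return x
    return jax.tree_util.tree_map(cast, params)
```

Something like `params = upcast_bf16(params)` right after the checkpoint load would be the hypothetical call site.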
For all those who read this and are struggling but want to run this model once: here is an article on how I managed to get it running for less than $10. If you want to test things, you might be better off using the more expensive GCP version, because it offers the possibility of being stopped, and then you only pay for storage. I hope someone finds it helpful. Article:
Adds CPU execution to the grok-1 model demo.
VERY SLOW!
No one should process real-world workloads this way.
This is only meant for early dev work by those who don't have 8x 40GB GPUs.
```shell
pip install -r requirements-cpu.txt
sed -i 's/USE_CPU_ONLY = False/USE_CPU_ONLY = True/' run.py
python run.py
```
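For context, here is a minimal sketch of what a `USE_CPU_ONLY` toggle like this typically has to do in JAX (the exact run.py changes in this PR may differ): force the CPU backend and present it as 8 logical devices, before JAX is imported, so the model's 8-way sharding still has devices to map onto:

```python
import os

USE_CPU_ONLY = True  # flipped by the sed command above

if USE_CPU_ONLY:
    # Must run before JAX is imported: select the CPU backend and
    # split it into 8 logical devices for the 8-way sharded params.
    os.environ["JAX_PLATFORMS"] = "cpu"
    os.environ["XLA_FLAGS"] = (
        os.environ.get("XLA_FLAGS", "")
        + " --xla_force_host_platform_device_count=8"
    )

import jax
print(jax.devices())  # expect 8 CPU devices
```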
Still requires: several hundred gigabytes of free RAM (peak usage is over 600GB by default; see Note 2 below).
Even on a 72-core Xeon server, these runtimes can require monk-like patience.
So the point isn't to run this end-to-end all day.
It's for developers with high-memory workstations who would rather get this code running slowly than not at all.
Hopefully someone uses this CPU-only workaround early on to bootstrap grok-1 into a more performant model that eventually becomes accessible to a larger pool of devs.
Note: Executing this on most CPUs will emit a series of false warnings about the 8 CPU sub-processes being "stuck". These messages come from a hardcoded warning within TensorFlow that doesn't appear to be tunable or suppressible.
Note 2: If memory usage swells too high, comment out this single line below in checkpoint.py. This reduces peak memory usage from >600GB to closer to ~320GB. The downside is a slightly slower initial load. This "copy_to_shm" load strategy is likely a good time-to-memory trade-off on xAI's servers, but may not be on your workstation if it triggers an OOM.
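To illustrate the trade-off Note 2 describes, here is a hypothetical sketch (not the actual checkpoint.py code; the function name and pickle format are assumptions): staging the file into /dev/shm makes the subsequent read fast, but /dev/shm is RAM-backed, so the staged copy and the deserialized tensors coexist at the peak, which is roughly where the >600GB vs ~320GB difference comes from.

```python
import pickle
import shutil
import tempfile

def load_checkpoint_shard(path: str, copy_to_shm: bool = True):
    """Hypothetical loader illustrating the time-vs-memory trade-off.

    copy_to_shm=True : stage the file into /dev/shm (RAM-backed tmpfs)
                       before unpickling. Faster on servers with slow
                       disks, but the staged copy and the deserialized
                       tensors coexist at peak, roughly doubling RAM use.
    copy_to_shm=False: unpickle straight from disk. Slower initial
                       load, but a much lower peak memory footprint.
    """
    if copy_to_shm:
        with tempfile.NamedTemporaryFile(dir="/dev/shm") as staged:
            shutil.copyfile(path, staged.name)
            with open(staged.name, "rb") as f:
                return pickle.load(f)
    with open(path, "rb") as f:
        return pickle.load(f)
```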