Allows CPU-based execution #235

Open · louiehelm wants to merge 1 commit into main

Conversation

@louiehelm (Author)

Adds CPU execution to the grok-1 model demo.

VERY SLOW!

No one should process real-world workloads this way.

This is only meant for early dev work by those who don't have 8 x 40GB GPUs.

pip install -r requirements-cpu.txt
sed -i 's/USE_CPU_ONLY = False/USE_CPU_ONLY = True/' run.py
python run.py
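
For context, the sed line just flips a flag in run.py. Below is a rough, hypothetical sketch of what a CPU-only switch like this typically does in JAX; it is illustrative only, not necessarily the exact diff in this PR:

# Illustrative sketch only -- not necessarily the exact change in this PR.
# USE_CPU_ONLY is the flag the sed command above flips to True.
import os

USE_CPU_ONLY = True

if USE_CPU_ONLY:
    # Must be set before jax is imported so the CPU backend is selected.
    os.environ["JAX_PLATFORMS"] = "cpu"
    # Hide any GPUs so nothing accidentally lands on them.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

import jax
print(jax.devices())  # expect something like [CpuDevice(id=0)]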

Still requires:

  • 384GB RAM
  • 1.5 minutes to load into memory
  • 1.1 hours to "compile" grok-1 model
  • 4.2 hours to sample first inference request

Even on a 72-core Xeon server, these runtimes require monk-like patience.

So the point isn't to run this end-to-end all day.

It's for developers with high-memory workstations who would rather get this code running slowly than not at all.

Hopefully someone uses this CPU-only workaround early on to bootstrap grok-1 into a more performant model that is eventually accessible to a larger pool of devs.
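
One way to soften the ~1.1-hour compile listed above on repeat runs is JAX's persistent compilation cache. This is an untested suggestion, not part of this PR, and it assumes a JAX version whose cache supports the CPU backend (older releases only cached TPU/GPU executables):

import jax

# Cache compiled XLA executables on disk so later runs can reuse them.
jax.config.update("jax_compilation_cache_dir", "/tmp/jax_cache")  # any writable directory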

Note: Executing this on most CPUs will emit a series of false warnings about the 8 CPU sub-processes being "stuck". These messages come from a hardcoded warning inside TensorFlow and don't appear to be tunable or suppressible.

Note 2: If memory usage swells too high, comment out the single copy_to_shm line shown below in checkpoint.py (and read from path directly). This reduces peak memory usage from >600GB to closer to ~320GB, because copy_to_shm stages the checkpoint in /dev/shm, a RAM-backed tmpfs, so during load the weights are effectively resident twice. The downside is a slightly slower initial load. The "copy_to_shm" load strategy is likely a good time-to-memory trade-off on xAI's servers, but may not be on your workstation if it triggers OOM.

def fast_unpickle(path: str) -> Any:
    # with copy_to_shm(path) as tmp_path:  # commented out: skip staging the checkpoint in /dev/shm
    with open(path, "rb") as f:  # read the checkpoint file in place instead
        return pickle.load(f)

@trholding

Still requires:

  • 384GB RAM
  • 1.5 minutes to load into memory
  • 1.1 hours to "compile" grok-1 model
  • 4.2 hours to sample first inference request

Could you add your system specs here?

I'll add it to: #42 and #183

@louiehelm (Author)

Still requires:

  • 384GB RAM
  • 1.5 minutes to load into memory
  • 1.1 hours to "compile" grok-1 model
  • 4.2 hours to sample first inference request

Could you add your system specs here?

I'll add it to: #42 and #183

CPU: 2 x Intel Xeon E5-2697 v4
Total RAM: 1.5 TB

@louiehelm requested a review from robvdl on March 27, 2024 at 20:45
@inkoil

inkoil commented Apr 3, 2024

I'm not sure why I got this error:
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:State sharding type: <class 'model.TrainingState'>
INFO:rank:(1, 256, 6144)
INFO:rank:(1, 256, 131072)
INFO:rank:Loading checkpoint at ./checkpoints/ckpt-0
INFO:rank:(1, 8192, 6144)
INFO:rank:(1, 8192, 131072)
Output for prompt: The answer to life the universe and everything is of course
INFO:runners:Precompile 1024
INFO:rank:(1, 1, 6144)
INFO:rank:(1, 1, 131072)
INFO:runners:Compiling...
INFO:rank:(1, 1, 6144)
INFO:rank:(1, 1, 131072)
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

jaxlib.xla_extension.XlaRuntimeError: UNIMPLEMENTED: unsupported operand type BF16 in op dot

I'm using a Xeon 5320 + 1TB RAM and installed the software using requirements-cpu.txt.

@louiehelm (Author)

I'm not sure why I got this error:

...

jaxlib.xla_extension.XlaRuntimeError: UNIMPLEMENTED: unsupported operand type BF16 in op dot

I'm using a Xeon 5320 + 1TB RAM and installed the software using requirements-cpu.txt.

I assume you included my changes in run.py too? And changed "USE_CPU_ONLY = False" to "USE_CPU_ONLY = True"?

Hopefully this repository isn't abandoned, but it doesn't seem like anyone is maintaining it anymore.

You might be better off running grok-1 in llama.cpp if JAX is crashing for you.
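
If you'd rather stay in JAX than switch to llama.cpp, one generic workaround for the missing BF16 dot kernel (untested here, not part of this PR) is to upcast the bf16 checkpoint arrays to float32 after loading. Memory use roughly doubles, and `params` below is just a stand-in name for the loaded weight pytree:

import jax
import jax.numpy as jnp

def upcast_bf16_to_fp32(params):
    # Walk the parameter pytree and cast any bfloat16 leaf to float32,
    # so the CPU dot kernels only ever see a supported dtype.
    return jax.tree_util.tree_map(
        lambda x: x.astype(jnp.float32) if x.dtype == jnp.bfloat16 else x,
        params,
    )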

@pafend

pafend commented Apr 7, 2024

For all those who are reading this and struggling but want to run this model once, here is an article on how I managed to get it running for less than $10.

If you want to test things, you might be better off using the more expensive GCP version because it offers the possibility of stopping the instance, so you then only pay for storage.

I hope someone finds it helpful.

Article:
https://twitter.com/PascalBauerDE/status/1776792056452546822
Fork:
https://github.com/pafend/grok-1-brev
