Performance Optimization and Benchmarking #2

Open
githuba9f5404 opened this issue Oct 5, 2024 · 6 comments

@githuba9f5404

First, let me preface this with the forewarning that I barely understand what I'm doing, but I have been successfully running distributed edge LLMs locally for a while now. Apologies in advance for any technical inaccuracies below. So far I am compiling your project from scratch, intending to run it on an array of 8 Raspberry Pi 4B 8GB (full hardware build can be found here: b4rtaz/distributed-llama#122). I have currently only gotten as far as running Llama 3.1 8B Instruct on just two Raspberry Pi nodes, at an abysmal speed (estimating well over 1 minute per token), but it still ran, which is something I have not previously been able to make work on just two nodes. This is extremely impressive. That is to say, however, that I am not very deep into using your project yet.

So far for distributed edge LLMs, I have primarily been using the "distributed llama" project (found here: https://github.com/b4rtaz/distributed-llama), which relies on tensor parallelism and quantized models (some as low as 4 bits). With some overclocking on the Raspberry Pi array, I have been able to run Llama 3.1 8B Instruct at 2.14 tokens/sec (using a highly quantized q4_0 version) on the 8 nodes. I'm also able to run the new Llama 3.2 1B at full F32 at a slightly slower 2.04 tokens/sec on the same rig (using the same distributed llama project). There are a lot of similarities between the two projects, but I believe your approach is superior: with your project I am already able to run, at full precision, models that are too large for two nodes on that project (due to memory requirements, even with the harsh 4-bit quantization). This is miraculous at any speed. I do think the distributed llama project is a bit more mature, and there might be a few takeaways from it that you could adopt to increase the capabilities of your own (humbly offered). A couple of quick early observations about your project:

  • TPI-LLM does not appear to support efficient use of multi-core CPUs (specified by a number of threads; this is --nthreads in the Distributed Llama project). Is there a way to add multi-core support? I'm not seeing all cores in use when the project is 'thinking'.
  • TPI-LLM does not appear to make efficient use of the total available processing power of the CPU. With the project I'm used to running, my experience is first a 3-4 minute delay (as the work is spread out across the nodes, which happens each time), followed by consistent 100 percent CPU usage on all nodes as the array 'thinks' and 'responds'. With your project I do see the processor hitting 100 percent, but it doesn't stay there for long; I haven't observed more than a second or two at this point. This implies to me that the processing power is not being fully utilized.
  • How were the benchmarks calculated in your paper? Is there a quick 'benchmark' function that I'm missing? The project I'm used to has a simple 'inference' feature that allows for quick and easy benchmarking. Such a feature would help with quick comparison between minor configuration changes (such as overclocking or other optimizations); a minimal sketch of the kind of timing loop I have in mind follows this list.
  • While I understand the reasoning for avoiding quantization generally (F16 would likely not yield a performance gain, since it's not natively supported by many CPUs and is likely to be upcast to F32 anyway), the INT8 format should give reasonable accuracy, is natively supported by most CPUs, and should allow a significant improvement over full precision on most lower-powered edge devices. This would also allow much faster inference and the ability to run even larger models on lower-powered devices (Llama 3.1 405B Instruct might be possible on my 8 x Raspberry Pi array with this change, using your project). The format is also natively supported by PyTorch. How difficult would it be to implement INT8 support for this project?
  • Is there a direct 'chat' feature or an 'API' feature planned? Currently I'm achieving 6.13 tokens/sec for the smallest model I'm running (Llama 3.2 1B Instruct q4_0) using the distributed llama project, which is more than enough for quick back-and-forth communication, although I still prefer interacting with the Llama 3.1 8B Instruct q4_0 model at the slower 2.14 tokens/sec due to the increased depth of conversation it allows. Unless I'm missing this feature (definitely possible), there does not currently appear to be any way to go back and forth with your project.
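For reference, here is a minimal sketch of the kind of timing loop I have in mind (the `generate_one_token()` callable is a hypothetical stand-in for a single decoding step of whatever backend is being measured):

```python
import time

def benchmark(generate_one_token, n_tokens=16):
    """Time n_tokens decoding steps and report tokens/sec and s/token."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_one_token()  # hypothetical: one decoding step of the backend under test
    elapsed = time.perf_counter() - start
    print(f"{n_tokens / elapsed:.2f} tokens/sec "
          f"({elapsed / n_tokens:.2f} s/token on average)")
```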

I have read through your paper, and while I understand the gist of how you are achieving this, I'm honestly not yet at a level where I truly understand the entire process, or LLMs in general. I'm just a guy with an array of eight Raspberry Pis who wants to run LLMs locally on them. My hardware is pretty dedicated to running edge LLMs, so if there is anything I can do to help with your project, please let me know.

Lastly, thank you for continuing to advance and push the boundaries of edge AI. As a novice hobbyist, I greatly appreciate your expert efforts.

@Lizonghang
Owner

Hello @githuba9f5404, thanks for your recognition of our work; I largely agree with you. Although TPI-LLM is proven to be faster than benchmarks such as Transformers, Galaxy, etc., its speed is still not as fast as expected. Below I will share some of my experience in Q&A form; I hope it will be helpful to you.

  1. Why is distributed-llama so fast?

As you mentioned, distributed-llama has done a lot of work on computation, for example Q4 quantization, and its code is implemented in pure C, which requires solid engineering work. In contrast, at this stage the main purpose of this project is to validate the effectiveness of tensor parallelism and memory scheduling to support and speed up 70B-scale models, so no quantization or computational optimization is used. In addition, TPI-LLM is built on PyTorch and Transformers, and most of the code is in Python, which is also one of the reasons it is slow.

  2. Is there a way to add multi-core support?

Yes. Since the computing engine of TPI-LLM is PyTorch, you can use any of the acceleration techniques available in PyTorch and Transformers to speed up the computation, e.g., torch.set_num_threads(n).
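As a minimal sketch (assuming a Raspberry Pi 4B with 4 physical cores; adjust the count for your hardware):

```python
import torch

# Intra-op parallelism: how many threads each matmul/attention kernel may use.
torch.set_num_threads(4)

# Inter-op parallelism must be configured before any parallel work starts,
# otherwise PyTorch raises a RuntimeError.
torch.set_num_interop_threads(4)

print(torch.get_num_threads(), torch.get_num_interop_threads())
```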

  3. Why is my TPI-LLM inference so slow? (Also see the first question)

That said, low CPU utilization is not necessarily due to missing multithreading; it may also be caused by other factors, such as communication and disk I/O. For communication, tensor parallelism requires frequent allreduce operations during each round of inference. If the link latency between your devices is high (e.g., > 10 ms), the cost of these allreduce operations increases significantly. Since communication blocks computation, CPU utilization drops to 0% during each allreduce.
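For reference, a minimal sketch of what such a blocking allreduce looks like with PyTorch's gloo backend (the tensor shape is only an illustration; TPI-LLM's actual partitioning differs):

```python
import torch
import torch.distributed as dist

# Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set in the environment.
dist.init_process_group(backend="gloo", init_method="env://")

partial = torch.randn(1, 4096)  # this node's partial activation for one layer
dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # blocks until every node contributes
# While this call waits on the network, the CPU sits idle,
# which shows up as ~0% utilization on high-latency links.

dist.destroy_process_group()
```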

For disk I/O, the magic that lets TPI-LLM support arbitrarily large models is the sliding window memory scheduler, which means there is frequent disk I/O to load weights from the SSD/HDD into CPU memory. If your split files are placed on an HDD, this is a disaster, because HDD I/O latency is very high. Second, even if your files are placed on an SSD, a misconfigured Docker setup can make the actual disk I/O throughput very low. You can use the fio tool on Linux to verify disk I/O throughput. If you unfortunately encounter either of these problems, memory scheduling will be very slow. Although part of the disk I/O latency is overlapped with computation and communication, inference still ends up waiting for weight loading.
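If fio is not convenient, a rough sequential-read estimate can also be taken in Python (a minimal sketch; the file path is a placeholder for one of your split weight files, and note that the OS page cache can inflate the number on a repeated run):

```python
import time

def read_throughput_mb_s(path, block_size=1 << 20):
    """Roughly estimate sequential read throughput of a file in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6

# Example (hypothetical path):
# print(f"{read_throughput_mb_s('/mnt/ssd/llama-70b/shard_00.bin'):.1f} MB/s")
```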

  4. How were the benchmarks calculated in the paper? Why did I choose them?

The benchmarks are very popular, so I did not include them in this project; you can find them below. All of them are open source and easy to reproduce.

In my experience, none of them supports larger models. For example, I used 4 laptops to run them. Transformers and Accelerate encounter OOM when running 7B models. Transformers uses swap to exchange memory data, but the exchange is frequent and the swap space is limited, so it is very slow and prone to OOM. Big Model Inference in Accelerate should, theoretically, be able to support very large models, but it fails because it needs to load the full weight files when slicing the weights, causing OOM before inference even starts. As for Galaxy, it distributes model weights across multiple devices but does not use memory scheduling, so the model size is limited.

The reason I chose Transformers, Accelerate, and Galaxy is simple: TPI-LLM, Galaxy, and Accelerate are all built on Transformers; Galaxy uses tensor parallelism + sequence parallelism, and Accelerate (Big Model Inference) also implements a memory scheduling method. The comparison is fairer because all the benchmarks are built on the same framework.

However, as you said, there is also a lot of very good work out there, such as distributed-llama, which I learned about through your recommendation. Actually, TPI-LLM and distributed-llama should complement each other well: distributed-llama focuses on computation, while TPI-LLM focuses on communication and memory scheduling. In addition, the pure C implementation of distributed-llama has a natural advantage in code execution efficiency. I would be very happy if Bart Tadych included our memory scheduler in their repo.

  5. Can we support INT8 in TPI-LLM?

I agree with you. Projects implemented in C/C++, such as llama.cpp and distributed-llama, implement the INT8 format for both computation and storage. However, TPI-LLM is implemented on PyTorch, so the answer to this question depends on whether PyTorch supports INT8. The answer may be frustrating: PyTorch currently seems to only support INT8 on CUDA. I may have missed something, so if you find a solution, please suggest it to me.

  6. Is there a direct 'chat' feature?

TPI-LLM is currently in its infancy. It is only used for research purposes, to verify the feasibility of the improvements, so it does not yet consider user-experience optimizations. As you mentioned, the primary issue facing TPI-LLM is that its inference speed, although much faster than the benchmarks, is still not as fast as expected, so a chat mode is not an urgent matter at present. Once this problem is solved, more work to improve the user experience will follow.

  7. What can I do now?

I am not sure what your final goal is. Do you want to continue exploring optimization techniques for distributed LLM serving? Or do you simply want to run a smaller LLM locally at high speed, such as Llama 3/3.1/3.2 8B? If it is the latter, I would recommend llama.cpp and llamafile.

llama.cpp: https://github.com/ggerganov/llama.cpp
llamafile: https://github.com/Mozilla-Ocho/llamafile

llama.cpp is very fast at inferencing LLMs on user terminal devices, especially on Mac M1/M2. On my Mac M1, it can reach a speed of about 20 tokens/s with the help of Metal graphics, and it supports a rich set of quantized models in the GGUF format. You can use both the integrated GPU and the CPU on a single device to support fast inference of 8B-scale models. llama.cpp also provides a chat GUI and an API to meet your needs. llamafile adds further optimizations, including embedding the model weights in the executable file, which can provide even faster inference than llama.cpp.
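As a quick way to try it from Python, the llama-cpp-python bindings wrap llama.cpp (a minimal sketch; the model path and thread count are placeholders, not part of this project):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load a quantized GGUF model; n_threads should match your physical core count.
llm = Llama(model_path="models/Meta-Llama-3-8B-Instruct.Q4_0.gguf", n_threads=8)

out = llm("Q: What is tensor parallelism?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```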

  8. What is our future plan?

As mentioned before, TPI-LLM is highly dependent on Python and PyTorch, which limits its code execution efficiency and quantization support. So we are migrating it to llama.cpp and putting more effort into the inference speed and parallelism mechanism. The improved version is expected to support arbitrarily large LLMs and provide an inference speed of about 5 tokens/s. If we make it, I will be happy to announce it to you.

@githuba9f5404
Author

Thank you so very much for the response; it will take me a while to digest completely. I am still just beginning my journey into edge AI and am at the point where I'm just enjoying playing with different projects, trying different things, and learning. For now, my path is one of simple curiosity, although I do have a secret goal of one day being able to run Llama 3.1 405B on my 8 x Raspberry Pi array. To respond specifically:

  1. Acknowledged, thank you.
  2. I am still learning PyTorch, so I will have to consult the documentation. Appreciate the point in the right direction.
  3. To be clear, TPI-LLM isn't necessarily slower than distributed-llama; it just appears non-optimized, which, as you have stated, has not been a focus. Apologies if my post came off as saying things were slow; that was not my intention at all. Running 3.1 8B at full precision on two Raspberry Pi nodes is INSANE at any speed, and now that I am using all 8 nodes, the speed appears very reasonable for what it is (estimating about 1 token a second, although that is a very unscientific guess). Link latency should be near instantaneous on my setup, as it's a fully isolated system on a single 8-port gigabit switch. Disk read/write might be a factor, as all nodes are reading from the same 1TB SSD, both for the operating system (network booting) and for model distribution; I will have to look into it further. Most likely, the performance issues with just 2 nodes were a result of the need to constantly read the SSD with the sliding window memory scheduler, as you mentioned. I also have a separate array of four Raspberry Pi nodes where each has its own SSD; I will see if I can run a 4-node test on both architectures to determine whether there is a performance improvement, likely using the smaller Llama 3.2 3B Instruct model for testing. Either way, this is very amazing work, thank you.
  4. Apologies if I was unclear. I'm looking more for "I would like to run your software and have output that contains information about the speed things ran at, so that I can make hardware/software configuration changes, rerun the software, and see how things compare." I think this might be supported in PyTorch, so I will follow that thread for now. I appreciate you taking the time to educate me on the other options; I will be looking into each, and it will be helpful on my journey of learning and understanding.
  5. I believe PyTorch supports INT8 on x86 CPUs with FBGEMM (https://pytorch.org/FBGEMM/), and INT8 on ARM with QNNPACK (https://engineering.fb.com/2018/10/29/ml-applications/qnnpack/), which I think ended up integrated directly. I also believe that PyTorch supports INT8 on CPUs within its native quantization framework during Post-Training Quantization (PTQ); a rough sketch of what I mean appears after this list. I want to again say that I'm still at the beginning stages of learning all this, and all of that could be just wrong.
  6. It might be easier to simply hook into a known API, similar to Open WebUI (https://github.com/open-webui/open-webui -- used by ollama -- https://github.com/ollama/ollama), to allow your software to work as a distributed backend only, rather than reinventing the wheel with a new UI of your own.
  7. Wonderful suggestions, and I will pursue both, thank you. I am very much enjoying the challenge of trying to get the largest models possible running on the worst hardware (my laughable Raspberry Pi array), and (as soon as 3.1 70B has finished downloading) your project may indeed be the winner there. Previously, Llama 2 13B was the largest model I was able to run successfully on that hardware, and I have every indication that the 70B model should be able to run.
  8. Would be very interested in anything and everything you do in the future, and again thank you.
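On point 5, here is a rough sketch of the PyTorch dynamic post-training quantization path I had in mind (I have not verified this inside TPI-LLM itself; the toy model and the backend choice are only illustrations):

```python
import torch
import torch.nn as nn

# On ARM (e.g., Raspberry Pi), the QNNPACK backend provides INT8 kernels;
# on x86 the default backend is FBGEMM. Availability depends on the PyTorch build.
torch.backends.quantized.engine = "qnnpack"

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Dynamic PTQ: weights are stored as INT8, activations are quantized on the fly.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 4096)
print(qmodel(x).shape)
```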

@Lizonghang
Owner

Hi, I tried distributed-llama today on my testbed with 2 Mac laptops (1 MacBook Air with 4 cores and 8 GB of memory and 1 MacBook Pro M1 with 8 cores and 8 GB of memory), connected via a local Wi-Fi router. The model used by distributed-llama is Llama-3-8B-Q40-Distributed-Llama (INT4), and the model used by TPI-LLM is Meta-Llama-3-8B-Instruct (FP32). The results are as follows:

distributed-llama:
🔶 G 9867 ms I 162 ms T 9705 ms S 1917200 kB R 272 kB how
🔶 G 10168 ms I 154 ms T 10014 ms S 272 kB R 272 kB are
🔶 G 10828 ms I 158 ms T 10670 ms S 272 kB R 272 kB you
🔶 G 10859 ms I 149 ms T 10709 ms S 272 kB R 272 kB ?
🔶 G 10996 ms I 151 ms T 10843 ms S 272 kB R 272 kB i
🔶 G 9798 ms I 153 ms T 9643 ms S 272 kB R 272 kB am
🔶 G 9566 ms I 152 ms T 9412 ms S 272 kB R 272 kB very
🔶 G 9687 ms I 172 ms T 9514 ms S 272 kB R 272 kB well
🔶 G 9476 ms I 160 ms T 9315 ms S 272 kB R 272 kB .
🔶 G 9601 ms I 157 ms T 9443 ms S 272 kB R 272 kB i
Generated tokens: 10
Avg tokens / second: 0.10
Avg generation time: 10084.60 ms
Avg inference time: 156.80 ms
Avg transfer time: 9926.80 ms

TPI-LLM:
Generate a token in 11.33 seconds
Generate a token in 12.52 seconds
Generate a token in 12.54 seconds
Generate a token in 12.04 seconds
Generate a token in 12.48 seconds
Generate a token in 12.47 seconds
Generate a token in 12.21 seconds
Generate a token in 12.05 seconds
Avg generation time: 12.20 s

In such a smart-home case, multiple devices (e.g., laptops, tablets, phones) are connected to one or more routers to exchange their results, and the wireless transmission process incurs high link latency (Avg transfer time: 9926.80 ms). Therefore, even with distributed-llama, we still need 10 seconds to generate one token. And this is why Bartłomiej Tadych stated in his post that, "Tensor parallelism speeds up inference, but synchronization slows it down. The combination of these two factors will determine the final performance. If you have 8 devices and can connect them with a fast link, you will observe a significant speedup (synchronization over USB4 seems very promising here, you can achieve from 10 to 20 Gbps)." However, at the network edge, the link latency between devices is much higher than connecting them directly via USB or switches, so transfer delay becomes a significant bottleneck.

Actually, TPI-LLM has a similar token latency to distributed-llama, with a difference of only 2 s/token, because distributed-llama uses Q4 quantization while TPI-LLM runs at full precision, without any performance loss from quantization. In addition, TPI-LLM supports larger-scale models with the help of the sliding window memory scheduler. distributed-llama has also done this, running Llama 3.1 405B (Q4 quantization), but it used 4 high-performance servers, each with 64 GB of memory and interconnected by a high-speed network, which, I think, is not very accessible for ordinary users who only have 1 low-priced host or 1-4 laptops. So distributed-llama may be more suitable for traditional data centers, with gigabit networks and abundant memory. This leads to differences in the application scenarios of TPI-LLM and distributed-llama.

@Lizonghang
Owner

I forgot to say, I also added comparisons between TPI-LLM and llama.cpp; please see the table in the README for details.

  1. If your device has plenty of graphics memory, llama.cpp is always the first choice.
  2. If your device has limited graphics memory and the model size is smaller than 8B, try llama.cpp first. If the model size is large, e.g., 13B, 34B, 70B, try TPI-LLM.
  3. If your device does not have any GPU (including integrated GPUs like Apple Metal graphics), but you have many such devices, TPI-LLM could be a good choice.
  4. If your devices have large memory and are interconnected by a fast network, distributed-llama could be the first choice.

@githuba9f5404
Author

I think that is why distributed-llama is fast on my isolated system (eight Raspberry Pi devices in a cluster on a single gigabit switch): the link latency is very low. The disk latency (since they are all pulling from the same SSD) is higher, which matters a lot more with TPI-LLM due to the sliding window memory scheduler.

I didn't realize one of the test beds was running on wireless; getting any throughput over that is actually very, very impressive!

I have news though! I was able to run Llama 3.1 70B Instruct AT FULL PRECISION on that environment (eight ~$70 Raspberry Pi devices) using TPI-LLM, which is just incredible! I have no idea what speed it's actually generating at, as I have not yet been able to determine how best to measure this, but it is running, which is so very impressive! Incredible work!

@githuba9f5404
Author

githuba9f5404 commented Oct 7, 2024

To clarify, this entire Raspberry Pi cluster cost about $800, which is less than a single 3090 graphics card and definitely less than the cost of a single MacBook Pro. It is incredible that this is able to run at all.

I believe an implementation of the sliding window memory scheduler in distributed-llama would be amazing.

I have a secondary Raspberry Pi cluster where each node has its own SSD, but I'm missing a USB-to-SATA cable that should arrive tomorrow. I'm excited to try running larger models on this rig to compare speed.
