Performance Optimization and Benchmarking #2
Hello, @githuba9f5404, thanks for your recognition of our work, and I deeply agree with you. Although TPI-LLM has been shown to be faster than baselines such as Transformers and Galaxy, its speed is still not as fast as expected. Below I will share some of my practice in the form of Q&A, and I hope it will be helpful to you.
As you mentioned, distributed-llama has done a lot of work on computation, for example Q4 quantization, and their code is implemented in pure C, which requires solid engineering. In contrast, at this stage the main purpose of this project is to validate the effectiveness of tensor parallelism and memory scheduling in supporting and speeding up 70B-scale models, so no quantization or computational optimization is used. In addition, TPI-LLM is built on PyTorch and Transformers, and most of the code is in Python, which is another reason why it is slow.
Yes, since the computing engine of TPI-LLM is PyTorch, you can use all the acceleration techniques in PyTorch and Transformers to speed up the computation, e.g., torch.set_num_threads(n).
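For example, a minimal sketch (the best thread counts depend on your CPU and on what else shares the machine, so treat the numbers as placeholders):

```python
import os
import torch

# Use the available cores for intra-op parallelism (matrix multiplications, etc.).
# os.cpu_count() reports logical cores; oversubscribing can hurt performance.
n_cores = os.cpu_count() or 1
torch.set_num_threads(n_cores)

# Optionally also control inter-op parallelism (independent operators run in parallel).
# This must be called early, before any parallel work has started.
torch.set_num_interop_threads(max(1, n_cores // 2))

print(torch.get_num_threads(), torch.get_num_interop_threads())
```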
However, low CPU utilization is not necessarily due to multithreading; it may also be caused by other factors, such as communication and disk I/O.

For communication, tensor parallelism requires frequent allreduce operations during each round of inference. If the link latency between your devices is high (e.g., > 10 ms), the cost of these allreduce operations increases significantly. Since communication blocks computation, CPU utilization drops to 0% during an allreduce.

For disk I/O, the magic that lets TPI-LLM support arbitrarily large models is the sliding window memory scheduler, which means there is frequent disk I/O to load weights from the SSD/HDD into CPU memory. If your split files are placed on an HDD, this will be a disaster because HDD I/O latency is very high. Even if your files are placed on an SSD, a wrong Docker setup can make the actual disk I/O throughput very low. You can check the actual disk throughput to rule this out.
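For example, a crude sanity check of sequential read throughput can be done directly in Python (the file path below is hypothetical; point it at one of your split weight files, and note that the OS page cache can inflate the number if the file was read recently):

```python
import time

# Hypothetical path: replace with one of the split weight files on the disk under test.
path = "/path/to/split_model/block_0.bin"
chunk = 64 * 1024 * 1024  # read in 64 MiB chunks

start = time.perf_counter()
total = 0
with open(path, "rb") as f:
    while True:
        data = f.read(chunk)
        if not data:
            break
        total += len(data)
elapsed = time.perf_counter() - start

print(f"Read {total / 1e6:.0f} MB in {elapsed:.2f} s "
      f"({total / 1e6 / elapsed:.0f} MB/s)")
```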
The baselines are very popular, so I didn't include them in this project. You can find them below; all of them are open source and easy to reproduce.
In my practice, none of them supports larger models. For example, I used 4 laptops to run them. Transformers and Accelerate run out of memory (OOM) on 7B models. Transformers uses swap to exchange memory pages, but the exchange is frequent and the swap space is limited, so it is very slow and prone to OOM. Big Model Inference in Accelerate should, in theory, support very large models, but it fails because it needs to load the full weight files when slicing the weights, and it runs OOM before inference starts. Galaxy distributes model weights across multiple devices but does not use memory scheduling, so the model size is limited.

The reason I chose Transformers, Accelerate, and Galaxy is simple: TPI-LLM, Galaxy, and Accelerate are all implemented on top of Transformers; Galaxy uses tensor parallelism + sequence parallelism, and Accelerate (Big Model Inference) also implements a memory scheduling method. The comparison is fairer because the baselines are built upon the same framework.

However, as you said, there is also a lot of very good work, such as distributed-llama, which I learned about from your recommendation. Actually, TPI-LLM and distributed-llama should complement each other well: distributed-llama focuses on computation, while TPI-LLM focuses on communication and memory scheduling. In addition, the pure C implementation of distributed-llama has a natural advantage in execution efficiency. I would be very happy if Bart Tadych included our memory scheduler in their repo.
I agree with you. Projects implemented in C/C++, such as llama.cpp and distributed-llama, implement the INT8 format in both computation and storage. However, TPI-LLM is built on PyTorch, so the answer to this question depends on whether PyTorch supports INT8. The answer may be frustrating: PyTorch currently seems to support INT8 only on CUDA. I may be missing something, so if you find a solution, please suggest it to me.
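For reference, here is a minimal sketch of PyTorch's dynamic INT8 quantization API, which does run on CPU for layers such as nn.Linear; whether it composes with TPI-LLM's tensor-parallel weight slicing and memory scheduler is untested, so treat it as a starting point rather than a solution:

```python
import torch
import torch.nn as nn

# A toy model standing in for one transformer block's linear projections.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Dynamically quantize Linear layers: weights are stored in INT8 and
# activations are quantized on the fly at inference time (CPU only).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
with torch.no_grad():
    y = quantized(x)
print(y.shape)
```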
TPI-LLM is currently in its infancy. It is only used for research purposes to verify the feasibility of the improvements, so it does not yet consider user-experience optimizations. As you mentioned, the primary issue facing TPI-LLM is that the inference speed, although much faster than the baselines, is still not as fast as expected, so a chat mode is not an urgent matter at present. Once this problem is solved, more work to improve the user experience will follow.
I am not sure what your final goal is. Do you want to continue exploring optimization techniques for distributed LLM serving, or do you simply want to run a smaller LLM locally at high speed, such as Llama 3/3.1/3.2 8B? If it is the latter, I would recommend llama.cpp and llamafile. llama.cpp: https://github.com/ggerganov/llama.cpp. llama.cpp is very fast at inferencing LLMs on user terminal devices, especially on Mac M1/M2. On my Mac M1, it can reach a speed of about 20 tokens/s with the help of the Metal GPU, and it supports a rich set of quantized models in the GGUF format. You can use both the integrated GPU and the CPU on a single device to support fast inference of 8B-scale models. llama.cpp also provides a chat GUI and an API to meet your needs. llamafile makes further optimizations, including packing the model weights into the executable file, which can provide faster inference than llama.cpp.
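For example, if you prefer to stay in Python, the third-party llama-cpp-python bindings wrap llama.cpp; this is only a sketch, and the GGUF file name below is a placeholder for whatever quantized model you download:

```python
from llama_cpp import Llama

# Placeholder model path; any GGUF file downloaded for llama.cpp should work here.
llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)

out = llm("Q: What is edge AI? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```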
As mentioned before, TPI-LLM depends heavily on Python and PyTorch, which limits its execution efficiency and quantization support. So we are migrating it to llama.cpp and putting more effort into inference speed and the parallelism mechanism. The improved version is expected to support arbitrarily large LLMs and provide an inference speed of about 5 tokens/s. If we make it, I will be happy to announce it to you.
Thank you so very much for the response; it will take me a while to digest it completely. I am still just beginning my journey into edge AI and am at the point where I'm just enjoying playing with different projects, trying different things, and learning. For now, my path is one of simple curiosity, although I do have a secret goal of one day being able to run Llama 3.1 405B on my 8 x Raspberry Pi array. To respond specifically:
Hi, I tried distributed-llama today on my testbed with 2 Mac laptops (a MacBook Air with 4 cores and 8 GB of memory, and a MacBook Pro M1 with 8 cores and 8 GB of memory), connected via a local Wi-Fi router. The model used by distributed-llama is Llama-3-8B-Q40-Distributed-Llama (INT4), and the model used by TPI-LLM is Meta-Llama-3-8B-Instruct (FP32). The results are as follows:

distributed-llama:

TPI-LLM:

In such a smart-home case, multiple devices (e.g., laptops, pads, mobile phones) are connected to one or more routers to exchange their results, and the wireless transmission incurs a high link latency (Avg transfer time: 9926.80 ms). Therefore, even with distributed-llama, we still need 10 seconds to generate one token. This is why Bartłomiej Tadych stated in his post: "Tensor parallelism speeds up inference, but synchronization slows it down. The combination of these two factors will determine the final performance. If you have 8 devices and can connect them with a fast link, you will observe a significant speedup (synchronization over USB4 seems very promising here, you can achieve from 10 to 20 Gbps)." However, at the network edge, the link latency between devices is much higher than when connecting them directly via USB or switches, so transfer delay becomes a significant bottleneck.

Actually, TPI-LLM has a token latency similar to distributed-llama's, with a difference of only 2 s/token, and this is because distributed-llama uses Q4 quantization. TPI-LLM instead runs in full precision, without any performance loss from quantization. In addition, TPI-LLM supports larger-scale models with the help of the sliding window memory scheduler. distributed-llama has also done this, with Llama 3.1 405B (Q4 quantization), but it uses 4 high-performance servers, each with 64 GB of memory and interconnected by a high-speed network, which, I think, is not so satisfying for ordinary users who only have 1 low-price host or 1-4 laptops. So distributed-llama may be more suitable for traditional data centers, with gigabit networks and abundant memory. This leads to differences in the application scenarios of TPI-LLM and distributed-llama.
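To make the synchronization cost concrete, here is a rough back-of-envelope estimate. This is only a sketch: the 32-layer count is for Llama-3-8B, the two-allreduces-per-layer figure assumes Megatron-style tensor parallelism, and the 10 ms latency is illustrative rather than measured on this testbed:

```python
# Rough per-token cost of allreduce synchronization under tensor parallelism.
n_layers = 32             # Llama-3-8B has 32 transformer layers
allreduces_per_layer = 2  # Megatron-style TP: one after attention, one after the MLP
link_latency_s = 0.010    # illustrative 10 ms latency on a congested Wi-Fi link

comm_per_token = n_layers * allreduces_per_layer * link_latency_s
print(f"~{comm_per_token:.2f} s/token spent on link latency alone")
# ~0.64 s/token before any bandwidth cost or compute, which is why high-latency
# wireless links quickly dominate the end-to-end token time.
```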
Forgot to say: I also added a comparison between TPI-LLM and llama.cpp; please see the table in the README for details.
I think that is why distributed-llama is fast on my isolated system (eight Raspberry Pi devices in a cluster on a single gigabit switch): the link latency is very low. The disk latency (since they are all pulling from the same SSD) is higher, which matters a lot more with TPI-LLM due to the sliding window memory scheduler. I didn't realize one of the test beds was running on wireless; getting any throughput on that is actually very, very impressive!

I have news though! I was able to run Llama 3.1 70B Instruct AT FULL PRECISION on the environment (eight ~$70 Raspberry Pi devices) using TPI-LLM, which is just incredible! I have no idea what speed it's actually generating at, as I have not yet been able to determine how best to measure this, but it is running, which is so very impressive! Incredible work!
To clarify, this entire Raspberry Pi cluster cost about $800, which is less than a single 3090 graphics card and definitely less than the cost of a single MacBook Pro. It is incredible that this is able to run at all. I believe an implementation of the sliding window memory scheduling in distributed-llama would be amazing. I have a secondary Raspberry Pi cluster where each node has its own SSD, but I'm missing a USB-to-SATA cable that should arrive tomorrow. I'm excited to try running larger models on this rig to compare speed.
First, let me preface this with the forewarning that I barely understand what I'm doing, but I have been successfully running distributed edge LLMs locally for a while now. Apologies in advance for any technical inaccuracies below.

So far I am compiling your project from scratch, intending to run it on an array of 8 Raspberry Pi 4B 8GB (the full hardware build can be found here: b4rtaz/distributed-llama#122). I have currently only gotten as far as running Llama 3.1 8B Instruct on just two Raspberry Pi nodes, at an abysmal speed (estimating well over 1 minute per token), but it still ran, which is something I have not previously been able to make work on just two nodes. This is extremely impressive. That is to say, however, that I am not very deep into using your project yet.

So far for distributed edge LLMs, I have primarily been using the "distributed llama" project (found here: https://github.com/b4rtaz/distributed-llama), which relies on tensor parallelism and quantized models (some as low as 4 bits). With some overclocking on the Raspberry Pi array, I have been able to run Llama 3.1 8B Instruct at 2.14 tokens/sec (using a highly quantized q4_0 version) on the 8 nodes. I'm also able to run the new Llama 3.2 1B at full F32 at a slightly slower speed of 2.04 tokens/sec on the same rig (using the same distributed llama project).

There are a lot of similarities between your projects, but I believe your approach is superior, as with your project I'm already able to run models at full precision that are too large for two nodes on that project (due to memory requirements, even with the harsh 4-bit quantization). This is miraculous at any speed. I do think the distributed llama project is a bit more mature, and there might be a few takeaways you could learn from and adopt from that project to increase the capabilities of your own (humbly). A couple of quick early observations about your project:
(Similar to --nthreads in the Distributed Llama project) Is there a way to add multi-core support? I'm not seeing all cores in use when the project is 'thinking'.

I have read through your paper, and while I understand the gist of how you are achieving this, I'm not at a level where I truly understand the entire process, or LLMs in general, if I'm being fully honest. I'm just a guy with an array of eight Raspberry Pis who wants to run LLMs locally on them. My hardware is pretty dedicated to running edge LLMs, so if there is anything I can do to help with your project, please let me know.
Lastly, thank you for continuing to advance and push the boundaries of edge AI. As a novice hobbyist, I greatly appreciate your expert efforts.