Poor performance of df.insert and df.to_parquet #7325
Comments
@YarShev Thank you for your reply. In fact, I only added `ray.remote` at the very beginning, so that the initial procedure could be executed on a non-head node.
@yx367563, can you try calling this before `to_parquet`?

```python
import modin.config as cfg

with cfg.context(NPartitions=1):
    df = df._repartition(axis=0)
    df.to_parquet(...)
```

This should write a single parquet file.
@YarShev I tried executing the following code, but it still generates a lot of small files.

@yx367563, sorry, I didn't put the code correctly. Please see the updated comment above.
@YarShev Yes, currently only one part file is generated in the folder, but the performance is still very poor. Can I assume that Modin performs poorly for operations such as writing to storage and inserting columns, and is better suited to computationally intensive operations?
@yx367563, operations such as writing to storage and inserting columns should also perform well, depending on the data size. It would be great if you could share the exact script and data you are using so we can reproduce the performance issue.
@YarShev I'm sorry, I can't provide the exact data file, but I can tell you it's a simple file with 100 columns and about 1 million rows, roughly 500 MB in size.
@YarShev There is another very strange phenomenon. I measured the time taken by each method call and found that if the column-insertion operation is added, the time taken by
In addition, I found that the total number of CPUs in the Ray cluster is used to set `NPartitions` by default. You can change it explicitly:

```python
import modin.config as cfg

cfg.NPartitions.put(<N>)
```

Note that if you set a value much greater than the number of CPUs, you will get an overpartitioned dataframe, which also hurts performance. You can find some performance tips on the Optimizations Notes page.
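The trade-off described above (enough partitions to use the CPUs, but not so many that each partition is tiny) can be sketched with plain arithmetic. This is an illustrative, hypothetical helper, not part of the Modin API; the `min_rows_per_partition` threshold is an assumption of mine:

```python
# Hypothetical helper illustrating the NPartitions trade-off discussed above:
# too few partitions under-uses the cluster, while far more partitions than
# CPUs produces tiny partitions with high scheduling overhead.

def suggest_npartitions(total_cpus: int, n_rows: int,
                        min_rows_per_partition: int = 100_000) -> int:
    """Cap the partition count so each partition keeps a useful amount of work."""
    by_cpus = total_cpus                               # default: one per CPU
    by_rows = max(1, n_rows // min_rows_per_partition)  # keep partitions non-trivial
    return max(1, min(by_cpus, by_rows))

# An 8-node x 32-core cluster (256 CPUs) with the ~1M-row file from this thread:
print(suggest_npartitions(256, 1_000_000))  # -> 10; 256 partitions would be tiny
```

With only about 1 million rows, a heuristic like this caps the count well below the 256-CPU default, which matches the advice above about overpartitioning.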
We will try to reproduce your performance issue with a generated file of the size you mentioned.
Could you provide a script with the operations you perform? That would be helpful, because your description and the script in the issue seem to have some discrepancies.
Could you also share the following info?
@YarShev The general logic is the same. The script provided at the beginning slightly simplifies some unimportant parts. You can refer to this script.

python version: 3.10.12
@YarShev If I have a large number of CPUs in the cluster, say around 300, and the processing logic is as described above, will I still get better performance by setting `NPartitions` to 300?
How many nodes do you have in the Ray cluster?
Actually, you should tune this value for each concrete case; there are no universal recommendations. You should adjust both the CPU count and `NPartitions` per workload.
@YarShev OK. In the Ray cluster I used 8-10 nodes, each with 32 cores.
@YarShev By the way, is there a way to turn off Modin's parallelism and keep only the dataframe data structure Modin uses? And if so, will the performance be the same as pandas?
@YarShev It is not necessary to call into pandas' implementation logic. Just keep Modin's dataframe data structure, but remove the parallel logic and execute serially on a single core. Is this possible?
When you set `NPartitions` to 1, you have a single-partition Modin dataframe, and thus it will be processed on a single core (but in a worker process). Note that this requires Ray to pull data onto that worker process to be able to process the operation. We are trying to avoid this overhead in #7258 and operate directly on a pandas dataframe in the main process.
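As a side note on running without a parallel backend: Modin's configuration exposes an `Engine` setting, and (per the Modin configuration docs, as I understand them) a serial "Python" engine exists for debugging. A hedged sketch, with a broad fallback in case Modin or that engine is unavailable in your environment:

```python
# Sketch: select Modin's serial "Python" engine (an assumption based on the
# Modin configuration docs; intended for debugging, not production use).
try:
    import modin.config as cfg

    cfg.Engine.put("Python")  # serial, single-process execution
    engine = cfg.Engine.get()
except Exception:  # modin absent or engine not supported in this version
    engine = "modin-not-installed"
```

Set this before importing `modin.pandas`, since the engine is picked up when the backend initializes.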
Hi @yx367563, first of all, "achieving performance at least as good as native pandas" is the goal for Modin, but right now it doesn't always hold for some operations or pipelines. The reason for your slowdowns and the phenomena you describe is, as you said, the materialization of data from workers. (This happens when it is necessary for correct execution and wastes a lot of time.) In your pipeline this happens on the `insert` operation, and if you remove it, it moves to the `to_parquet` operation. The next point I would like to draw your attention to is working on a Ray cluster. Modin expects to run on the head node of the cluster, which allows it to manage resources correctly. Using Modin inside a remote function may cause a performance slowdown, so we highly recommend not doing so. If you want to avoid executing on the local machine, you can create a head node on a remote machine and connect to it via ssh or
Since using a cluster increases the slowdowns caused by data transfer to/from workers, we recommend using one only if local execution is not possible (for example, if the data is very large).
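To make the transfer cost concrete, here is a back-of-envelope estimate. The numbers are my own assumptions (the ~500 MB file size from earlier in the thread and a nominal 1 GB/s effective inter-node link), not measurements from this issue:

```python
# Rough one-way transfer-time estimate for materializing data to/from workers.

def transfer_seconds(data_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized wire time, ignoring serialization and scheduling overhead."""
    return data_bytes / bandwidth_bytes_per_s

MB = 1_000_000
seconds = transfer_seconds(500 * MB, 1000 * MB)  # ~500 MB over ~1 GB/s
print(f"{seconds:.1f} s one way")  # repeated materialization multiplies this
```

Even half a second per hop adds up quickly when an operation like `insert` forces repeated materialization, which is consistent with the slowdown described above.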
@Retribution98 In fact, I set the CPU count of the head node to 0 on the Ray cluster and turned on the autoscaler, so if I run it directly on the head node, I encounter the error
Modin version checks

- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest released version of Modin.
- I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
Issue Description
When performing `concat`, `group_by`, and `map` operations, the performance is 2-3 times faster than pandas, but a significant degradation in performance occurs when performing `to_parquet` and `insert` operations, where a large number of small parquet files are observed.
Note: The size of all read-in files is around 400 MB. The Ray cluster is configured with eight 32-core machines.
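The reporter's actual script and data are not included in the thread. For anyone attempting a reproduction, here is a stdlib-only sketch that generates a synthetic CSV of roughly the described shape (100 columns; row count scaled down here for a quick run, and all names are hypothetical):

```python
# Generate a synthetic CSV resembling the file described in this issue:
# 100 numeric columns. Scale n_rows toward 1_000_000 for a ~500 MB file.
import csv
import random
import tempfile

def write_synthetic_csv(path: str, n_rows: int, n_cols: int = 100) -> None:
    rng = random.Random(0)  # fixed seed for repeatability
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"col_{i}" for i in range(n_cols)])
        for _ in range(n_rows):
            writer.writerow([rng.random() for _ in range(n_cols)])

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    path = tmp.name
write_synthetic_csv(path, n_rows=1_000)  # small sample; increase for realism
```

Feeding a file like this through the reported pipeline (read, `insert`, `to_parquet`) on both Modin and pandas should expose the gap described above.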
Expected Behavior
At a minimum, performance should not be much worse than native pandas.
Error Logs
No response
Installed Versions
modin: 0.30.1
ray: 2.23.0