DataFrame.copy(), at least, should be threadsafe #2728
Comments
Right now, pandas is explicitly not thread-safe. Taking any step down this path You can always implement per-object or a global pandas lock in your own code, Pushing back to 0.12, at least. |
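A minimal sketch of the per-object lock suggested above, assuming every reader and writer agrees to go through the wrapper. `Guarded` and `snapshot` are hypothetical helper names, not pandas APIs, and a plain dict of columns stands in for a DataFrame so the sketch runs without pandas installed.

```python
import threading
from contextlib import contextmanager


class Guarded:
    """Pair an object with its own re-entrant lock (hypothetical helper, not a pandas API)."""

    def __init__(self, obj):
        self._obj = obj
        self._lock = threading.RLock()

    @contextmanager
    def locked(self):
        # Every reader and writer must use this context manager; the lock
        # protects nothing if some code path touches the object directly.
        with self._lock:
            yield self._obj


def snapshot(guarded, copier):
    """Return copier(obj), computed while this thread holds the object's lock."""
    with guarded.locked() as obj:
        return copier(obj)


if __name__ == "__main__":
    # A dict of columns stands in for a DataFrame (assumption for illustration).
    table = Guarded({"a": [1, 2, 3], "b": [4, 5, 6]})

    def drop_column(name):
        with table.locked() as cols:
            cols.pop(name, None)

    writer = threading.Thread(target=drop_column, args=("b",))
    writer.start()
    copy_ = snapshot(table, lambda cols: {k: list(v) for k, v in cols.items()})
    writer.join()
    # The copy was taken either wholly before or wholly after the deletion.
    print(sorted(copy_))
```

With a real DataFrame one would presumably pass `lambda df: df.copy()` as `copier`. The cost of this design is that all access is serialized, including reads, which is the trade-off the rest of the thread argues about.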
That's not quite true-- for example most things are threadsafe and we've ensured that e.g. IO functions can be run in separate threads. Perhaps we should just acquire a lock inside the copy functions for now |
copy might be thread safe with a single dtype (but prob not) |
I was thinking of #2440. Perhaps parts of pandas are thread-safe, but |
I cannot replicate the error posted, see this gist. Either there have been new developments in atomicity of (Tried under both Python 2.7 and 3.5) |
this almost certainly had to do with the unsafe-threadness (is that a word?) with |
Good! I've got 2.4.6 (under conda), upgraded to 2.5 and got the same. |
Doubtful: See: #25870 It seems you can't currently use a pandas Series for a 'read-only' hash lookup in a threaded environment. It fails on the second call to Series.reindex(..., copy=True). I was extremely surprised by this, thinking the operation to be non-mutating. I would have expected any hidden object state, such as built indexes, to be finalized at the end of the first call, and subsequent calls to be safe. |
I'm deeply confused about this issue. I ask because, my expectation is that
... where |
Why is threading useful with Pandas? Threading helps when you have a lot of IO. But with Pandas you do a lot of CPU-bound work. In that case multiprocessing would be much better, if you have enough RAM. |
There are lots of reasons to want threading that are unrelated to IO. In
fact, in most languages *except* Python threading is the first choice for
parallelism.
In my use case, I was hoping to use Pandas to hold a large static datatable
(~8Gb) to answer optimized web requests (where a database would have been
excessively slow).
Python's forking/spawning of separate processes can carry excessive
overheads shifting data between the processes, or try-as-you-might
copy-on-write ends up consuming a lot of memory if you have a lot of
processes.
If your data-access is shared-access-read-only, being able to access it in
a threaded fashion is optimal. Threading is frowned upon in Python
circles only because it hasn't been able to shake itself of the unresolved
GIL design bug. However, the GIL is a non-issue for me because most of the
heavy-lifting can be done by C code not involved with the interpreter.
As for Pandas, it is not threadsafe in any sense that you can rely on.
Not even for read-only use cases, because pandas is not read-only, even
for reading-type operations.
https://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe/55382886#55382886
- Stuart
…On Tue, Aug 10, 2021 at 1:52 PM Codeberg-AsGithubAlternative-buhtz < ***@***.***> wrote:
Why is threading useful with Pandas?
Threading helps when you have a lot of IO.
But with Pandas you do a lot of CPU stuff. In that case multiprocessing
would be much better - if you have enough RAM.
|
Thanks for your thoughts, which help me dive deeper into Pandas-thinking. ;) I am aware of Python's "GIL problem". But in some cases it can be used as an advantage. E.g. in the context of non-thread-safe Pandas, I have to duplicate the data across the processes and then no longer have to think about race conditions. But am I right to say that threads always run on the same CPU core, no matter which language (C, Python) they come from?
It is not "parallel" when running on the same core, IMHO.
That is a nice IO use case. Web requests are IO because the thread spends a lot of time waiting for the data. |
It is not "parallel" when running on the same Core - IMHO.
That is true for the Python interpreter only. Generally, threads of a
single process can use multiple cores, and vectorized code (called from
Python) can make use of multiple cores.
In my use case, I was hoping to use Pandas to hold a large static datatable
(~8Gb) to answer optimized web requests
That is a nice IO use case. Web requests are IO because the thread has to
wait a lot of time for the data.
In my case, Python was the database. I had static data and the need to
aggregate and process some hundreds of thousands of records for each request.
SQL doesn’t provide an efficient query language for large matrix operations
(our queries took under a second in memory, but some minutes to run in SQL,
even with careful indexing). This case is not unusual: using pandas or
numpy to do what is too slow or cumbersome in SQL.
So, CPU bound, not IO bound, since there was no external database to wait
on.
To resolve this problem, we switched to numpy for these queries, since
pandas could not serve multiple concurrent queries safely.
|
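The point above, that threads of one process can use several cores when the heavy lifting happens in C code, can be sketched with the standard library alone. This is an illustration under stated assumptions, not a benchmark: `hashlib` stands in for vectorized numpy code because it is documented to release the GIL while hashing buffers larger than 2047 bytes, and `digest` and `digest_all` are hypothetical names chosen for this example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(buf):
    # hashlib releases the GIL while hashing large buffers, so several of
    # these calls can make progress on different cores at the same time.
    return hashlib.sha256(buf).hexdigest()


def digest_all(buffers, workers=4):
    """Hash every buffer on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest, buffers))


if __name__ == "__main__":
    # Shared, read-only inputs: no thread mutates them, so no lock is needed.
    buffers = [bytes([i]) * (8 * 1024 * 1024) for i in range(4)]
    assert digest_all(buffers) == [digest(b) for b in buffers]
```

Whether this actually runs faster than a sequential loop depends on the machine and the workload. The sketch only shows the pattern: immutable shared data, C-level work that drops the GIL, and plain threads instead of separate processes.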
a decade's passed and nobody's implemented this - let's close for now then |
I think it would make sense to consider reopening this issue. Multi-threaded pandas is increasingly common (e.g., inside Dask) and will only become more common in the future with the removal of Python's GIL. |
DataFrame.copy() should happen atomically/be threadsafe, meaning that it should produce a consistent DataFrame even if the call to .copy() is made while another thread is deleting entries from the DataFrame, or if another thread calls a deletion method while the call to .copy() is working (in other words, I guess .copy() should acquire a lock that prevents mutation during the copy). That is, the following code, which crashes in 0.7.3, should succeed: