DataFrame.copy(), at least, should be threadsafe #2728

Open
bshanks opened this issue Jan 23, 2013 · 15 comments
Labels
Bug Multithreading Parallelism in pandas

Comments

@bshanks

bshanks commented Jan 23, 2013

DataFrame.copy() should happen atomically, i.e. be threadsafe: it should produce a consistent DataFrame even if .copy() is called while another thread is deleting entries from the DataFrame, or if another thread calls a deletion method while .copy() is still running. In other words, I guess .copy() should acquire a lock that prevents mutation during the copy. That is, the following code, which crashes in 0.7.3, should succeed:

import pandas
import threading

df = pandas.DataFrame()

def mutateDf(df):
    while True:
        df[0] = pandas.Series([1,2,3])
        del df[0]

def readDf(df):
    while True:
        dfCopy = df.copy()
        if 0 in dfCopy and 1 in dfCopy[0]:
            a = dfCopy[0][1]

t1 = threading.Thread(target=mutateDf, args=(df,))
t2 = threading.Thread(target=readDf, args=(df,))

t1.start()
t2.start()

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "<ipython-input-5-8aef72c7f1b4>", line 4, in readDf
    if 0 in dfCopy and 1 in dfCopy[0]:
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1458, in __getitem__
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 294, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 625, in get
    _, block = self._find_block(item)
TypeError: 'NoneType' object is not iterable
@ghost

ghost commented Mar 16, 2013

Right now, pandas is explicitly not thread-safe. Taking any step down this path
will inevitably generate lots of pain and changes all over. Python threads see more
limited use than threads in other languages, so the upside is correspondingly limited.

You can always implement a per-object or global pandas lock in your own code,
if threads are what you want.

Pushing back to 0.12, at least.
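
For readers who want to try the per-object lock suggested above, here is a minimal sketch. `LockedFrame`, `mutate`, and `snapshot` are hypothetical names and not part of pandas. A plain dict stands in for the DataFrame so the example runs without pandas installed; the wrapper only relies on the wrapped object having a `.copy()` method.

```python
import threading


class LockedFrame:
    """Hypothetical wrapper (not a pandas API): one lock per wrapped object."""

    def __init__(self, frame):
        self._frame = frame
        self._lock = threading.Lock()

    def mutate(self, fn):
        """Run fn(frame) while holding the lock; fn may modify the frame in place."""
        with self._lock:
            return fn(self._frame)

    def snapshot(self):
        """Return frame.copy(), taken while no mutate() call is in progress."""
        with self._lock:
            return self._frame.copy()


def add_then_delete(frame):
    # Mirrors the mutator in the report: add key 0, then remove it again.
    frame[0] = [1, 2, 3]
    del frame[0]


# Assumption: a dict stands in for the DataFrame in this demo.
shared = LockedFrame({})
snapshots = []


def writer():
    for _ in range(1000):
        shared.mutate(add_then_delete)


def reader():
    for _ in range(1000):
        snapshots.append(shared.snapshot())


threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each snapshot is taken between complete add/delete cycles,
# so no snapshot ever contains the half-finished key 0.
print(all(0 not in snap for snap in snapshots))
```

The lock only helps if every thread goes through the wrapper; code that touches the underlying object directly bypasses it. Passing a pandas DataFrame instead of the dict is expected to work the same way, but that is not exercised here.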

@wesm
Member

wesm commented Mar 16, 2013

That's not quite true: most things are threadsafe, and we've ensured that e.g. IO functions can be run in separate threads. Perhaps we should just acquire a lock inside the copy functions for now.

@jreback
Contributor

jreback commented Mar 16, 2013

copy might be thread safe with a single dtype (but prob not)
frames with multiple dtypes are currently not thread safe
(as @wesm points out, a lock will fix all this)
I would be in favor of providing this as an option, default to False though

@ghost

ghost commented Mar 16, 2013

I was thinking of #2440. Perhaps parts of pandas are thread-safe, but
AFAIK there's no list of what's safe and what isn't, and users have hit the
unsafe parts before when they tried.

@kokes
Contributor

kokes commented Mar 10, 2016

I cannot replicate the error posted, see this gist. Either there have been new developments in atomicity of pandas or perhaps threading has a different scheduler, or...?

(Tried under both Python 2.7 and 3.5)

@jreback
Contributor

jreback commented Mar 10, 2016

this almost certainly had to do with the unsafe-threadness (is that a word?) of numexpr. numexpr>=2.5 (and even >=2.4.6) no longer mucks with the global thread state. @kokes what version do you have?

@kokes
Contributor

kokes commented Mar 10, 2016

Good! I've got 2.4.6 (under conda), upgraded to 2.5 and got the same.

@allComputableThings

allComputableThings commented Mar 25, 2019

copy might be thread safe with a single dtype

Doubtful: See: #25870

It seems you can't currently use a pandas Series for a 'read-only' hash lookup in a threaded environment.

It fails on the second call to Series.reindex(..., copy=True) - I was extremely surprised by it, thinking the operation to be non-mutating. I would have expected any hidden object state, such as built indexes, to be finalized at the end of the first call, and subsequent calls to be safe.

@allComputableThings

I'm deeply confused about this issue.
The original discussion was that .copy is not thread safe. My assumption was that it would not be, because someone may be writing to the dataframe. However, is .copy also unsafe in other situations (where no-one is performing impure functional operations, such as modifying the columns/index/labels/cells of a dataframe or series)?

I ask because my expectation is that Series.reindex(..., copy=True) is a pure function (except for memoization of things like the internal index). Yet it seems not to be thread safe even while no other types of operation are happening. Copy is happening, but no-one is writing. So what?

s = pd.Series(...)
f(s)  # Success!

# Thread 1:
   while True: f(s)  

# Thread 2:
   while True: f(s)  # BANG! Exception !

... where f(s): s.reindex(..., copy=True). Can the thread-unsafeness of .copy really be the cause?

@buhtz

buhtz commented Aug 10, 2021

Why is threading useful with Pandas?
Threading helps when you have a lot of IO.

But with Pandas you do a lot of CPU stuff. In that case multiprocessing would be much better - if you have enough RAM.
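
The IO side of that argument can be sketched with the standard library alone. In this hypothetical example, `fake_request` uses `time.sleep` as a stand-in for waiting on a socket or disk; because sleeping releases the GIL, the four waits overlap instead of running back to back.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i):
    # Stand-in for an IO wait (network, disk); sleeping releases the GIL.
    time.sleep(0.2)
    return i * 2


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order in its results.
    results = list(pool.map(fake_request, range(4)))
elapsed = time.perf_counter() - start

# Run serially, the four waits would take about 0.8 s in total.
print(results, round(elapsed, 2))
```

For CPU-bound pure-Python work the waits do not overlap like this under the GIL, which is where `concurrent.futures.ProcessPoolExecutor` (at the cost of copying data between processes) becomes the usual alternative.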

@allComputableThings

allComputableThings commented Aug 10, 2021 via email

@buhtz

buhtz commented Aug 11, 2021

Thanks for your thoughts which help to dive more into Panda-thinking. ;)

I am aware of Python's "GIL-problem". But in some cases it can be used as an advantage. E.g. in the context of non-thread-safe Pandas, I have to duplicate the data across the processes and then do not have to think about race conditions anymore.

But am I right that threads always run on the same CPU core, no matter which language (C, Python) they are written in?

There are lots of reasons to want threading that are unrelated to IO.
In fact, in most languages except Python threading is the first
choice for parallelism.

It is not "parallel" when running on the same Core - IMHO.

In my use case, I was hoping to use Pandas to hold a large static datatable (~8Gb) to answer optimized web requests

That is a nice IO use case. Web requests are IO because the thread has to wait a lot of time for the data.

@allComputableThings

allComputableThings commented Aug 11, 2021 via email

@MarcoGorelli
Member

I would be in favor of providing this as an option, default to False though

a decade's passed and nobody's implemented this - let's close for now then

@shoyer
Member

shoyer commented Feb 12, 2025

I think it would make sense to consider reopening this issue. Multi-threaded pandas is increasingly common (e.g., inside Dask) and will only become more common in the future with the removal of Python's GIL.

@MarcoGorelli MarcoGorelli reopened this Feb 12, 2025

9 participants