BUG: Am I screwing up ?? [OVERRIDING APPENDING] #53600

Closed
3 tasks done
seychelles111 opened this issue Jun 11, 2023 · 4 comments


seychelles111 commented Jun 11, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Hey, I have a file over 30 MB in size (that's okay, since Python is really powerful AND FAST, WHICH PEOPLE DON'T UNDERSTAND, although I love R too).

import random
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Shared, initially empty DataFrame that every worker thread appends to.
FINAL_CRASH_DF = pd.DataFrame(dict())
FINAL_CRASH_DF['oracle_POSTAL_CODE']    = ""
FINAL_CRASH_DF['oracle_CITY']           = ""
FINAL_CRASH_DF['oracle_SUBDISTRICT']    = ""
FINAL_CRASH_DF['oracle_DISTRICT']       = ""

FINAL_CRASH_DF['debug_original_addr']        = ""
FINAL_CRASH_DF['debug_original_ZIP']         = ""
FINAL_CRASH_DF['debug_original_STATE']       = ""

def new_FIXED_LOCATION(ii):
    df = FINAL_CRASH_DF
    # Append one row at the next integer label. Two threads can read the
    # same len(df.index) and then write to the same label (the overriding).
    df.loc[len(df.index)] = [random.randint(2, 399), 7, 5, random.randint(2, 399), 1, 1, 1]

# Run the append 100 times across 10 worker threads.
with ThreadPoolExecutor(max_workers=10) as executor:
    future = executor.map(new_FIXED_LOCATION, range(100))

Anyway, I am trying to use ThreadPoolExecutor with

    df.loc[len(df.index)] = [random.randint(2,399), 7,5 ,random.randint(2,399) ,1 ,1 ,1]

I wonder whether combining df.index with multithreading is a bad idea, and how I can solve it?

Sorry, I am from Cambodia (formerly Kampuchea) and I have very bad grammar :) but the communication got through, so it's okay I guess.



### Issue Description

concurrency with df.index

### Expected Behavior

multithreading, **without overriding**

### Installed Versions

<details>


INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.10.11.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19045
machine          : AMD64
processor        : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : en
LOCALE           : Kampuchyean-Legacy-Linguistic-1222
pandas           : 1.5.2
numpy            : 1.24.2
pytz             : 2022.4
dateutil         : 2.8.2
setuptools       : 67.7.2
pip              : 23.1.2
Cython           : 0.29.34
pytest           : 7.2.1
hypothesis       : 6.56.2
sphinx           : 6.1.3
blosc            : 1.10.6
feather          : 0.4.1
xlsxwriter       : 3.0.3
lxml.etree       : 4.9.0
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.13.1
pandas_datareader: 0.10.0
bs4              : 4.11.1
bottleneck       : 1.3.4
brotli           : 1.0.9
fastparquet      : 2023.4.0
fsspec           : 2023.5.0
gcsfs            : None
matplotlib       : 3.7.1
numba            : 0.57.0
numexpr          : 2.8.4
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : 12.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.10.1
snappy           : 
sqlalchemy       : 1.4.47
tables           : 3.7.0
tabulate         : 0.9.0
xarray           : 2023.4.2
xlrd             : None
xlwt             : None
zstandard        : 0.20.0
tzdata           : 2022.7
</details>
@seychelles111 seychelles111 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 11, 2023
@seychelles111
Author

seychelles111 commented Jun 11, 2023

hey guys 2002-born zoomer here again,
Is pd.concat thread-safe?
And what about global FINAL_CRASH_DF: does global allow communication between threads?

Sorry guys, all I know is copying and pasting from Stack Overflow, lol.


from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import random

def new_FIXED_LOCATION(ii):
    global FINAL_CRASH_DF
    
    # Step 1: Create a dictionary of column names and their corresponding values
    data = {'oracle_POSTAL_CODE': random.randint(2, 399),
            'oracle_CITY': 7,
            'oracle_SUBDISTRICT': 5,
            'oracle_DISTRICT': random.randint(2, 399),
            'debug_original_addr': 1,
            'debug_original_ZIP': 1,
            'debug_original_STATE': 1}
    
    # Step 2: Create a DataFrame from the dictionary for a single row of data
    new_row = pd.DataFrame(data, index=[0])
    
    # Step 3: Concatenate the existing DataFrame (FINAL_CRASH_DF) with the new row
    FINAL_CRASH_DF = pd.concat([FINAL_CRASH_DF, new_row], ignore_index=True)
    
FINAL_CRASH_DF = pd.DataFrame(columns=['oracle_POSTAL_CODE', 'oracle_CITY', 'oracle_SUBDISTRICT', 'oracle_DISTRICT',
                                       'debug_original_addr', 'debug_original_ZIP', 'debug_original_STATE'])

with ThreadPoolExecutor(max_workers=5) as executor:
    future = executor.map(new_FIXED_LOCATION, range(100))
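
A note on why the `global` + `pd.concat` version above can still lose rows: each call reads `FINAL_CRASH_DF`, builds a new frame, and then rebinds the global, so two threads that read the same snapshot will each write back a frame that is missing the other's row. The sketch below replays that interleaving by hand in a single thread so the outcome is deterministic. The two-column frame, the seed row, and the names `snapshot_a`, `snapshot_b`, `row_a`, `row_b` are illustrative assumptions, not part of the original report.

```python
import pandas as pd

# Hypothetical, shortened stand-in for FINAL_CRASH_DF with one existing row.
shared = pd.DataFrame({"oracle_POSTAL_CODE": [1], "oracle_CITY": [7]})

# Both "threads" read the shared frame before either one writes back.
snapshot_a = shared
snapshot_b = shared

row_a = pd.DataFrame({"oracle_POSTAL_CODE": [101], "oracle_CITY": [7]})
row_b = pd.DataFrame({"oracle_POSTAL_CODE": [202], "oracle_CITY": [7]})

# Each "thread" concatenates its own row onto its (now stale) snapshot.
result_a = pd.concat([snapshot_a, row_a], ignore_index=True)
result_b = pd.concat([snapshot_b, row_b], ignore_index=True)

# Each "thread" rebinds the shared name; the second write discards row_a.
shared = result_a
shared = result_b

print(len(shared))  # 2 rows instead of the 3 one would expect
```

In a real thread pool the interleaving is up to the scheduler, so the loss may show up only occasionally, which is what makes it hard to spot.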

@seychelles111
Author

[image]
Is ChatGPT right?

@Jython1415
Contributor

Am I right in assuming that your core question is, "is pd.concat thread-safe?" I do not know for pd.concat specifically, but I can point you to some resources that discuss multithreaded usage of pandas. In general, the recommendation seems to be to not assume pandas is thread-safe, especially if you are using the copy() method.

Maybe someone else can provide a better answer about pd.concat() more specifically.

> hey guys 2002-born zoomer here again,

Me too! Hello 👋
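
If the appends really do have to happen inside the worker threads, one conservative option, given the advice not to assume pandas is thread-safe, is to serialise every mutation of the shared DataFrame with a `threading.Lock`. This is only a sketch: `append_lock`, `COLUMNS`, and `new_fixed_location_locked` are hypothetical names, and the lock means the append step itself no longer runs in parallel.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

COLUMNS = ['oracle_POSTAL_CODE', 'oracle_CITY', 'oracle_SUBDISTRICT', 'oracle_DISTRICT',
           'debug_original_addr', 'debug_original_ZIP', 'debug_original_STATE']

FINAL_CRASH_DF = pd.DataFrame(columns=COLUMNS)
append_lock = threading.Lock()  # hypothetical name; guards every write to FINAL_CRASH_DF

def new_fixed_location_locked(ii):
    # Build the row outside the lock so only the append itself is serialised.
    row = [random.randint(2, 399), 7, 5, random.randint(2, 399), 1, 1, 1]
    with append_lock:
        # Reading the length and writing the row now happen as one step,
        # so no other thread can claim the same row label in between.
        FINAL_CRASH_DF.loc[len(FINAL_CRASH_DF.index)] = row

with ThreadPoolExecutor(max_workers=10) as executor:
    # Consuming the iterator re-raises any exception from the workers.
    list(executor.map(new_fixed_location_locked, range(100)))

print(len(FINAL_CRASH_DF))
```

Wrapping `executor.map` in `list(...)` matters here: without consuming the results, an exception raised inside a worker would pass silently.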

@topper-123
Contributor

topper-123 commented Jun 13, 2023

pandas doesn't promise thread safety unless an operation is explicitly marked as thread-safe, so the answer is almost/probably never.
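
Since no thread-safety guarantee is given, a simple way to sidestep the question is to keep pandas out of the worker threads entirely: let each worker return a plain dict and build the DataFrame once in the main thread. The following is a minimal sketch under that assumption; `build_row`, `COLUMNS`, and `result_df` are hypothetical names, and the row values are copied from the example in this issue.

```python
import random
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

COLUMNS = ['oracle_POSTAL_CODE', 'oracle_CITY', 'oracle_SUBDISTRICT', 'oracle_DISTRICT',
           'debug_original_addr', 'debug_original_ZIP', 'debug_original_STATE']

def build_row(ii):
    # Workers only create and return a plain dict; no shared state is touched.
    return {'oracle_POSTAL_CODE': random.randint(2, 399),
            'oracle_CITY': 7,
            'oracle_SUBDISTRICT': 5,
            'oracle_DISTRICT': random.randint(2, 399),
            'debug_original_addr': 1,
            'debug_original_ZIP': 1,
            'debug_original_STATE': 1}

with ThreadPoolExecutor(max_workers=10) as executor:
    # executor.map yields results in input order; list() collects them all.
    rows = list(executor.map(build_row, range(100)))

# A single DataFrame construction in the main thread, after all workers finish.
result_df = pd.DataFrame(rows, columns=COLUMNS)
print(result_df.shape)
```

Building the frame once also avoids the repeated copying that row-by-row `pd.concat` or `df.loc` enlargement incurs.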

@topper-123 topper-123 added Usage Question and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 13, 2023