Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Datatypes not preserved on pd.read_excel #60088

Closed
2 of 3 tasks
vignesh14052002 opened this issue Oct 23, 2024 · 5 comments
Closed
2 of 3 tasks

BUG: Datatypes not preserved on pd.read_excel #60088

vignesh14052002 opened this issue Oct 23, 2024 · 5 comments
Labels
Bug IO Excel read_excel, to_excel Upstream issue Issue related to pandas dependency

Comments

@vignesh14052002
Copy link

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.read_excel("./preserve_data.xlsx")

Issue Description

Input data
image

Datatype is preserved, but values are modified

TRUE -> 1
image

values are modified, even if i read as string

TRUE <-> 1
image

additionally, I want TRUE in uppercase, if it is changed to True, i can't find difference if user has 'True in a cell

Expected Behavior

There should be a way to preserve values and datatypes as it is

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.11.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_India.1252

pandas : 2.2.2
numpy : 1.26.2
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.0.2
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.23
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@vignesh14052002 vignesh14052002 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 23, 2024
@asishm asishm added the IO Excel read_excel, to_excel label Oct 23, 2024
@ZKaoChi
Copy link
Contributor

ZKaoChi commented Oct 30, 2024

I think this is worth changing, can I take it ?

@ZKaoChi
Copy link
Contributor

ZKaoChi commented Oct 31, 2024

Hello, I tested excel and found that as long as there is only true in the table, it will be converted to TRUE, which is a keyword in excel. So maybe user will not have True in a cell.

@ZKaoChi
Copy link
Contributor

ZKaoChi commented Oct 31, 2024

But TRUE is parsed as True in the first column and 1 in the back, which I think is worth changing.Maybe we should make the default values the same for different columns

@ZKaoChi
Copy link
Contributor

ZKaoChi commented Oct 31, 2024

I created a excel to test it. The data is:

      a     b     c     d     e     f     g  h     i
0  True     1  True  True     1     1  True  1  True
1  True  True     1  True     1  True     1  1  True
2  True  True  True     1  True     1     1  1  True

If we read it directly, the result is:

>>> pd.read_excel("./123.xlsx")
      a  b  c  d  e  f  g  h     i
0  True  1  1  1  1  1  1  1  True
1  True  1  1  1  1  1  1  1  True
2  True  1  1  1  1  1  1  1  True

If we read it with dtype = str, the result is:

>>> pd.read_excel("./123.xlsx",dtype=str)
      a  b     c     d  e  f     g  h     i
0  True  1  True  True  1  1  True  1  True
1  True  1  True  True  1  1  True  1  True
2  True  1  True  True  1  1  True  1  True

I think it has something to do with the way the data is created.

@rhshadrach
Copy link
Member

As the docstring states, use dtype=object if you do not want pandas to do any inference on the dtype.

I checked both calamine and openpyxl, both readers are reading integer or Boolean values instead of e.g. TRUE. You can see this for openpyxl with:

from openpyxl import load_workbook
wb = load_workbook('test.xlsx', data_only=True)
for row in wb.worksheets[0].rows:
    for cell in row:
        print(cell, cell.value, cell.internal_value, cell.data_type)

As pandas only gets values through third-party libraries, they would need to support this first. It is likely there is a technical limitation in the Excel spec that prevents this, but I'm not certain.

As there is nothing pandas can do here, closing this issue.

@rhshadrach rhshadrach added Upstream issue Issue related to pandas dependency and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel Upstream issue Issue related to pandas dependency
Projects
None yet
Development

No branches or pull requests

4 participants