BUG: Datatypes not preserved on pd.read_excel #60088

vignesh14052002 · 2024-10-23T14:28:57Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.read_excel("./preserve_data.xlsx")

Issue Description

Input data

Datatype is preserved, but values are modified

TRUE -> 1

values are modified, even if i read as string

TRUE <-> 1

additionally, I want TRUE in uppercase, if it is changed to True, i can't find difference if user has 'True in a cell

Expected Behavior

There should be a way to preserve values and datatypes as it is

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.11.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_India.1252

pandas : 2.2.2
numpy : 1.26.2
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.0.2
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.9.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.23
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

The text was updated successfully, but these errors were encountered:

ZKaoChi · 2024-10-30T10:47:53Z

I think this is worth changing, can I take it ?

ZKaoChi · 2024-10-31T08:15:40Z

Hello, I tested excel and found that as long as there is only true in the table, it will be converted to TRUE, which is a keyword in excel. So maybe user will not have True in a cell.

ZKaoChi · 2024-10-31T08:19:37Z

But TRUE is parsed as True in the first column and 1 in the back, which I think is worth changing.Maybe we should make the default values the same for different columns

ZKaoChi · 2024-10-31T10:40:57Z

I created a excel to test it. The data is:

      a     b     c     d     e     f     g  h     i
0  True     1  True  True     1     1  True  1  True
1  True  True     1  True     1  True     1  1  True
2  True  True  True     1  True     1     1  1  True

If we read it directly, the result is：

>>> pd.read_excel("./123.xlsx")
      a  b  c  d  e  f  g  h     i
0  True  1  1  1  1  1  1  1  True
1  True  1  1  1  1  1  1  1  True
2  True  1  1  1  1  1  1  1  True

If we read it with dtype = str, the result is:

>>> pd.read_excel("./123.xlsx",dtype=str)
      a  b     c     d  e  f     g  h     i
0  True  1  True  True  1  1  True  1  True
1  True  1  True  True  1  1  True  1  True
2  True  1  True  True  1  1  True  1  True

I think it has something to do with the way the data is created.

rhshadrach · 2024-11-02T12:54:43Z

As the docstring states, use dtype=object if you do not want pandas to do any inference on the dtype.

I checked both calamine and openpyxl, both readers are reading integer or Boolean values instead of e.g. TRUE. You can see this for openpyxl with:

from openpyxl import load_workbook
wb = load_workbook('test.xlsx', data_only=True)
for row in wb.worksheets[0].rows:
    for cell in row:
        print(cell, cell.value, cell.internal_value, cell.data_type)

As pandas only gets values through third-party libraries, they would need to support this first. It is likely there is a technical limitation in the Excel spec that prevents this, but I'm not certain.

As there is nothing pandas can do here, closing this issue.

vignesh14052002 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 23, 2024

asishm added the IO Excel read_excel, to_excel label Oct 23, 2024

rhshadrach closed this as completed Nov 2, 2024

rhshadrach added Upstream issue Issue related to pandas dependency and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Datatypes not preserved on pd.read_excel #60088

BUG: Datatypes not preserved on pd.read_excel #60088

vignesh14052002 commented Oct 23, 2024

INSTALLED VERSIONS

ZKaoChi commented Oct 30, 2024

ZKaoChi commented Oct 31, 2024 •

edited

Loading

ZKaoChi commented Oct 31, 2024

ZKaoChi commented Oct 31, 2024

rhshadrach commented Nov 2, 2024

BUG: Datatypes not preserved on pd.read_excel #60088

BUG: Datatypes not preserved on pd.read_excel #60088

Comments

vignesh14052002 commented Oct 23, 2024

Pandas version checks

Reproducible Example

Issue Description

Datatype is preserved, but values are modified

values are modified, even if i read as string

Expected Behavior

Installed Versions

INSTALLED VERSIONS

ZKaoChi commented Oct 30, 2024

ZKaoChi commented Oct 31, 2024 • edited Loading

ZKaoChi commented Oct 31, 2024

ZKaoChi commented Oct 31, 2024

rhshadrach commented Nov 2, 2024

ZKaoChi commented Oct 31, 2024 •

edited

Loading