BUG: Reading large CSV files with pyarrow when values contain newline character. #59009

Open
2 of 3 tasks
matteosantama opened this issue Jun 13, 2024 · 10 comments
Assignees
Labels
Arrow pyarrow functionality Bug IO CSV read_csv, to_csv

Comments

@matteosantama
Contributor

matteosantama commented Jun 13, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

rows = []
for i in range(1_000_000):
    rows.append({"text": "ab\ncd", "i": i})

df = pd.DataFrame(rows)
df.to_csv("./example.csv", index=False)
pd.read_csv("./example.csv", engine="pyarrow")

Issue Description

pd.read_csv fails when reading large CSV files with engine="pyarrow" if values contain newline characters. The error is

ParserError: CSV parser got out of sync with chunker. This can mean the 
data file contains cell values spanning multiple lines; please consider 
enabling the option 'newlines_in_values'.

Note the file must be large to trigger the error. Either pandas should enable this flag internally, or expose the option to the user.

Expected Behavior

Reading the file succeeds with engine="python" and I would expect consistency between the two options.
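Until the two engines behave consistently, one possible stopgap is to try the pyarrow engine first and fall back to the python engine when it fails. The sketch below is only an illustration: `read_csv_with_fallback` is a hypothetical helper, not part of pandas, and it assumes a pandas version that surfaces the pyarrow failure as `pd.errors.ParserError`, as in the traceback above (other versions may raise a different exception type).

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, **kwargs):
    """Hypothetical helper: try engine="pyarrow", fall back to engine="python"."""
    try:
        return pd.read_csv(path, engine="pyarrow", **kwargs)
    except (pd.errors.ParserError, ImportError):
        # ParserError: the pyarrow reader failed, e.g. on quoted newlines in a large file.
        # ImportError: pyarrow is not installed.
        return pd.read_csv(path, engine="python", **kwargs)


# Small demonstration file; a file this small is not expected to trigger the error,
# so this only shows that both code paths return the same data.
expected = pd.DataFrame({"text": ["ab\ncd"] * 100, "i": range(100)})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    expected.to_csv(path, index=False)
    result = read_csv_with_fallback(path)
```

The fallback gives up the speed of the pyarrow engine for the affected files, so it is a workaround rather than a fix.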

Installed Versions

In [7]: pd.show_versions()

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.11.9.final.0
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:13:18 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6030
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : None
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@matteosantama matteosantama added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 13, 2024
@matteosantama
Contributor Author

From the latest pyarrow documentation

newlines_in_values, optional (default False)
Whether newline characters are allowed in CSV values. Setting this to True reduces the performance of multi-threaded CSV reading.

Enabling it by default would probably be a mistake. The pyarrow engine (with its multi-threaded capabilities) is the preferred option for large CSV files, though, so it'd be a shame for it to fail in this scenario.

If the pyarrow engine is here to stay, I'd recommend exposing newlines_in_values to the user.

@tilovashahrin
Contributor

To keep the pyarrow engine, you'll need to use the pyarrow library directly to handle CSV files that contain newline characters. This involves using the ParseOptions class from pyarrow.csv to set the newlines_in_values option to True.

Example

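import pandas as pd
from pyarrow import csv as pv  # the code below uses the pv alias

rows = []
for i in range(1_000_000):
    rows.append({"text": "ab\ncd", "i": i})

df = pd.DataFrame(rows)
# Write the CSV file that is read back below
df.to_csv("example.csv", index=False)

# Define parse options to allow newlines in values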
parse_options = pv.ParseOptions(newlines_in_values=True)

# Read the CSV file using pyarrow
table = pv.read_csv("example.csv", parse_options=parse_options)

# Convert the Arrow Table to a Pandas DataFrame
df = table.to_pandas()
df

@mroeschke mroeschke added IO CSV read_csv, to_csv Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 8, 2024
@gosuchoi

gosuchoi commented Jul 8, 2024

take

1 similar comment
@wooseogchoi
Contributor

take

@gosuchoi gosuchoi removed their assignment Aug 20, 2024
@wooseogchoi
Contributor

take

@wooseogchoi
Contributor

@WillAyd
I would like to introduce a new argument to expose pyarrow's 'newlines_in_values' option to the user, because I cannot find a suitable one among the current parameters. Could you please suggest a name for the new parameter, bearing in mind that it might also be used by other engines in the future?

@WillAyd
Member

WillAyd commented Aug 21, 2024

Reading through the issue I don't think we actually want to change anything here - the solution from @tilovashahrin should work.

Can you check if that works for you? If so, we should add a test for it to pandas (if one doesn't already exist) and maybe update the documentation to show how to do it

@wooseogchoi
Contributor

@WillAyd
With some modification, the code above works. I will add it as an example in the read_csv doc.
I will also check the test cases and add one if none exists.
Thanks

@wooseogchoi
Contributor

@WillAyd
Hi, I opened a PR to resolve this issue.
As part of it, I added one test case with pyarrow. Whenever I ran it in my environment, it passed: the exception was raised.
However, when I pushed it to the PR, the checks failed because the exception was NOT raised. Can you please help me?
Thank you

@wooseogchoi
Contributor

#59754
