Feature/python 312 pyarrow strings #129

Draft · wants to merge 5 commits into main
Conversation

@jayckaiser (Collaborator) commented Sep 27, 2024

This is a research branch trying to resolve the fact that PyArrow strings are not being used under Python 3.12. There are two main changes:

  • Force the datatype in read_csv() from str (i.e., Python object strings) to "string" (i.e., the default pandas string dtype); see the sketch below.
  • Remove forced string-deconversion from FileSource.execute().
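
As a rough illustration of the first change (a minimal sketch, not the actual FileSource code), the dtype passed to pandas shifts from Python object strings to the dedicated string dtype:

import io
import pandas as pd

csv_data = io.StringIO("id,name\n1,alice\n2,bob\n")

# Before: dtype=str reads every column as Python str objects in a generic object column.
df_old = pd.read_csv(csv_data, dtype=str)
print(df_old.dtypes)    # id: object, name: object

csv_data.seek(0)

# After: dtype="string" requests pandas' dedicated string dtype (StringDtype),
# which can be backed by PyArrow depending on the configured string storage.
df_new = pd.read_csv(csv_data, dtype="string")
print(df_new.dtypes)    # id: string, name: string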

That second change causes breakages when using nested JSON data (since we are no longer using generic object datatypes). The following fixes are required:

  • Overload the fromjson() Jinja macro to use ast.literal_eval() when the JSON uses single quotes (see the sketch below).
  • Require fromjson() to be applied in Jinja templates in YAML when retrieving nested fields.
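
A minimal sketch of what such an overloaded fromjson() helper might look like (the actual earthmover macro may differ): try strict JSON parsing first and fall back to ast.literal_eval() for single-quoted, Python-repr-style payloads.

import ast
import json

def fromjson(value):
    """Parse a JSON (or dict-repr) string into a Python object."""
    try:
        return json.loads(value)
    except (json.JSONDecodeError, TypeError):
        # Single-quoted payloads like "{'a': 1}" are invalid JSON but valid
        # Python literals, so ast.literal_eval() handles them.
        return ast.literal_eval(value)

print(fromjson('{"a": 1}'))   # {'a': 1}
print(fromjson("{'a': 1}"))   # {'a': 1}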

Some observations:

  • The difference in runtime between Python 3.8 and 3.12 is drastic: Python 3.8 runs in one third the time, with two thirds the memory, during earthmover -t.
  • We can remove the mandatory calls to the pyarrow backend, since this is turned on by default when available and raises an error in 3.8 otherwise.
  • I have yet to run this on a larger dataset where these performance impacts would be more noteworthy. Please try running it on one and let me know whether we're only seeing poorer performance on smaller datasets (given the overhead when serializing).

Please let me know your thoughts.

@tomreitz (Collaborator) commented:

Thanks for this, @jayckaiser - exciting work. I'll dig into it, test, etc. probably the week after next.

In the meantime, I want to ask/clarify two things:

The difference in runtime between Python 3.8 and 3.12 is drastic: Python 3.8 runs in one third the time, with two thirds the memory, during earthmover -t.

I'm confused about which is faster... are you saying that with your PyArrow changes/optimizations, the runtime is slower on 3.12? (This seems counterintuitive to me.) Or is it the other way around, and it's faster on 3.12?

That second change causes breakages when using nested JSON data (since we are no longer using generic object datatypes). [and the two bullet points below this]

Suppose earthmover reads in a JSONL file containing a line/row/payload like (un-linearized)

{
  "field": {
    "some": {
      "deeply": {
        "nested": {
          "property": "value"
        }
      }
    }
  }
}

Are you saying that

  • previously field would be passed through dataframes as an object column, and hence could be referenced in a .jsont template with {{ field.some.deeply.nested.property }}
  • with these changes, the field would be passed through dataframes as a (PyArrow) string column, and thus a .jsont template would have to do {% set field_object = fromjson(field) %}{{field_object.some.deeply.nested.property}} (or similar; see the sketch below)

(I'd really like to avoid changes to earthmover that would require changes to projects' earthmover.yml and/or *.jsont, so I'm hoping I'm misunderstanding here.)
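
For concreteness, a minimal sketch of that second scenario (illustrative only: a bare Jinja2 environment with fromjson registered as a global, not earthmover's actual template wiring):

import json
from jinja2 import Environment

env = Environment()
env.globals["fromjson"] = json.loads  # stand-in for earthmover's fromjson macro

# The field arrives as a (PyArrow) string rather than an object/dict column,
# so the template must parse it before drilling into nested keys.
template = env.from_string(
    "{% set field_object = fromjson(field) %}"
    "{{ field_object.some.deeply.nested.property }}"
)

row = {"field": '{"some": {"deeply": {"nested": {"property": "value"}}}}'}
print(template.render(**row))  # value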

Base automatically changed from feature/python-312 to main October 16, 2024 21:48
@tomreitz (Collaborator) commented:

@jayckaiser I ran example_projects/01_simple/big_earthmover.yaml. With Python 3.10:

$ python3 -V
Python 3.10.12
$ /usr/bin/time -v earthmover run -c big_earthmover.yaml 
2024-10-18 10:38:33.953 earthmover INFO starting...
2024-10-18 10:38:34.024 earthmover INFO skipping hashing and run-logging (no `state_file` defined in config)
2024-10-18 11:59:51.878 earthmover INFO done!
        User time (seconds): 3684.89
        System time (seconds): 81.00
        Percent of CPU this job got: 77%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21:19
        Maximum resident set size (kbytes): 1347476
        ...

(so 1hr 21min, max 1.3GB memory used - this was with a 3.2GB input TSV file, producing a 28GB JSONL file)

With Python 3.12:

$ python3 -V
Python 3.12.5
$ /usr/bin/time -v earthmover run -c big_earthmover.yaml 
2024-10-18 12:22:34.749 earthmover INFO starting...
2024-10-18 12:22:34.813 earthmover INFO skipping hashing and run-logging (no `state_file` defined in config)
2024-10-18 14:16:41.572 earthmover INFO done!
        User time (seconds): 5460.58
        System time (seconds): 127.11
        Percent of CPU this job got: 81%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:09
        Maximum resident set size (kbytes): 1722360
        ...

So 1hr 54min (about 40% longer), max 1.7GB memory used (about 28% more). This confirms your result on a large dataset: slower and less memory-efficient under Python 3.12 with PyArrow strings 😢.
