Feature/python 312 pyarrow strings #129

Draft · wants to merge 5 commits into main
Conversation

@jayckaiser (Collaborator) commented Sep 27, 2024

This is a research branch trying to resolve the fact that PyArrow strings are not being used under Python 3.12. There are two main changes:

  • Force the datatype in read_csv() from str (i.e., Python object strings) to "string" (i.e., the default pandas string dtype); see the sketch below.
  • Remove forced string-deconversion from FileSource.execute().
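
As a rough illustration of the first change (a minimal sketch, not the actual FileSource code), the dtype passed to pandas shifts from Python object strings to the dedicated string dtype:

import io
import pandas as pd

csv_data = io.StringIO("id,name\n1,alice\n2,bob\n")

# Before: dtype=str reads every column as Python str objects in a generic object column.
df_old = pd.read_csv(csv_data, dtype=str)
print(df_old.dtypes)    # id: object, name: object

csv_data.seek(0)

# After: dtype="string" requests pandas' dedicated string dtype (StringDtype),
# which can be backed by PyArrow depending on the configured string storage.
df_new = pd.read_csv(csv_data, dtype="string")
print(df_new.dtypes)    # id: string, name: string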

That second change causes breakages when using nested JSON data (since we are no longer using generic object datatypes). The following fixes are required:

  • Overload the fromjson() Jinja macro to use ast.literal_eval() when the JSON uses single quotes (see the sketch below).
  • Require fromjson() to be applied in Jinja templates in YAML when retrieving nested fields.
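
A minimal sketch of what such an overloaded fromjson() helper might look like (the actual earthmover macro may differ): try strict JSON parsing first and fall back to ast.literal_eval() for single-quoted, Python-repr-style payloads.

import ast
import json

def fromjson(value):
    """Parse a JSON (or dict-repr) string into a Python object."""
    try:
        return json.loads(value)
    except (json.JSONDecodeError, TypeError):
        # Single-quoted payloads like "{'a': 1}" are invalid JSON but valid
        # Python literals, so ast.literal_eval() handles them.
        return ast.literal_eval(value)

print(fromjson('{"a": 1}'))   # {'a': 1}
print(fromjson("{'a': 1}"))   # {'a': 1}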

Some observations:

  • The difference in runtime between Python 3.8 and 3.12 is drastic: Python 3.8 runs in one third the time, with two thirds the memory, during earthmover -t.
  • We can remove the mandatory calls to the pyarrow backend, since this is turned on by default when available and raises an error in 3.8 otherwise.
  • I have yet to run this on a larger dataset where these performance impacts would be more noteworthy. Please try running it on one and let me know whether we're only seeing poorer performance on smaller datasets (given the overhead when serializing).

Please let me know your thoughts.

@tomreitz (Collaborator) commented:

Thanks for this, @jayckaiser - exciting work. I'll dig into it, test, etc. probably the week after next.

In the meantime, I want to ask/clarify two things:

The difference in runtime between Python 3.8 and 3.12 is drastic: Python 3.8 runs in one third the time, with two thirds the memory, during earthmover -t.

I'm confused about which is faster... are you saying that with your PyArrow changes/optimizations, the runtime is slower on 3.12? (This seems counterintuitive to me.) Or is it the other way around, and it's faster on 3.12?

That second change causes breakages when using nested JSON data (since we are no longer using generic object datatypes). [and the two bullet points below this]

Suppose earthmover reads in a JSONL file containing a line/row/payload like (un-linearized)

{
  "field": {
    "some": {
      "deeply": {
        "nested": {
          "property": "value"
        }
      }
    }
  }
}

Are you saying that

  • previously field would be passed through dataframes as an object column, and hence could be referenced in a .jsont template with {{ field.some.deeply.nested.property }}
  • with these changes, the field would be passed through dataframes as a (PyArrow) string column, and thus a .jsont template would have to do {% set field_object = fromjson(field) %}{{field_object.some.deeply.nested.property}} (or similar; see the sketch below)

(I'd really like to avoid changes to earthmover that would require changes to projects' earthmover.yml and/or *.jsont, so I'm hoping I'm misunderstanding here.)
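
For concreteness, a minimal sketch of that second scenario (illustrative only: a bare Jinja2 environment with fromjson registered as a global, not earthmover's actual template wiring):

import json
from jinja2 import Environment

env = Environment()
env.globals["fromjson"] = json.loads  # stand-in for earthmover's fromjson macro

# The field arrives as a (PyArrow) string rather than an object/dict column,
# so the template must parse it before drilling into nested keys.
template = env.from_string(
    "{% set field_object = fromjson(field) %}"
    "{{ field_object.some.deeply.nested.property }}"
)

row = {"field": '{"some": {"deeply": {"nested": {"property": "value"}}}}'}
print(template.render(**row))  # value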

Base automatically changed from feature/python-312 to main October 16, 2024 21:48
@tomreitz (Collaborator) commented:

@jayckaiser I ran example_projects/01_simple/big_earthmover.yaml. With Python 3.10:

$ python3 -V
Python 3.10.12
$ /usr/bin/time -v earthmover run -c big_earthmover.yaml 
2024-10-18 10:38:33.953 earthmover INFO starting...
2024-10-18 10:38:34.024 earthmover INFO skipping hashing and run-logging (no `state_file` defined in config)
2024-10-18 11:59:51.878 earthmover INFO done!
        User time (seconds): 3684.89
        System time (seconds): 81.00
        Percent of CPU this job got: 77%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21:19
        Maximum resident set size (kbytes): 1347476
        ...

(so 1hr 21min, max 1.3GB memory used - this was with a 3.2GB input TSV file, producing a 28GB JSONL file)

With Python 3.12:

$ python3 -V
Python 3.12.5
$ /usr/bin/time -v earthmover run -c big_earthmover.yaml 
2024-10-18 12:22:34.749 earthmover INFO starting...
2024-10-18 12:22:34.813 earthmover INFO skipping hashing and run-logging (no `state_file` defined in config)
2024-10-18 14:16:41.572 earthmover INFO done!
        User time (seconds): 5460.58
        System time (seconds): 127.11
        Percent of CPU this job got: 81%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:09
        Maximum resident set size (kbytes): 1722360
        ...

So 1hr 54min (about 40% longer), max 1.7GB memory used (about 28% more). This confirms your result on a large dataset: slower and less memory-efficient under Python 3.12 with PyArrow strings 😢.
