Feature/python 312 pyarrow strings #129
Thanks for this, @jayckaiser - exciting work. I'll dig into it, test, etc. probably the week after next. In the meantime, I want to ask/clarify two things:
I'm confused about which is faster... are you saying that with your PyArrow changes/optimizations, the runtime is slower on 3.12? (This seems counterintuitive to me.) Or is it the other way around, and it's faster on 3.12?
Suppose earthmover reads in a JSONL file containing a line/row/payload like (un-linearized)
{
  "field": {
    "some": {
      "deeply": {
        "nested": {
          "property": "value"
        }
      }
    }
  }
}
Are you saying that
(I'd really like to avoid changes to earthmover that would require changes to projects'
@jayckaiser I ran
$ python3 -V
Python 3.10.12
$ /usr/bin/time -v earthmover run -c big_earthmover.yaml
2024-10-18 10:38:33.953 earthmover INFO starting...
2024-10-18 10:38:34.024 earthmover INFO skipping hashing and run-logging (no `state_file` defined in config)
2024-10-18 11:59:51.878 earthmover INFO done!
User time (seconds): 3684.89
System time (seconds): 81.00
Percent of CPU this job got: 77%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21:19
Maximum resident set size (kbytes): 1347476
...
(so 1hr 21min, max 1.3GB memory used; this was with a 3.2GB input TSV file, producing a 28GB JSONL file)
With Python 3.12:
$ python3 -V
Python 3.12.5
$ /usr/bin/time -v earthmover run -c big_earthmover.yaml
2024-10-18 12:22:34.749 earthmover INFO starting...
2024-10-18 12:22:34.813 earthmover INFO skipping hashing and run-logging (no `state_file` defined in config)
2024-10-18 14:16:41.572 earthmover INFO done!
User time (seconds): 5460.58
System time (seconds): 127.11
Percent of CPU this job got: 81%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:09
Maximum resident set size (kbytes): 1722360
...
so 1hr 54min (about 40% longer), max 1.7GB memory used (about 28% more). This confirms your result on a large dataset: slower and less memory efficient under Python 3.12 with PyArrow strings 😢.
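For anyone who wants to redo the comparison, here is a small sketch that derives the relative differences directly from the /usr/bin/time figures quoted above. The helper names (hms_to_seconds, percent_increase) are made up for this comment and are not part of earthmover.

```python
def hms_to_seconds(text: str) -> int:
    """Convert a whole-second 'h:mm:ss' or 'm:ss' string to seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def percent_increase(before: float, after: float) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Values copied from the two runs above.
wall_310 = hms_to_seconds("1:21:19")   # Python 3.10.12
wall_312 = hms_to_seconds("1:54:09")   # Python 3.12.5
rss_310_kb = 1347476
rss_312_kb = 1722360

wall_increase = percent_increase(wall_310, wall_312)
rss_increase = percent_increase(rss_310_kb, rss_312_kb)

print(f"wall clock: {wall_310}s -> {wall_312}s (+{wall_increase:.0f}%)")
print(f"max RSS: {rss_310_kb} kB -> {rss_312_kb} kB (+{rss_increase:.0f}%)")
```

Both percentages are measured against the Python 3.10 run as the baseline.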
This is a research branch trying to resolve the lack of PyArrow strings being used in Python 3.12. There are a few main changes:
- read_csv() from str (i.e., Python strings) to "string" (i.e., default string type).
- FileSource.execute().

That second change causes breakages when using nested JSON data (since we are no longer using generic object datatypes). The following fixes are required:
- fromjson() Jinja macro to use ast.literal_eval() when JSON has single-quotes.
- fromjson() be applied to Jinja templating in YAML when retrieving nested fields.

Some observations:
- earthmover -t.
- pyarrow backend, since this is turned on by default when available and raises an error in 3.8 otherwise.

Please let me know your thoughts.
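To illustrate the single-quote problem described above: once a nested payload has been stringified with Python's repr rather than serialized as JSON, json.loads() rejects it, while ast.literal_eval() can still read it. The sketch below shows that fallback in isolation. The function name parse_nested is hypothetical, and this is not the actual fromjson() macro from this branch, only an approximation of the idea.

```python
import ast
import json


def parse_nested(value):
    """Parse a string holding nested data; pass non-strings through unchanged."""
    if not isinstance(value, str):
        return value
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        # Python-repr text such as "{'a': 1}" is not valid JSON.
        # ast.literal_eval only evaluates literals, never arbitrary code.
        return ast.literal_eval(value)


double_quoted = '{"field": {"some": {"deeply": {"nested": {"property": "value"}}}}}'
# str() of a dict yields the single-quoted form that breaks json.loads().
single_quoted = str(json.loads(double_quoted))

assert parse_nested(double_quoted) == parse_nested(single_quoted)
print(parse_nested(single_quoted)["field"]["some"]["deeply"]["nested"]["property"])
```

One caveat with this approach: ast.literal_eval() does not understand the JSON literals true, false, and null, so a string that is invalid JSON and also contains those would still raise an error.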