Redesign load_po to use Lex/Yacc state transitions style, to ease #1199
Conversation
…ntenance and also speed up the parsing using a multiprocessing option, including debugging, aborting on invalid input in multiprocessing as well as in linear processing mode, plus a custom hook into the system's exception handling to not show the frame stack traces.
…olean values, allowing combinations from true_set = {"true", "1", "yes", "y", "t", "on"} (note: upper case and mixed case are allowed), including 1, 0, True, False.
…ction name in the messages, using inspect to work out the name of the function where DEBUG_LOG was called from.
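A rough sketch of the two ideas described in these commits, assuming names like `DEBUG_LOG`, `true_set`, and the `PO_DEBUG` switch (the PR's actual implementation may differ):

```python
import inspect
import os

# Truthy spellings accepted for the debug switch, per the commit message above.
true_set = {"true", "1", "yes", "y", "t", "on"}

def parse_bool(value) -> bool:
    """Accept 1/0, True/False, or any case variant of the strings in true_set."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in true_set

DEBUG = parse_bool(os.environ.get("PO_DEBUG", "0"))  # hypothetical switch name

def DEBUG_LOG(msg: str) -> None:
    """Print a debug message prefixed with the calling function's name."""
    if not DEBUG:
        return
    caller = inspect.stack()[1].function  # name of the function that called us
    print(f"[{caller}] {msg}")
```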
Oh... At first sight, this looks like a whole lot more complex than the parser we have right now. Could you please explain why this would be necessary, to begin with? (What prompted you to start writing this?) What are the improvements over the current implementation? PRs are of course in general welcome, but a 1300-line PR out of the blue is a bit daunting! 😅
I tried this locally using https://projects.blender.org/blender/blender-manual-translations/raw/branch/main/fi/LC_MESSAGES/blender_manual.po as the .po file to parse.
After making some changes (e.g. imports from a `local_logging` package I do not have, or a dependency on `sphinx_intl` (?)), this benchmarks to be about 3x slower than the current implementation on `master` on my machine, and prints out a whole lot of "Invalid continuation line for state 128" errors.
With `multiprocessing=True` passed to `load_po`, the resulting catalog is empty.
Secondly, I see there's a lot of reliance on shared global state, which is generally not a great idea – in particular, this couldn't necessarily be safely run in e.g. a web server context (if someone needed to e.g. extract messages from a PO file for, say, a translation service).
All in all, I might suggest that if you need this for some other project, you could maintain it separately – and on the other hand, now that I look at our PO file parsing code, there are things that we could optimize even without overhauling the entire thing.
@akx FYI, there is discussion in sphinx-doc/sphinx-intl#118. I'm equally confused about the motivation behind this series of PRs.
Thank you all for your comments and findings. On my machine, this is the code I'm using to load and write a PO file, taken from here -- https://translate.blender.org/projects/blender-ui/ui/vi/ -- click Files -> Download translation. The file is about 7.6MB (the version on my machine). The code is this:
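(The original snippet isn't preserved in this thread; a minimal reconstruction of such a load-and-write test using Babel's public pofile API might look like this, with a placeholder file path:)

```python
import time
from babel.messages.pofile import read_po, write_po

PO_PATH = "blender-ui-vi.po"  # placeholder path for the downloaded file

start = time.perf_counter()
with open(PO_PATH, "rb") as f:
    catalog = read_po(f)
print(f"read_po: {time.perf_counter() - start:.2f}s, {len(catalog)} messages")

start = time.perf_counter()
with open("roundtrip.po", "wb") as f:
    write_po(f, catalog)
print(f"write_po: {time.perf_counter() - start:.2f}s")
```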
I use my local pyenv and swap the old and the new in and out when testing. These are the results I got today: the top one, 4.68 seconds, is the current load_po(); I then swapped out the old env for the new env, reran the same test, and the code took only 3.09 seconds. This is without running the loading in multiprocessing mode. With multiprocessing mode on, I got 2.26 seconds for both tasks, read and write.
I am using Python 3.9.6 and running on macOS Monterey 12.7.6. I have made some more corrections and added a few more features to this file over the last few days. If you want, I can update this new code. I don't know why you have problems running it; when I ran it on my machine, it worked fine. What sparked all this was the fact that I'm writing a little server to load several large PO files, to use them as a kind of dictionary when doing translations. The server runs on FastAPI and the client is the VS Code community version. In doing so, I can write a little extension for VS Code and then make HTTP calls to the server to automate tasks, like checking whether a word is in the dictionary, or finding where a highlighted keyword appears across several thousand source code files, using a single keystroke. After using this model for a while, nearly two years, I just got worked up about the loading speed, and that pushed me to rewrite load_po(). I do sometimes load PO files individually and could benefit from using multiprocessing mode. When loading many PO files, I use a multiprocessing pool to load them concurrently, with each file loaded linearly, which proves to be much faster. However, when loading a single file, multiprocessing benefits me much more. I compared the written copy from the parser, and in this diff image from VS Code it shows no difference whatsoever:
I agree the code appears more long-winded, but I wanted to write a version that is maintainable and readable; plus you can just turn on debug=1 and see what the code is doing and what state it's parsing. I had made a much faster, more cryptic version, but I decided not to pursue that version further as it's not as maintainable as this one, though this one is a little slower after introducing the enumerations.
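(For context, a toy sketch of the kind of dictionary-lookup server described above: FastAPI serving lookups from a catalog parsed with Babel. The endpoint name and file path are illustrative, not from the PR.)

```python
from babel.messages.pofile import read_po
from fastapi import FastAPI

app = FastAPI()

# Load the PO file once at startup and build an in-memory msgid -> msgstr map.
with open("blender-ui-vi.po", "rb") as f:  # placeholder path
    catalog = read_po(f)
dictionary = {msg.id: msg.string for msg in catalog if isinstance(msg.id, str) and msg.id}

@app.get("/lookup")
def lookup(word: str):
    """Return the translation for `word`, or report that it is unknown."""
    translation = dictionary.get(word)
    return {"word": word, "found": translation is not None, "translation": translation}
```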
…rds, plus adding handling of previous records with the '#|' prefix. This one, when printing out, is still treated as use_comments, as the current code doesn't match the specification yet.
Can you please tell me which OS you are testing on, so I can see if I can replicate your testing environment and find out exactly what the problem is. Please test with the latest source code I just updated and let me know anything else I need to check for. Thank you so much. I know it's about collaboration and not individuality here. Thank you for your assistance. I am just a translator (a retired programmer), and I bet if I'm struggling with the speed, many other translators feel the same.
…corrected continuation line handling for obsolete lines, which also start with a "#~".
…ate the scope, plus fixing the obsolete line ValueError exception.
OK, I have corrected some more. Using your test file (blender_manual.po) above, the resulting runs are these:
The speed of loading: Linear Mode
Multiprocessing mode
However, there is one issue I don't yet have an idea how to solve correctly: a problem in the grammar of the header. The key:
has been split into two lines by the 'normalize' routine, I think. Ideally, it must be joined with the previous line to make sense. With the current model of line-by-line processing, I haven't yet figured out how to solve this while still maintaining the correct grammar structure of the header. The common intuition is to join with the previous line when an error occurs, assuming it's one part that has been broken up by the 'normalize' routine, but to me, only with human eyes could we work this out clearly, whereas with a computer we're working on assumptions, and that's dangerous. That's why I need to think it over; if you have an idea how to solve this effectively while still maintaining the correct logic, please do share. I had to move that part below, "manual...", to the line above it and remove two double quotes before running the tests.
There is another problem with the comment lines when denormalising, when the line width is changed to a huge value, like 4096. Something is wrong somewhere; I will take a look at it tomorrow with the debugger.
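(For illustration, since the actual header key isn't quoted above: Babel's `normalize` helper in `babel.messages.pofile` is what wraps a long header value across continuation lines, so a line-by-line parser then sees the tail as a bare continuation line with no context. The header value below is made up.)

```python
from babel.messages.pofile import normalize

# A made-up long header value; the real key from the thread isn't shown here.
header_line = (
    "Project-Id-Version: Blender Reference Manual translations "
    "for the user manual of the Blender project"
)
# With the default width of 76 the value is split into quoted continuation
# lines, which is the kind of split described above.
print(normalize(header_line, width=76))
```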
Python 3.13.2, MacBook Pro (M2 Max). The code I'm trying with is
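(The script itself wasn't preserved when this thread was exported; the gist of it, as a sketch, is below. The `load_po` import location and call signature are assumptions based on this PR, and the file path is a placeholder.)

```python
import time
from babel.messages.pofile import read_po
# The new parser from this PR; the exact module path and signature are assumed.
from babel.messages.po_reader import load_po

PO_PATH = "blender_manual.po"  # placeholder

def count_messages(catalog):
    # Count by iterating so it works for any catalog-like return value.
    return sum(1 for _ in catalog)

start = time.perf_counter()
with open(PO_PATH, "rb") as f:
    old_catalog = read_po(f)
old_time = time.perf_counter() - start

start = time.perf_counter()
new_catalog = load_po(PO_PATH)  # assumed to return a Catalog-like object
new_time = time.perf_counter() - start

assert count_messages(old_catalog) == count_messages(new_catalog)
print(f"read_po: {old_time:.2f}s, load_po: {new_time:.2f}s, "
      f"{count_messages(old_catalog)} messages")
```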
The assert and print are there so I can verify that both methods result in the same number of messages. With 3767daa, running
Using the previous version, 4e1ee45, with the
Using
Sounds like your translation memory app could benefit from simply not loading PO files as much – as you've noticed, parsing text formats tends to be slow. I assume those .po files won't change very much for your app's use, so you could just store more optimized formats of them (and recreate the optimized format when the original has changed, naturally). For instance, https://gist.github.com/akx/01a75f13324b7c36ad22c2b96c05df51 outputs timings on my machine suggesting that loading the catalog from a pickle file is 6 times faster than parsing the PO file. (Running that gist does require a small patch to Babel, since our 2007-vintage FixedOffsetTimezone objects don't pickle well – I'll make a PR to get rid of them.) Even better, since your app sounds like it's basically doing database lookups, have you considered loading the catalog contents into a database? You could then do lookups with a simple query.
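(In the spirit of the gist linked above, not its exact contents, caching the parsed catalog as a pickle and reloading it might look roughly like this; note the caveat above about FixedOffsetTimezone objects in older Babel versions not pickling well.)

```python
import pickle
import time
from pathlib import Path
from babel.messages.pofile import read_po

PO_PATH = Path("blender_manual.po")        # placeholder
CACHE_PATH = PO_PATH.with_suffix(".pickle")

def load_catalog():
    # Reuse the pickle if it is newer than the PO file, otherwise re-parse.
    if CACHE_PATH.exists() and CACHE_PATH.stat().st_mtime >= PO_PATH.stat().st_mtime:
        with CACHE_PATH.open("rb") as f:
            return pickle.load(f)
    with PO_PATH.open("rb") as f:
        catalog = read_po(f)
    with CACHE_PATH.open("wb") as f:
        pickle.dump(catalog, f)
    return catalog

start = time.perf_counter()
catalog = load_catalog()
print(f"loaded {len(catalog)} messages in {time.perf_counter() - start:.2f}s")
```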
Thank you for the test code. I'm learning new ways of doing things from you. Thank you again.
This is impossible; I am using a rather old machine, a 2010 Mac Pro 5,1 with an NVMe drive, and I bet many people out there are still stuck with an old machine. It appears to me either there are some tricks or some buffering going on somewhere. I just use the @benchmark code like this:
and put @benchmark on the function being timed.
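(The @benchmark decorator itself isn't shown in the thread; a typical implementation of such a timing decorator might be:)

```python
import functools
import time

def benchmark(func):
    """Print how long each call to the decorated function takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"{func.__name__} took {time.perf_counter() - start:.3f}s")
    return wrapper

@benchmark
def run():
    ...  # the load/write code being timed goes here
```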
No, they change frequently; that's why I do not put them in Redis. But your suggestion made me think about this again. I might try the Redis DB and see if the timing and ease of use are improved. Thank you again.
How frequently? Seconds? Minutes? :) Redis might not be a great store here, since it's in-memory and quite simple. I would suggest SQLite, PostgreSQL, or even DuckDB these days.
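(Following the database suggestion, here is a small sketch of pushing the catalog into SQLite and looking words up with a query; the table, column, and file names are made up.)

```python
import sqlite3
from babel.messages.pofile import read_po

# Build (or rebuild) the lookup table from a PO file.
with open("blender-ui-vi.po", "rb") as f:  # placeholder path
    catalog = read_po(f)

con = sqlite3.connect("translations.db")
con.execute("CREATE TABLE IF NOT EXISTS messages (msgid TEXT PRIMARY KEY, msgstr TEXT)")
con.executemany(
    "INSERT OR REPLACE INTO messages (msgid, msgstr) VALUES (?, ?)",
    ((msg.id, msg.string) for msg in catalog if isinstance(msg.id, str) and msg.id),
)
con.commit()

# Look a word up without re-parsing the PO file.
row = con.execute(
    "SELECT msgstr FROM messages WHERE msgid = ?", ("Render",)
).fetchone()
print(row[0] if row else "not found")
```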
I don't know what to tell you – with a script like this that adapts your
I get results like
Yes, the removal of the first part:
from the
in order just to get the charset from the file, in
has improved the running speed, and my code is no longer faster, but I'm glad you're aware of the shortcoming of the old code. I hope this fix will benefit all users out there. You only needed this:
for the
Another suggestion is to remove the TextWrapper on the comment lines, especially the header strings, to avoid ambiguities. Doing it for msgid and msgstr is OK. The result I got yesterday, like this:
is really annoying, whereas the original, like this:
is much cleaner.
I would say in minutes, so maybe I will take a look at these DBs. I cannot forget this experience: when I was working, my boss asked me to test the company's code with a DB and with flat files. I ran 10,000 samples over the code to test both approaches, and the DB proved to be 7 times slower than flat files. This probably affected me, and I have tried to avoid using DBs since.
I think simplifying this code:
in
of file
and changing this
to this only:
in the same
routine works. I now get this back:
(for Vietnamese)
I think this can be closed now.
maintenance and also speed up the parsing using a multiprocessing option, including debugging, aborting on invalid input in multiprocessing as well as in linear processing mode, plus a custom hook into the system's exception handling to not show the frame stack traces. Note the number of batches is computed as: number of batches = (CPU cores - 2) / batch division reduction. Loading a 10.8MB PO file takes only about 1.7 seconds in multiprocessing mode:
Using 11 batch(es) (cpu_count: 24)
number of batches: 11, batch_size: 3222
1.623452367
Execution of run took 0:00:01.62.
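(A sketch of the batch arithmetic described above and of farming batches out to a process pool. The "batch division reduction" value and the splitting of the file on blank lines are assumptions, since the PR's actual batching code isn't shown here; a reduction of 2 matches the 24-core, 11-batch output above.)

```python
import multiprocessing
import os

BATCH_DIVISION_REDUCTION = 2  # assumed value; (24 - 2) / 2 = 11 batches as above

def compute_batch_count() -> int:
    # number of batches = (CPU cores - 2) / batch division reduction
    cores = os.cpu_count() or 2
    return max(1, (cores - 2) // BATCH_DIVISION_REDUCTION)

def parse_batch(entries):
    # Placeholder for the real per-batch parsing; here we just count entries.
    return len(entries)

def load_in_batches(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        # PO entries are separated by blank lines (a simplification).
        entries = [block for block in f.read().split("\n\n") if block.strip()]
    n_batches = compute_batch_count()
    batch_size = -(-len(entries) // n_batches)  # ceiling division
    batches = [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]
    with multiprocessing.Pool(n_batches) as pool:
        results = pool.map(parse_batch, batches)
    return sum(results)

if __name__ == "__main__":
    print(load_in_batches("blender-ui-vi.po"))  # placeholder path
```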