Faster loading of Anthology class #835
**Caching**

I've tried adding serialization functionality to the Anthology class, so it could potentially be instantiated from a cached file instead of loading from the XML/YAML. This adds quite a bit of overhead that also needs to be maintained if the classes change; you can see this in commit 7604717. Results on my system are:

6–7 seconds is a lot less than 22 seconds, but not quite the performance I'd been hoping for. I've used msgpack as the data format and lz4 for fast compression; maybe there are faster options, but my timings suggest that instantiating the Python objects is the larger bottleneck here anyway.

**Lazy loading**

This could certainly be an alternative; however, I tried the caching route first, as it would also cut down on regular build times (since the Anthology object is instantiated by multiple scripts in the process). Lazy loading, on the other hand, would only help with scripts outside the build chain (that don't need to access all the info) or while developing.

The only thing that can't really be lazily instantiated is the person index.
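The caching idea above can be sketched roughly as follows. This is a minimal illustration, not the code from commit 7604717: it uses stdlib `pickle` and `zlib` as stand-ins for msgpack and lz4, and the `Anthology` fields and `to_dict`/`from_dict` methods are hypothetical. Note how every field has to appear in both conversion methods, which is exactly the maintenance hazard being discussed.

```python
import os
import pickle
import tempfile
import zlib


class Anthology:
    """Toy stand-in for the real Anthology class (fields are made up)."""

    def __init__(self, volumes=None):
        self.volumes = volumes if volumes is not None else {}

    def to_dict(self):
        # Every field must be listed here AND in from_dict(); forgetting
        # one after adding a class member silently drops data.
        return {"volumes": self.volumes}

    @classmethod
    def from_dict(cls, data):
        return cls(volumes=data["volumes"])


def save_cache(anthology, path):
    """Serialize the Anthology to a compressed cache file."""
    with open(path, "wb") as f:
        f.write(zlib.compress(pickle.dumps(anthology.to_dict())))


def load_cache(path):
    """Rebuild an Anthology from the cache, skipping XML/YAML parsing."""
    with open(path, "rb") as f:
        return Anthology.from_dict(pickle.loads(zlib.decompress(f.read())))


# Round-trip demo.
path = os.path.join(tempfile.gettempdir(), "anthology.cache")
save_cache(Anthology(volumes={"P19-1": ["paper1", "paper2"]}), path)
restored = load_cache(path)
```

Even with a fast format, the cost of rebuilding all the Python objects in `from_dict` remains, which matches the observation that object instantiation, not deserialization, is the larger bottleneck.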
Thanks for this investigation! Caching to improve the build time makes sense, but I worry about adding maintenance. It looks like you have to manually specify how to serialize and unserialize each object. Let's say I add a class member and then forget to add it to the serialization code: will the mistake be obvious?

My main interest was really speed for external use of the API, so the lazy loading is appealing. But why can't the person index be lazily loaded? I am thinking of use cases where I instantiate the Anthology without needing the person index at all.

I like the idea of lazy loading and caching the person index only, which would presumably add only a bit of complexity (versus caching everything).
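The lazy-loading idea can be sketched with `functools.cached_property`: nothing is computed until the attribute is first accessed, and the result is cached afterwards. The data layout here is invented for illustration; the point is that the person index needs one full pass over *all* papers, which is why it is the hard case for laziness, even though deferring that pass until first access is straightforward.

```python
from functools import cached_property


class Anthology:
    """Sketch of lazy loading; the papers mapping is a hypothetical stand-in
    for the parsed XML data."""

    def __init__(self, papers):
        # papers: {paper_id: [author names]}
        self.papers = papers

    @cached_property
    def people(self):
        # Built only on first access, then cached. Unlike per-paper data,
        # this cannot be built incrementally: it requires visiting every
        # paper to collect all of a person's publications.
        index = {}
        for paper_id, authors in self.papers.items():
            for name in authors:
                index.setdefault(name, []).append(paper_id)
        return index


anth = Anthology({
    "P19-1001": ["Ada Lovelace"],
    "P19-1002": ["Ada Lovelace", "Alan Turing"],
})
# No index has been built yet; use cases that never touch `people`
# pay nothing for it.
```

A caller that only reads paper metadata never triggers the expensive pass, which is the external-API use case described above.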
Some more thoughts (mainly for myself :)) and further investigation: as a first step, I now think the most promising route is to start optimizing the code we already have. Profiling the code with pyinstrument gives some interesting insights (full profiler output here). This reveals that we spend 30% of the time instantiating the objects.

Second, to be able to compute more things lazily, I'd like to refactor the code to get rid of the universal state that everything currently depends on.

By that point, it's probably best to look at some profiler graphs again to see what bottlenecks remain and how to solve them.
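The profiling step above used pyinstrument; the same kind of measurement can be sketched with the stdlib `cProfile` instead, which needs no extra dependency. The function being profiled here is a hypothetical stand-in for the expensive part of Anthology instantiation, not the project's real code.

```python
import cProfile
import io
import pstats


def build_person_index(n):
    # Hypothetical stand-in for the expensive object instantiation
    # identified by the profiler output above.
    return {f"person-{i}": [f"paper-{i}"] for i in range(n)}


profiler = cProfile.Profile()
profiler.enable()
build_person_index(50_000)
profiler.disable()

# Render a report sorted by cumulative time, analogous to reading the
# top of a pyinstrument call tree.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

The sorted report shows which calls dominate cumulative time, which is the same signal used above to conclude that object instantiation accounts for roughly 30% of the load time.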
Fantastic! Thanks for the report and the merge. |
Instantiating the Anthology class in the Python API takes 30 seconds or so, while all the files are loaded and parsed. This is inconvenient when testing. It would be nice if the expensive data structures could be built lazily, or cached in some manner.