Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use get_records_with_cache to cache to_records calls #286

Merged
merged 32 commits into from
Jul 18, 2024

Conversation

ntBre
Copy link
Contributor

@ntBre ntBre commented Jun 26, 2024

Description

This PR uses the get_records_with_cache function discussed in our last QCArchive meeting to cache to_records calls automatically. With these changes alone, users cannot actually access this behavior, but when combined with #284, it enables code like this:

from qcportal import PortalClient

from openff.qcsubmit._tests.utils.test_manager import no_internet
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.utils.utils import portal_client_manager

opt = OptimizationResultCollection.parse_file("tiny-opt-dataset.json")
client = PortalClient("https://api.qcarchive.molssi.org:443", cache_dir=".")

with portal_client_manager(lambda _: client):
    print(len(opt.to_records()))
    with no_internet():
        print(len(opt.to_records()))

For this PR, I applied the changes separately from #284 to the main branch, but I have used this code to test the combined changes locally, and it works! I have not run into MolSSI/QCFractal#844 here yet, so hopefully that is less common in real use cases.

Todos

  • Enable caching for expensive to_records calls
  • Probably wait for Enable use of a custom PortalClient #284. This isn't really useful on its own, but I was excited to get a proof of concept working, and I think both changes are actually needed for either one to help my valence fits

Status

  • Ready to go

@ntBre
Copy link
Contributor Author

ntBre commented Jun 27, 2024

After thinking more about this, possibly a better solution would be for us (or actually a user) to subclass PortalClient and override the various get_{singlepoints,optimizations,torsiondrives} methods with cached versions. That would avoid having to add the record_cache field to the standard PortalClient, which felt weird to me, and revert all of these changes because they would be handled by providing this new CachedPortalClient. I'll try this approach tomorrow.

@ntBre
Copy link
Contributor Author

ntBre commented Jun 27, 2024

Here's an example of the PortalClient subclass. Unfortunately, I also found out that the no_internet trick is not infallible. I think the fact that the PortalClient opens a requests.Session and holds onto it means that it can access the internet without going back through socket.socket. However, I monitored internet traffic with Wireshark and made sure that this code does not access the internet on the second to_records call, which is also verified by setting _req_session to None.

import os
import shutil

from qcportal import PortalClient
from qcportal.cache import RecordCache, get_records_with_cache
from qcportal.optimization import OptimizationRecord

from openff.qcsubmit._tests.utils.test_manager import no_internet
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.utils.utils import portal_client_manager


class CachedPortalClient(PortalClient):
    def __init__(self, addr, cache_dir, **client_kwargs):
        super().__init__(addr, cache_dir=cache_dir, **client_kwargs)
        self.record_cache = RecordCache(
            os.path.join(self.cache.cache_dir, "cache.sqlite"), read_only=False
        )

    def get_optimizations(self, record_ids, missing_ok=False, include=None):
        return get_records_with_cache(
            self,
            self.record_cache,
            OptimizationRecord,
            record_ids,
            include=["initial_molecule", "final_molecule"],
        )


if os.path.exists("api.qcarchive.molssi.org_443"):
    shutil.rmtree("api.qcarchive.molssi.org_443")

opt = OptimizationResultCollection.parse_file("tiny-opt-dataset.json")
client = CachedPortalClient("https://api.qcarchive.molssi.org:443", cache_dir=".")

with portal_client_manager(lambda _: client):
    client._req_session = None
    with no_internet():
        print(len(opt.to_records()))

This code runs on the branch for #284 without any changes to qcsubmit. The implementation of CachedPortalClient basically needs to know which methods we call on PortalClient internally, though, so I think it might make sense for us to provide this class even though a user could define it.

@ntBre
Copy link
Contributor Author

ntBre commented Jul 10, 2024

I've updated the implementation to use the CachedPortalClient from the code block above. We discussed making the class private, but I left it public for now because I think it might be useful for anyone wanting to use additional PortalClient kwargs with the portal_client_manager. I'm happy to make it private if you prefer, though. That's a pretty advanced use case and maybe not one we want to support in the public API.

I have not added any tests specifically for the caching yet, but I think it's a great first sign that the tests pass after replacing the default PortalClient with this cached version. I also tried intentionally introducing errors in each of the new methods to make sure they were actually being called, so they are all covered by existing tests at least.

One issue I've run into with testing this is that each test uses the same default cache_dir. I tried adding a pytest fixture to delete the cache dir before each test ran, but this caused disastrous issues with pytest-xdist running multiple tests in parallel (one would be trying to write to the cache while the next test tried deleting the dir). I think this means that fully testing the caches will be more involved than I hoped. It also means that I currently have to delete the cache dir locally between each test run, or the tests just use the cache. That won't be an issue in CI at least. My best idea for fixing this currently is to modify the tests using PortalClients more invasively to wrap the code in portal_client_managers with temporary cache dirs so that each test can run fully independently and clean itself up afterward.

@ntBre
Copy link
Contributor Author

ntBre commented Jul 11, 2024

The tests should now be passing with temporary cache directories everywhere CachedPortalClient is used. In my local tests I was seeing ignored instances of the error from MolSSI/QCFractal#844, but it sounds like that has been fixed in the next version (0.56) of qcportal.

Now, I think I just need to update these tests to check that the caching actually worked.

@ntBre ntBre marked this pull request as ready for review July 12, 2024 19:29
@ntBre
Copy link
Contributor Author

ntBre commented Jul 12, 2024

The tests have been updated, and I copied over (and modified) documentation from qcportal. It should be ready for review!

@ntBre ntBre requested a review from j-wags July 12, 2024 19:31
Copy link
Member

@j-wags j-wags left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ntBre. We reviewed this during our check in this week and I summarized our discussion in the comments. This is good to merge once the (blocking) items are resolved and the releasenotes are updated!

returns a list of records.
"""
if missing_ok:
logger.warning("missing_ok provided but unused by CachedPortalClient")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not blocking) Maybe something more like "missing_ok was set to True, but CachedPortalClient doesn't actually support this so it's being set to False"



def _default_portal_client(client_address) -> PortalClient:
return PortalClient(client_address)
return CachedPortalClient(client_address, cache_dir=_DEFAULT_CACHE_DIR)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(blocking) Given that this may not actually be thread-safe, let's make caching not be the default. (or implement a test using multiprocessing or something to ensure that two different processes with different clients but the same cache dir don't trip each other up).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on this. This code fails with a database is locked error:

from multiprocessing import Pool

from qcportal import PortalClient

from openff.qcsubmit.results import TorsionDriveResultCollection
from openff.qcsubmit.utils import CachedPortalClient, portal_client_manager

ds = TorsionDriveResultCollection.parse_file(
    "/home/brent/omsf/projects/valence-fitting/02_curate-data/datasets/supp-td.json"
)
datasets = [ds, ds, ds, ds]
client = CachedPortalClient("https://api.qcarchive.molssi.org:443", cache_dir=".")
with portal_client_manager(lambda _: client), Pool(4) as pool:
    for res in pool.imap_unordered(
        TorsionDriveResultCollection.to_records, datasets, 1
    ):
        print(len(res))

It also fails without the portal_client_manager and CachedPortalClient as the default client, so it's not safe to share the same client across threads or to construct multiple clients accessing the same database.

It actually doesn't quite work most of the time, even with a regular PortalClient with or without a cache dir, but some of the processes have read timeout errors rather than the more dramatic database errors from the cached version. It's probably best in general not to share a PortalClient across threads.

Using with portal_client_manager(PortalClient), Pool(4) as pool: (removing the lambda that shares the same client), as your suggestion will restore the default to, has worked successfully on every run I've tried so far, so I think that's definitely the right default.

Do you think it's worth mentioning something about this in the documentation for portal_client_manager? The docs currently show an example of a function that returns a new PortalClient each time (which should be safe) rather than a lambda _: client. I'm not really sure how to warn against this without showing the lambda code in the docs, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update and for checking on multithreading. I've read this a few times and think I kinda understand, and my preference is to do the simplest safe thing. So I agree with reverting the default behavior. And updating the docs is optional, you might just copy your comment "It's not safe to share the same client across threads or to construct multiple clients accessing the same database" into the portal_client_manager docstring and call it a day.

openff/qcsubmit/utils/utils.py Show resolved Hide resolved
openff/qcsubmit/utils/utils.py Outdated Show resolved Hide resolved
openff/qcsubmit/utils/utils.py Outdated Show resolved Hide resolved
openff/qcsubmit/utils/utils.py Outdated Show resolved Hide resolved
docs/api.rst Outdated Show resolved Hide resolved
openff/qcsubmit/utils/utils.py Show resolved Hide resolved
docs/api.rst Show resolved Hide resolved
@ntBre ntBre merged commit ccedbd4 into main Jul 18, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants