Use `get_records_with_cache` to cache `to_records` calls #286

ntBre · 2024-06-26T21:24:55Z

Description

This PR uses the get_records_with_cache function discussed in our last QCArchive meeting to cache to_records calls automatically. With these changes alone, users cannot actually access this behavior, but when combined with #284, it enables code like this:

from qcportal import PortalClient

from openff.qcsubmit._tests.utils.test_manager import no_internet
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.utils.utils import portal_client_manager

opt = OptimizationResultCollection.parse_file("tiny-opt-dataset.json")
client = PortalClient("https://api.qcarchive.molssi.org:443", cache_dir=".")

with portal_client_manager(lambda _: client):
    print(len(opt.to_records()))
    with no_internet():
        print(len(opt.to_records()))

For this PR, I applied the changes separately from #284 to the main branch, but I have used this code to test the combined changes locally, and it works! I have not run into MolSSI/QCFractal#844 here yet, so hopefully that is less common in real use cases.

Todos

Enable caching for expensive to_records calls
Probably wait for #284 (Enable use of a custom PortalClient). This isn't really useful on its own, but I was excited to get a proof of concept working, and I think both changes are needed before either one helps my valence fits

Status

Ready to go

ntBre · 2024-06-27T02:21:51Z

After thinking more about this, possibly a better solution would be for us (or actually a user) to subclass PortalClient and override the various get_{singlepoints,optimizations,torsiondrives} methods with cached versions. That would avoid having to add the record_cache field to the standard PortalClient, which felt weird to me, and revert all of these changes because they would be handled by providing this new CachedPortalClient. I'll try this approach tomorrow.

ntBre · 2024-06-27T17:25:52Z

Here's an example of the PortalClient subclass. Unfortunately, I also found out that the no_internet trick is not infallible. I think the fact that the PortalClient opens a requests.Session and holds onto it means that it can access the internet without going back through socket.socket. However, I monitored internet traffic with Wireshark and made sure that this code does not access the internet on the second to_records call, which is also verified by setting _req_session to None.

import os
import shutil

from qcportal import PortalClient
from qcportal.cache import RecordCache, get_records_with_cache
from qcportal.optimization import OptimizationRecord

from openff.qcsubmit._tests.utils.test_manager import no_internet
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.utils.utils import portal_client_manager


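class CachedPortalClient(PortalClient):
    """PortalClient that serves records from a local RecordCache when possible."""

    def __init__(self, addr, cache_dir, **client_kwargs):
        super().__init__(addr, cache_dir=cache_dir, **client_kwargs)
        # Keep a writable record cache inside the client's cache directory.
        self.record_cache = RecordCache(
            os.path.join(self.cache.cache_dir, "cache.sqlite"), read_only=False
        )

    def get_optimizations(self, record_ids, missing_ok=False, include=None):
        # missing_ok and include are accepted to match the parent signature but
        # are ignored in this example; the include list is hard-coded below.
        return get_records_with_cache(
            self,
            self.record_cache,
            OptimizationRecord,
            record_ids,
            include=["initial_molecule", "final_molecule"],
        )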


if os.path.exists("api.qcarchive.molssi.org_443"):
    shutil.rmtree("api.qcarchive.molssi.org_443")

opt = OptimizationResultCollection.parse_file("tiny-opt-dataset.json")
client = CachedPortalClient("https://api.qcarchive.molssi.org:443", cache_dir=".")

with portal_client_manager(lambda _: client):
    client._req_session = None
    with no_internet():
        print(len(opt.to_records()))

This code runs on the branch for #284 without any changes to qcsubmit. The implementation of CachedPortalClient basically needs to know which methods we call on PortalClient internally, though, so I think it might make sense for us to provide this class even though a user could define it.
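The caveat about no_internet above can be illustrated with a minimal sketch. It assumes the helper works by replacing socket.socket, which is what the comment suggests but is not confirmed here; no_new_sockets and demo are hypothetical names, not part of qcsubmit. A socket opened before the block keeps working inside it, which is analogous to a requests.Session holding an already open connection.

```python
import socket
from contextlib import contextmanager


@contextmanager
def no_new_sockets():
    """Hypothetical stand-in for a no_internet helper: block creation of new sockets."""
    original = socket.socket

    def guard(*args, **kwargs):
        raise RuntimeError("network access is disabled in this block")

    socket.socket = guard
    try:
        yield
    finally:
        # Always restore the real constructor, even if the body raised.
        socket.socket = original


def demo():
    # Opened before the block, like a connection held by a live session.
    a, b = socket.socketpair()
    try:
        with no_new_sockets():
            a.sendall(b"ping")      # existing sockets are unaffected by the patch
            received = b.recv(4)
            try:
                socket.socket()     # creating a new socket is blocked
            except RuntimeError:
                blocked = True
            else:
                blocked = False
    finally:
        a.close()
        b.close()
    return received, blocked
```

This is why setting _req_session to None (or watching traffic directly) is a stronger check than the socket patch alone.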

ntBre · 2024-07-10T19:42:39Z

I've updated the implementation to use the CachedPortalClient from the code block above. We discussed making the class private, but I left it public for now because I think it might be useful for anyone wanting to use additional PortalClient kwargs with the portal_client_manager. I'm happy to make it private if you prefer, though. That's a pretty advanced use case and maybe not one we want to support in the public API.

I have not added any tests specifically for the caching yet, but I think it's a great first sign that the tests pass after replacing the default PortalClient with this cached version. I also tried intentionally introducing errors in each of the new methods to make sure they were actually being called, so they are all covered by existing tests at least.

One issue I've run into with testing this is that each test uses the same default cache_dir. I tried adding a pytest fixture to delete the cache dir before each test ran, but this caused disastrous issues with pytest-xdist running multiple tests in parallel (one would be trying to write to the cache while the next test tried deleting the dir). I think this means that fully testing the caches will be more involved than I hoped. It also means that I currently have to delete the cache dir locally between each test run, or the tests just use the cache. That won't be an issue in CI at least. My best idea for fixing this currently is to modify the tests using PortalClients more invasively to wrap the code in portal_client_managers with temporary cache dirs so that each test can run fully independently and clean itself up afterward.
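The per-test isolation idea described above could look roughly like the following sketch. It uses only the standard library; isolated_cache_dir is a hypothetical helper name, and the commented usage line assumes the CachedPortalClient(addr, cache_dir=...) signature shown earlier in this thread.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_cache_dir():
    """Yield a fresh directory that is deleted when the block exits."""
    with tempfile.TemporaryDirectory() as tmp:
        yield Path(tmp)


# Hypothetical usage inside a test (not executed here):
#
# with isolated_cache_dir() as cache_dir:
#     with portal_client_manager(
#         lambda addr: CachedPortalClient(addr, cache_dir=cache_dir)
#     ):
#         records = collection.to_records()
```

Because every test gets its own directory, parallel pytest-xdist workers never write to, or delete, a directory another test is using. In a pytest suite, the built-in tmp_path fixture would serve the same purpose.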

ntBre · 2024-07-11T18:49:53Z

The tests should now be passing with temporary cache directories everywhere CachedPortalClient is used. In my local tests I was seeing ignored instances of the error from MolSSI/QCFractal#844, but it sounds like that has been fixed in the next version (0.56) of qcportal.

Now, I think I just need to update these tests to check that the caching actually worked.

ntBre · 2024-07-12T19:31:30Z

The tests have been updated, and I copied over (and modified) documentation from qcportal. It should be ready for review!

j-wags

Thanks @ntBre. We reviewed this during our check-in this week and I summarized our discussion in the comments. This is good to merge once the (blocking) items are resolved and the release notes are updated!

j-wags · 2024-07-15T18:57:41Z

openff/qcsubmit/utils/utils.py

+            returns a list of records.
+        """
+        if missing_ok:
+            logger.warning("missing_ok provided but unused by CachedPortalClient")


(not blocking) Maybe something more like "missing_ok was set to True, but CachedPortalClient doesn't actually support this so it's being set to False"
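A minimal sketch of the more explicit warning suggested above, assuming the method simply forces the flag to False. The logger name and the normalize_missing_ok helper are hypothetical and only illustrate the wording, not the merged implementation.

```python
import logging

logger = logging.getLogger("cached_client_sketch")


def normalize_missing_ok(missing_ok: bool) -> bool:
    """Warn if missing_ok was requested and return the value actually used."""
    if missing_ok:
        logger.warning(
            "missing_ok was set to True, but CachedPortalClient doesn't "
            "actually support this so it's being set to False"
        )
    # The cached code path in this sketch never honors missing_ok.
    return False
```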

j-wags · 2024-07-15T19:39:48Z

openff/qcsubmit/utils/utils.py



 def _default_portal_client(client_address) -> PortalClient:
-    return PortalClient(client_address)
+    return CachedPortalClient(client_address, cache_dir=_DEFAULT_CACHE_DIR)


(blocking) Given that this may not actually be thread-safe, let's make caching not be the default. (or implement a test using multiprocessing or something to ensure that two different processes with different clients but the same cache dir don't trip each other up).

ntBre · 2024-07-16

Good call on this. This code fails with a "database is locked" error:

from multiprocessing import Pool

from qcportal import PortalClient

from openff.qcsubmit.results import TorsionDriveResultCollection
from openff.qcsubmit.utils import CachedPortalClient, portal_client_manager

ds = TorsionDriveResultCollection.parse_file(
    "/home/brent/omsf/projects/valence-fitting/02_curate-data/datasets/supp-td.json"
)
datasets = [ds, ds, ds, ds]

client = CachedPortalClient("https://api.qcarchive.molssi.org:443", cache_dir=".")

with portal_client_manager(lambda _: client), Pool(4) as pool:
    for res in pool.imap_unordered(
        TorsionDriveResultCollection.to_records, datasets, 1
    ):
        print(len(res))

It also fails without the portal_client_manager, when CachedPortalClient is simply the default client, so it's not safe to share the same client across threads or to construct multiple clients accessing the same database.

It actually doesn't quite work most of the time, even with a regular PortalClient with or without a cache dir, but some of the processes have read timeout errors rather than the more dramatic database errors from the cached version. It's probably best in general not to share a PortalClient across threads.
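For readers unfamiliar with the "database is locked" error, here is a generic SQLite illustration using only the standard library. It is not the qcportal cache code; it assumes only that the cache is an SQLite file (as the cache.sqlite name in the earlier example suggests) and shows what happens when a second connection tries to write while the first holds the write lock. second_writer_error is a hypothetical name.

```python
import os
import sqlite3
import tempfile


def second_writer_error():
    """Return the error message seen by a second writer, or None if it succeeded."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cache.sqlite")
        # timeout=0: fail immediately instead of waiting for the lock.
        # isolation_level=None: transactions are controlled explicitly below.
        first = sqlite3.connect(path, timeout=0, isolation_level=None)
        second = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            first.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
            first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
            first.execute("INSERT INTO records VALUES (1)")
            try:
                second.execute("BEGIN IMMEDIATE")  # second writer cannot get the lock
            except sqlite3.OperationalError as exc:
                return str(exc)
            finally:
                first.execute("ROLLBACK")
            second.execute("ROLLBACK")
            return None
        finally:
            first.close()
            second.close()
```

With several processes writing to one cache file, the same contention can arise whenever one writer holds the lock longer than another is willing to wait.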

Using with portal_client_manager(PortalClient), Pool(4) as pool: instead (dropping the lambda that shares a single client, which is the default your suggestion restores) has worked on every run I've tried so far, so I think that's definitely the right default.

Do you think it's worth mentioning something about this in the documentation for portal_client_manager? The docs currently show an example of a function that returns a new PortalClient each time (which should be safe) rather than a lambda _: client. I'm not really sure how to warn against this without showing the lambda code in the docs, though.

j-wags · 2024-07-16

Thanks for the update and for checking on multithreading. I've read this a few times and think I kinda understand, and my preference is to do the simplest safe thing. So I agree with reverting the default behavior. Updating the docs is optional; you might just copy your comment "It's not safe to share the same client across threads or to construct multiple clients accessing the same database" into the portal_client_manager docstring and call it a day.
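The difference between the two factory styles discussed in this thread can be sketched without any network access. FakeClient is a hypothetical stand-in for PortalClient, and shared_factory and fresh_factory are illustrative names; the only point is which calls return the same object.

```python
class FakeClient:
    """Hypothetical stand-in for PortalClient, used only to show object identity."""

    def __init__(self, address):
        self.address = address


ADDRESS = "https://api.qcarchive.molssi.org:443"
shared = FakeClient(ADDRESS)


def shared_factory(address):
    # Equivalent in spirit to `lambda _: client`: every caller gets the SAME object.
    return shared


def fresh_factory(address):
    # Equivalent in spirit to passing the class itself: every call builds a new client.
    return FakeClient(address)
```

Passing a factory that constructs a new client per call is the pattern that worked in the multiprocessing runs above; the shared-object pattern is the one the docstring warning is meant to discourage.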

ntBre added 2 commits June 26, 2024 15:19

try storing qcportal cache on clients

8f8c4d9

generate a filename not just directory

4c649b2

ntBre added 11 commits July 9, 2024 10:20

Revert "generate a filename not just directory"

6d44b85

This reverts commit 4c649b2.

Revert "try storing qcportal cache on clients"

6f3db65

This reverts commit 8f8c4d9.

Merge branch 'main' into get_records_with_cache

02e8484

add outline of CachedPortalClient with signatures from qcportal

da52d5e

default to using the cache

6555c20

hoping to trigger test failures

fill in todos, ensure record_ids are sequences

2a9ec5b

if given a single id, these methods have to return a single record

05e1d9b

comment out non-functional get_molecules, add note about root issue

fccfb9c

ensure unpack has a value

ce64b0f

delete get_molecules after ben's confirmation at the meeting

859ec44

factor out default cache dir

09f76c3

ntBre added 5 commits July 11, 2024 13:08

break the tests to find where updates are needed

f8ae094

re-export CachedPortalClient

e408f6d

fix get_optimizations and filter test

b119fcd

tests passing again

3041e13

unused import

af7b6f7

ntBre added 7 commits July 11, 2024 17:03

add private _no_session manager for CachedPortalClient

c4507c6

check that the cache works for optimizations and singlepoints

be7babe

start CachedPortalClient docs, copy init signature from qcportal

a740add

override __repr__

89ba82f

copy over and update docs from qcportal

9850fa3

actually add CachedPortalClient to the docs page

0f7c65e

fix link (I hope)

136912b

ntBre marked this pull request as ready for review July 12, 2024 19:29

ntBre requested a review from j-wags July 12, 2024 19:31

j-wags approved these changes Jul 15, 2024

ntBre added 5 commits July 16, 2024 14:20

test that different clients can share a cache_dir

9503c21

restore PortalClient to default, remove now-unused default cache dir

fb57acb

make CachedPortalClient private

bfa7e95

update warning message to be more explicit

938bba9

fix unused

5f3ff4e

ntBre commented on docs/api.rst Jul 16, 2024 (resolved)

ntBre added 2 commits July 17, 2024 10:55

add a warning to portal_client_manager

c63daab

move warning up

9ab6ed6

ntBre merged commit ccedbd4 into main Jul 18, 2024
6 checks passed

ntBre mentioned this pull request Jul 22, 2024

Delete unnecessary test and context manager #292

Merged

1 task
