
Local object cache for cached_httpfs extension #88

Merged: 3 commits merged into main, Oct 10, 2024

Conversation

@mkaruza (Collaborator) commented Aug 2, 2024

  • To cache a remote object, the user needs to explicitly call the duckdb.cache(path, type) function. Path is a remote HTTPFS/S3/GCS/R2 object path and type is either parquet or csv, indicating the remote object type. Caching will not be triggered by normal SELECT queries (see the example below).

  • When a file is currently being cached and another query targeting the same remote object runs in parallel, the second query will fail because it will not be able to take a read lock.

  • PG_DATADIR/duckdb_cache will be used as the cache directory
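For example, caching a remote Parquet file would look like this (a sketch assuming the duckdb.cache(path, type) signature described above):

postgres=# SELECT duckdb.cache('https://datasets.clickhouse.com/hits_compatible/hits.parquet', 'parquet');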

@mkaruza mkaruza requested a review from Tishj August 2, 2024 17:16
@mkaruza mkaruza force-pushed the cached_httpfs branch 2 times, most recently from c891c27 to 47cde7d on August 2, 2024 17:22

@mkaruza (Collaborator, Author) commented Aug 2, 2024

FYI, if the first connection is caching a remote file:

postgres=# select duckdb.cache('https://datasets.clickhouse.com/hits_compatible/hits.parquet');

Until this file is downloaded, all other queries on the same remote file will fail with:

postgres=# select count(*) from read_parquet('https://datasets.clickhouse.com/hits_compatible/hits.parquet') AS (v INT);
WARNING:  (DuckDB) IO Error: Could not set lock on file "/tmp/871ec48c8679b179e848071df4cfc5ef": Conflicting lock is held in /work/hydra/postgres/src/backend/postgres (PID 22174) by user mkaruza. See also https://duckdb.org/docs/connect/concurrency
ERROR:  Function `read_parquet(TEXT)` only works with Duckdb execution.
CONTEXT:  PL/pgSQL function read_parquet(text) line 3 at RAISE

@mkaruza mkaruza force-pushed the cached_httpfs branch 3 times, most recently from 8f948c9 to e1572de on August 4, 2024 06:14

@hlinnaka (Collaborator) commented Aug 4, 2024

This seems useful in general, and not specific to pg_duck. How about doing this in the main duckdb repo, or as a separate extension?

@wuputah (Collaborator) commented Aug 4, 2024

> Until this file is downloaded all other queries on same remote file will fail

I understand the premise here, but ideally the query would detect the lock and then continue but access the file "directly" as if the caching wasn't present.

Another thing to consider: what if the original cache call is cancelled or otherwise interrupted? Aside from the lock, how does the cache ensure the file it finds is the complete file?
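One possible shape of that fallback, purely as an illustrative sketch (TryOpenCachedFile and the httpfs file-system handle are hypothetical placeholders, not part of this PR):

	// Hypothetical sketch: try to take the cache read lock without blocking; if another
	// backend is still populating the cache, bypass it and read the remote object directly.
	auto handle = TryOpenCachedFile(path); // hypothetical helper: returns nullptr if the read lock is unavailable
	if (!handle) {
		// Fall back to plain remote reads, as if caching wasn't present.
		handle = httpfs.OpenFile(path, duckdb::FileFlags::FILE_FLAGS_READ);
	}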

@Tishj (Collaborator) commented Aug 5, 2024

The test fails because we changed how replacement scan names are assigned; ReplacementScanInput now has:

	const string &catalog_name;
	const string &schema_name;
	const string &table_name;

To get back the old behavior, you can use auto table_name = ReplacementScan::GetFullPath(input);
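In a pg_duckdb-style replacement scan callback that would look roughly like the following (a sketch; the callback name and surrounding logic are illustrative, only the GetFullPath call comes from the comment above):

	static duckdb::unique_ptr<duckdb::TableRef>
	PostgresReplacementScan(duckdb::ClientContext &context, duckdb::ReplacementScanInput &input,
	                        duckdb::optional_ptr<duckdb::ReplacementScanData> data) {
		// Rebuild the single path string that the old callback signature passed directly.
		auto table_name = duckdb::ReplacementScan::GetFullPath(input);
		if (table_name.empty()) {
			return nullptr; // nothing to replace
		}
		// ... existing pg_duckdb lookup based on table_name ...
		return nullptr;
	}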

@Tishj (Collaborator) commented Aug 5, 2024

I'd like to see this PR split into 3:

  • Upgrade to newer duckdb version, apply the necessary changes to fix breaking tests
  • Copy HTTPFS extension to third_party/cached_httpfs without modifications, add the necessary changes to pg_duckdb for it
  • Add CachedFile and friends to the CachedHttpfsExtension, add additional changes to pg_duckdb relevant for this caching mechanism

@JelteF (Collaborator) commented Aug 5, 2024

> This seems useful in general, and not specific to pg_duck. How about doing this in the main duckdb repo, or as a separate extension?

Yeah, I think it makes much more sense to include this as a feature in the main httpfs extension, as opposed to maintaining a fork of that extension in this repo.

@wuputah (Collaborator) commented Aug 5, 2024

Typically duckdb does not need this type of caching because when you use duckdb, you have access to the local filesystem. We can consider contributing it back upstream at some point but the decision was made to work on this here since it's more relevant to this use case.

> Apply patches to that copied source to add the caching modifications

I get the point of being able to have the original code so we can diff it. Keeping this up to date with the upstream code is definitely a concern, but working with patches doesn't sound like fun either.

@JelteF (Collaborator) commented Aug 5, 2024

> Typically duckdb does not need this type of caching because when you use duckdb, you have access to the local filesystem.

I don't understand this argument (but I might be missing something). AFAICT the data requested by httpfs is by definition not on the local filesystem (unless you use httpfs to connect to some webserver on localhost, of course). So providing a way to cache remote files onto the local filesystem seems pretty useful even for plain duckdb users of httpfs.

@wuputah wuputah added this to the 0.1.0 milestone Aug 5, 2024
@mkaruza mkaruza changed the base branch from main to third_party_httpfs August 7, 2024 08:47
@mkaruza mkaruza changed the title Adding cached_httpfs extension Local object cache for cached_httpfs extension Aug 7, 2024
}

CachedFile::~CachedFile() {
	if (!initialized && handle) {

Review comment (Collaborator):

The handle should probably be closed before we remove the file.

Review comment (Collaborator):

Also, I don't think this condition is correct; we use initialized to indicate that the file is ready for reading, not that it was created by this process.

Even when it was created by this process, that doesn't necessarily mean the file can be removed, as another process can still be using it. I'm not entirely sure what the condition should be here, but perhaps we need an interprocess reference count for the file.

I guess if we try to remove the file while another process is still using it, do we hang until it's closed, or is the file unlinked and scheduled for removal when the other process closes it?

Review comment (Collaborator):

We could be hitting open file descriptor limits if we're caching any decent number of files, which is also something to keep in mind, but perhaps figuring that out is a future PR.

Review comment (@mkaruza, Collaborator, Author):

When the file doesn't exist, i.e. when we cache a file, it is opened with a WRITE_LOCK, so no other process can open the file until that lock is released.
This logic is intended to clean up the file if the user interrupts caching (initialized = false & file handle != null), but since we don't have the ability to do that for now, I'll remove this logic from the destructor.
As discussed, improvements for this extension will be done later, but it is good that we keep track of potential problems.
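If that cleanup were kept, addressing the earlier point about closing the handle first could look roughly like this (a sketch only; the fs and path members are assumed, not taken from the PR):

	CachedFile::~CachedFile() {
		// Sketch: clean up a partially written cache file when caching was interrupted
		// before the file became readable (initialized == false).
		if (!initialized && handle) {
			handle->Close(); // close the handle before removing the file, per the review comment
			handle.reset();
			fs.RemoveFile(path); // assumed members: the owning FileSystem and the on-disk cache path
		}
	}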

@@ -323,17 +334,17 @@ unique_ptr<ResponseWrapper> HTTPFileSystem::GetRequest(FileHandle &handle, strin
 		hfh.state->total_bytes_received += data_length;
 	}
 	if (!hfh.cached_file_handle->GetCapacity()) {
-		hfh.cached_file_handle->AllocateBuffer(data_length);
+		hfh.cached_file_handle->Allocate(data_length);

Review comment (Collaborator):

Are we making sure that these are initialized files earlier on?
Since CachedFile::Allocate will throw an InternalException if initialized is set, this should be verified elsewhere or be downgraded to a regular exception, as InternalExceptions are unrecoverable.

Review comment (@mkaruza, Collaborator, Author):

Most of the code around file read/write/reserve is copied from the in-memory cache, which needs to check these conditions, but in our case the cache file is not available until it is completely downloaded. The methods used in file caching are only called from the GET request, which is issued once we hold the WRITE_LOCK and an opened file handle. Changed all these checks to asserts. This comment applies to the next review change request as well.
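As a sketch of what "changed all these checks to asserts" might look like (the method body is illustrative; only the initialized flag and the Allocate name come from the discussion):

	void CachedFileHandle::Allocate(idx_t capacity) {
		// Sketch: allocation is only valid while the file is still being downloaded.
		// Once the file is marked initialized it is read-only, so the old
		// "throw if initialized" check becomes an internal assertion.
		D_ASSERT(!file->initialized);
		// ... allocate the backing storage of `capacity` bytes as before ...
	}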

}

void CachedFileHandle::GrowFile(idx_t new_capacity, idx_t bytes_to_copy) {
	file->handle->Trim(bytes_to_copy, new_capacity);

Review comment (Collaborator):

Does this not need the same file->initialized check?

	appendStringInfo(duckdb_data_directory, "%s/%s", DataDir, directory_name.c_str());

	if (!CheckDirectory(duckdb_data_directory->data)) {
		if (mkdir(duckdb_data_directory->data, S_IRWXU | S_IRWXG | S_IRWXO) == -1) {

Review comment (Collaborator):

Should use MakePGDirectory instead.

Suggested change:
-		if (mkdir(duckdb_data_directory->data, S_IRWXU | S_IRWXG | S_IRWXO) == -1) {
+		if (MakePGDirectory(duckdb_data_directory->data) == -1) {
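For context, MakePGDirectory() creates the directory using Postgres' configured data-directory permission mode rather than a hard-coded mask. A minimal error-handling sketch around the suggestion (the ereport wording is illustrative, not from the PR):

	if (!CheckDirectory(duckdb_data_directory->data)) {
		if (MakePGDirectory(duckdb_data_directory->data) == -1) {
			ereport(ERROR,
			        (errcode_for_file_access(),
			         errmsg("could not create duckdb cache directory \"%s\": %m",
			                duckdb_data_directory->data)));
		}
	}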

} // namespace pgduckdb

void
DuckdbCreateCacheDirectory() {

@JohnHVancouver commented Aug 19, 2024:

> PG_DATADIR/duckdb_cache will be used as cache directory

Is the reason that you're choosing to create it in $PGDATA for convenience (e.g. directory guaranteed to exist with correct permission) or are there other reasons?

I think one implication of this is that backups such as pg_basebackup will also backup these cached files. I'm not sure if that's intentional or not, but if you expect the cache to be relatively large compared to the actual Postgres data files this might bloat backup size?

Another wrinkle: if users are running pg_upgrade with --link mode, they would need to do some processing beforehand to retain the cached files, but maybe that's not as big of a deal since it's relatively rare.

e.g.

mv $PGDATA/duckdb_cache duckdb_cache
pg_upgrade --old-datadir $PGDATA --new-datadir $NEW_PGDATA
mv duckdb_cache $NEW_PGDATA/duckdb_cache

Could always make the path configurable alongside #105, I suppose.

Review comment (Collaborator):

I think PG_DATADIR is the only location we can be sure Postgres can write to, aside from mktemp. But the intent was not to have that directory persist in backups, so that's a good point. A configurable option would be good.

@mkaruza mkaruza force-pushed the cached_httpfs branch 2 times, most recently from acd0366 to 8b05b25 on September 6, 2024 07:48
@JelteF JelteF added the enhancement (New feature or request) label Sep 30, 2024
@wuputah wuputah requested review from JelteF and wuputah October 4, 2024 16:16
* To cache a remote object, the user needs to explicitly call the
  `duckdb.cache(path, type)` function. Path is a remote HTTPFS/S3/GCS/R2
  object path and type is either `parquet` or `csv`, indicating the remote
  object type. Caching will not be triggered by normal `SELECT` queries.

* When a file is currently being cached and another query targeting the
  same remote object runs in parallel, the second query will fail because
  it will not be able to take a read lock.

* PG_DATADIR/duckdb_cache will be used as the cache directory
* When a MetadataCache entry for a file is found, we need to reuse the CachedFile
  information for reading in the thread.
@wuputah wuputah merged commit 249cae8 into main Oct 10, 2024
4 checks passed
@wuputah wuputah deleted the cached_httpfs branch October 10, 2024 14:53

Labels: enhancement (New feature or request)
Project status: Done
6 participants