Google cloud run volume mounting #2133

loveeklund-osttra · 2024-12-11T08:45:30Z

Documentation description

I run DLT in google cloudrun and have noticed when I load big tables it can get OOM even if it writes to files, as cloudrun doesn't have any "real" storage. What I've been doing instead is mounting a storage bucket and using pipeline_dir to direct the pipeline to use that as the directory. This seems to work well for me in the cases I've tested. But I've also seen that there are limitations with mounting a storage bucket as a directory, listed here https://cloud.google.com/run/docs/configuring/jobs/cloud-storage-volume-mounts . It would be good to have someone who knows how DLT works under the hood take a look at this and see if these limitations might cause issues (For example if two or more processes/ threads would write to the same file etc). If the limitations wouldn't cause issues I think it would be nice to include a section about it here
https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-run
to help other in the future.

Are you a dlt user?

Yes, I run dlt in production.

rudolfix · 2024-12-12T09:23:58Z

@loveeklund-osttra dlt works correctly on mounted volumes. indeed there are limitations ie. lack of atomic renames. we are pretty sure we work with GCP mounted volumes: Composer maps data folder like this we support keeping working directories there.

lack of atomic renames may be a problem in certain cases. in your case AFAIK you'll start with always clean storage so you should not have problems with half commited files from previous runs.

you can also look at #2131 if you can read your data in the same order you'll be able to extract it chunk by chunk

loveeklund-osttra · 2024-12-12T10:30:20Z

Perfect, then I'll continue using it! Thanks for response!

I don't think it start with an empty folder if I just mount a volume, ( it uses same directory between runs and I can see previous runs content in storage after it is complete) but I call pipeline.drop() in the start anyways so shouldn't be a problem for me.

If you decide to add this into the documentation it can be worth mentioning that you probably want to up the rename-dir-limit . Otherwise it shows an error( I assume from when it tries to move files between directories), but complete anyways.

I did it like this in terraform

     volume_mounts {
          mount_path = "/usr/src/ingestion/pipeline_storage"
          name       = "pipeline_bucket"
        }
      volumes {
        name = "pipeline_bucket"
        gcs {
          bucket    = google_storage_bucket.dlt_pipeline_data_bucket.name
          read_only = false
          mount_options = [
            "rename-dir-limit=100000"
          ]
        }
      }

Also described here
https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts .

Re #2131
Most of my data is from sql-databases / apis and I have experimented a bit there with chunking using limit. One thing noticed with that though is that it seems like it "recreates" the generator, so It will run the query again every time I call pipeline.run() . So I don't really know when to stop updating, I guess I could do something like loaded_rows < limit, but a more "inbuilt" way to do it would also be nice, but might be super difficult to actually implement (haven't looked into the code).

rudolfix · 2024-12-13T14:33:31Z

thx for the doc suggestion. I added that to our performance guide.

yes we recreate the whole extract step when you use add_limit. if you have incremental loading enabled, we are able to do a WHERE query that will skip already loaded rows. make sure you set row_order to ASC in incremental. (and you need to wait for the PR above to be merged...)

then you can run your pipeline in a loop as long as there's a load_package in load_info you get from pipeline.run. no load package indicates no more data. you can also inspect the load metrics in load_info to get the number of rows processed

loveeklund-osttra · 2024-12-16T14:28:33Z

He, just wanted to come back to this, I've run with fuse now for a while and noticed that while it writes out files to storage buckets, it seems like it doesn't "flush" the files from memory ( it is as if they stay in a cache until the process quits). Meaning that it can still go OOM if the total size of all files loaded if more than the container limit. This seems to not be tied to DLT, I have tested just writing out files with a simple python script and I experience the same thing there. There might be some setting that I've missed in configuring GCSfuse, I'm looking into it. I just thought I'd let you know so you can adjust any documentation accordingly.

github-project-automation bot added this to dlt core library Dec 11, 2024

github-project-automation bot moved this to Todo in dlt core library Dec 11, 2024

rudolfix added the question Further information is requested label Dec 12, 2024

rudolfix moved this from Todo to In Progress in dlt core library Dec 12, 2024

rudolfix self-assigned this Dec 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Google cloud run volume mounting #2133

Google cloud run volume mounting #2133

loveeklund-osttra commented Dec 11, 2024

rudolfix commented Dec 12, 2024

loveeklund-osttra commented Dec 12, 2024

rudolfix commented Dec 13, 2024

loveeklund-osttra commented Dec 16, 2024

Google cloud run volume mounting #2133

Google cloud run volume mounting #2133

Comments

loveeklund-osttra commented Dec 11, 2024

Documentation description

Are you a dlt user?

rudolfix commented Dec 12, 2024

loveeklund-osttra commented Dec 12, 2024

rudolfix commented Dec 13, 2024

loveeklund-osttra commented Dec 16, 2024