Changes to store interfaces #229

carlosgjs (Maintainer) · 2023-09-28T19:59:40Z

I'd like to consider some changes to the store interfaces and the directory structures we use.

The main driver for these changes is the size of the data sets we are working with as we move to cloud scale. E.g.

280 stations -> ~40k station pairs per day -> ~14M files a year
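The figures above can be checked with a quick back-of-the-envelope sketch. It assumes one file per station pair per day and counts each station paired with itself (auto-correlations); both are assumptions made here for illustration, not something stated in the post.

```python
# Rough check of the pair and file counts quoted above.
n_stations = 280

# Unordered station pairs, including each station paired with itself.
# Including auto-correlations is an assumption; without them the count
# is n * (n - 1) // 2, which is also roughly 40k.
pairs_per_day = n_stations * (n_stations + 1) // 2

# Assumes one output file per pair per day.
files_per_year = pairs_per_day * 365

print(pairs_per_day)   # 39340
print(files_per_year)  # 14359100
```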

This means that operations like listing the files in a cloud bucket become very expensive. We need to strike a balance between making a large number of small calls and making a few very large calls. With this in mind, here are the proposed changes:

1. Store the data in a 3-level hierarchy. E.g.

(omitting HH_MM_SS for brevity)
CI.ARV
        CI.ARV_CI.BAK
                2023_07_01T2023_07_02.tar.gz

This way each level has a manageable number of entries. For example, to retrieve the set of computed CCs we can make ~280 parallel calls where each one fetches ~100k entries per year. In contrast, with the current layout we either make 40k calls or one call that fetches 14M objects/year.

2. Include the timespan in the stack filename. Today when we stack from t1 to t2 we output a file named by the pair, e.g. CI.ARV_CI.BAK.tar.gz. The proposal is to use the above structure, where the pair is the parent directory and the timespan is the file name. E.g., if we stack a month we would have: CI.ARV/CI.ARV_CI.BAK/2023_07_01T2023_07_31.tar.gz. This allows us to: a) know the stacked timespan from the filename, and b) stack other timespans and store them in the same bucket.
3. Change the get_timespans() method in both CC and Stack stores to take a station pair as an argument. This means we'll be able to query the timespans available for a given station pair (i.e. get_timespans(pair)), but not all timespans available across the entire store. Supporting the latter would require listing potentially hundreds of millions of objects in a bucket in one call.
4. Similarly, the read_stacks method would require a timespan argument.
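To make the proposed naming concrete, here is a minimal sketch of how a key in the 3-level layout could be built and how the timespan could be recovered from the file name. The helper names `stack_key` and `parse_timespan` are hypothetical and are not taken from the NoisePy code base; the HH_MM_SS part is omitted, as in the example above.

```python
from datetime import date, datetime
from typing import Tuple

DATE_FMT = "%Y_%m_%d"  # HH_MM_SS omitted for brevity, as in the post
SUFFIX = ".tar.gz"


def stack_key(sta1: str, sta2: str, start: date, end: date) -> str:
    """Build '<station>/<pair>/<timespan>.tar.gz' (hypothetical helper)."""
    pair = f"{sta1}_{sta2}"
    timespan = f"{start.strftime(DATE_FMT)}T{end.strftime(DATE_FMT)}"
    return f"{sta1}/{pair}/{timespan}{SUFFIX}"


def parse_timespan(key: str) -> Tuple[date, date]:
    """Recover (start, end) from the file name alone (hypothetical helper)."""
    filename = key.rsplit("/", 1)[-1]
    if not filename.endswith(SUFFIX):
        raise ValueError(f"unexpected file name: {filename}")
    stem = filename[: -len(SUFFIX)]
    # The timespan part contains only digits and underscores, so 'T' is
    # an unambiguous separator between the two dates.
    start_s, end_s = stem.split("T")
    start = datetime.strptime(start_s, DATE_FMT).date()
    end = datetime.strptime(end_s, DATE_FMT).date()
    return start, end


key = stack_key("CI.ARV", "CI.BAK", date(2023, 7, 1), date(2023, 7, 31))
print(key)  # CI.ARV/CI.ARV_CI.BAK/2023_07_01T2023_07_31.tar.gz
print(parse_timespan(key))
```

Because the timespan lives in the file name, stacks over different timespans for the same pair can sit side by side under the same pair directory.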

Note that a nice side effect of these changes is that both CC and Stack stores have the same directory structure and file naming, and the interfaces become very similar and consistent. This will help simplify both their implementation as well as their use.
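As a rough illustration of what "very similar and consistent" could look like, the sketch below uses one hypothetical in-memory class for both stores, with timespans queried per pair. The class and method names other than `get_timespans` are invented for illustration; the actual proposed interfaces are in the RFC PR, not here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]      # e.g. ("CI.ARV", "CI.BAK")
Timespan = Tuple[str, str]  # e.g. ("2023_07_01", "2023_07_02")


class InMemoryStore:
    """Hypothetical stand-in usable as either a CC store or a stack store."""

    def __init__(self) -> None:
        # pair -> timespan -> payload, mirroring the pair/timespan layout
        self._data: Dict[Pair, Dict[Timespan, bytes]] = defaultdict(dict)

    def append(self, timespan: Timespan, pair: Pair, payload: bytes) -> None:
        self._data[pair][timespan] = payload

    def get_station_pairs(self) -> List[Pair]:
        return sorted(self._data)

    def get_timespans(self, pair: Pair) -> List[Timespan]:
        # Only timespans for one pair; there is deliberately no
        # store-wide "all timespans" query.
        return sorted(self._data.get(pair, {}))

    def read(self, timespan: Timespan, pair: Pair) -> bytes:
        return self._data[pair][timespan]


pair = ("CI.ARV", "CI.BAK")
cc_store = InMemoryStore()
cc_store.append(("2023_07_01", "2023_07_02"), pair, b"daily cc")
cc_store.append(("2023_07_02", "2023_07_03"), pair, b"daily cc")

stack_store = InMemoryStore()
stack_store.append(("2023_07_01", "2023_07_31"), pair, b"monthly stack")

print(cc_store.get_timespans(pair))
print(stack_store.get_timespans(pair))
```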

See this RFC PR for the proposed interface changes: #230

niyiyu (Maintainer) · 2023-09-28T21:19:02Z

The ListObjects operation behaves quite differently from listing files on local storage: the S3 list rate is really low, which makes listing the CC store really painful. Basically you are suggesting adding another prefix level to the file path, so the top level ends up with 280 entries (one per station). So I agree on 1, 3 and 4.

For 2., I think it would make StackStore more intuitive. Might be helpful for @kuanfufeng?
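The effect of the extra prefix level can be illustrated with a small emulation of delimiter-style listing over a flat set of keys. This is only a sketch: `list_level` is a hypothetical helper that scans an in-memory list and is not an S3 client call, and the station code CI.CWC is made up for the example.

```python
from typing import Iterable, List


def list_level(keys: Iterable[str], prefix: str = "") -> List[str]:
    """Return the distinct entries one level below `prefix` (emulation only)."""
    entries = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            # Keep only the next path component; a trailing "/" marks a
            # sub-level rather than a file.
            head, sep, _ = rest.partition("/")
            entries.add(head + sep)
    return sorted(entries)


keys = [
    "CI.ARV/CI.ARV_CI.BAK/2023_07_01T2023_07_02.tar.gz",
    "CI.ARV/CI.ARV_CI.BAK/2023_07_02T2023_07_03.tar.gz",
    "CI.ARV/CI.ARV_CI.CWC/2023_07_01T2023_07_02.tar.gz",
    "CI.BAK/CI.BAK_CI.CWC/2023_07_01T2023_07_02.tar.gz",
]

print(list_level(keys))                            # stations
print(list_level(keys, "CI.ARV/"))                 # pairs under one station
print(list_level(keys, "CI.ARV/CI.ARV_CI.BAK/"))   # timespans for one pair
```

Each call returns a bounded number of entries for its level, which is the property the proposed layout is after. A real object store would answer these queries from its index rather than scanning every key as this emulation does.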

niyiyu (Maintainer) · 2024-10-18T16:44:59Z

Closing because these changes have been implemented.
