Replies: 2 comments
-
The S3 ListObjects operation behaves differently from listing local storage: the S3 list rate is really low, which makes listing the CC store really painful. Basically, you are suggesting adding another prefix level to the file path, ending up with 280 entries (one per station) at the top level. So I agree on 1, 3 and 4. For 2, I think it would make StackStore more intuitive. Might be helpful for @kuanfufeng?
-
Closing because these changes have been implemented.
-
I'd like to consider some changes to the store interfaces and the directory structures we use.
The main driver for these changes is the size of the data sets we are working with as we move to cloud scale. E.g., 280 stations -> ~40k station pairs per day -> 14M files a year. This means that operations like listing the files in a cloud bucket become very expensive. We need to strike a balance between making a large number of small calls and making a few very large calls. With this in mind, here are the proposed changes:
1. Add a station-level prefix to the directory structure, with the station pair as the next level (e.g. `CI.ARV/CI.ARV_CI.BAK/`). This way each level has a manageable number of entries. For example, to retrieve the set of computed CCs we can make ~280 parallel calls where each one fetches ~100k entries per year. In contrast, with the current layout we either make 40k calls or one call that fetches 14M objects/year.
2. Include the timespan in the stack filename. Today when we stack from `t1` to `t2` we output a file named by the pair, e.g. `CI.ARV_CI.BAK.tar.gz`. The proposal is to use the above structure, where the pair is the parent directory and the timespan is the file name. E.g., if we stack a month we would have: `CI.ARV/CI.ARV_CI.BAK/2023_07_01T2023_07_31.tar.gz`. This allows us to: a) know the stacked timespan from the filename, and b) stack other timespans and store them in the same bucket.
3. Change the `get_timespans()` method in both CC and Stack stores to take a station pair as an argument. This means we'll be able to query the timespans available for a given station pair (i.e. `get_timespans(pair)`), but not all timespans available across the entire store. Supporting the latter would require us to list the potentially hundreds of millions of objects in a bucket in one call.
4. Similarly, the `read_stacks` method would require a `timespan` argument.

Note that a nice side effect of these changes is that both CC and Stack stores have the same directory structure and file naming, and the interfaces become very similar and consistent. This will help simplify both their implementation and their use.
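To make the numbers and the layout concrete, here is a rough sketch of the arithmetic and of how an object key could be built. The helper names (`timespan_name`, `object_key`) are made up for illustration, the timespan format is inferred from the single example filename above, and counting auto-correlations as pairs is an assumption.

```python
from datetime import date
from math import comb

STATIONS = 280
DAYS_PER_YEAR = 365

# Unordered station pairs; adding STATIONS assumes auto-correlations are stored too.
PAIRS = comb(STATIONS, 2) + STATIONS      # 39,340, i.e. roughly 40k
FILES_PER_YEAR = PAIRS * DAYS_PER_YEAR    # roughly 14M with one file per pair per day


def timespan_name(start: date, end: date) -> str:
    """Format a timespan like 2023_07_01T2023_07_31 (format inferred from the example)."""
    return f"{start:%Y_%m_%d}T{end:%Y_%m_%d}"


def object_key(src: str, rec: str, start: date, end: date) -> str:
    """Hypothetical helper: <station>/<pair>/<timespan>.tar.gz."""
    return f"{src}/{src}_{rec}/{timespan_name(start, end)}.tar.gz"


if __name__ == "__main__":
    print(PAIRS, FILES_PER_YEAR)
    print(object_key("CI.ARV", "CI.BAK", date(2023, 7, 1), date(2023, 7, 31)))
```

With this shape, listing everything under one station is a single prefix query on `CI.ARV/`, and listing one pair is a prefix query on `CI.ARV/CI.ARV_CI.BAK/`.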
See this RFC PR for the proposed interface changes: #230
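For readers who don't want to open the PR, here is a toy in-memory sketch of what a pair-scoped interface might look like. This is not the code from the RFC: the class name, the `append` method, and the argument order are all assumptions, and a dict stands in for the bucket. Only `get_timespans(pair)` and a `read_stacks` that takes a timespan come from the proposal above.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
SUFFIX = ".tar.gz"


class InMemoryStore:
    """Toy stand-in for a CC or Stack store using the proposed key layout."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}  # key -> payload, plays the role of the bucket

    @staticmethod
    def _pair_prefix(pair: Pair) -> str:
        src, rec = pair
        return f"{src}/{src}_{rec}/"

    def append(self, timespan: str, pair: Pair, data: bytes) -> None:
        # Hypothetical write path: one object per (pair, timespan).
        self._objects[f"{self._pair_prefix(pair)}{timespan}{SUFFIX}"] = data

    def get_timespans(self, pair: Pair) -> List[str]:
        # Only keys under this pair's prefix are scanned; a real store would
        # issue a prefix-scoped list call instead of walking every object.
        prefix = self._pair_prefix(pair)
        return sorted(
            key[len(prefix):-len(SUFFIX)]
            for key in self._objects
            if key.startswith(prefix) and key.endswith(SUFFIX)
        )

    def read_stacks(self, timespan: str, pair: Pair) -> bytes:
        # The timespan is required because it is now part of the object key.
        return self._objects[f"{self._pair_prefix(pair)}{timespan}{SUFFIX}"]
```

Because the pair and the timespan together fully determine the key, a read never needs a listing call, and two stacks of the same pair over different timespans can sit side by side.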