
Add optional force argument to refresh_continuous_aggregate #7521

Draft: wants to merge 1 commit into main from zilder/force_refresh

Conversation

@zilder (Contributor) commented Dec 5, 2024

Following the SDC. Continuous aggregates that are built on top of tiered hypertables cannot access and materialize tiered data when the `timescaledb.enable_tiered_reads` GUC is disabled at the server level, and there is no way for a user to manually force a refresh with tiered reads enabled. Here we add an optional `force` parameter to the `refresh_continuous_aggregate` procedure that allows the user to partially re-materialize a cagg within a time window that has already been materialized.
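For illustration, a minimal usage sketch, assuming tiered reads are enabled for the session before forcing the refresh (the cagg name and time window are taken from the test below):

```sql
-- Enable tiered reads for this session only, since the server-level GUC is off.
SET timescaledb.enable_tiered_reads = on;

-- Re-materialize a window that was already materialized; without force=>true
-- the call would be skipped as "already up-to-date".
CALL refresh_continuous_aggregate('daily_temp',
                                  '2020-05-04 00:00 UTC',
                                  '2020-05-05 00:00 UTC',
                                  force => true);
```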

Comment on lines 492 to 529
-- Test forceful refresh. Here we simulate the situation that we've seen
-- with tiered data when `timescaledb.enable_tiered_reads` was disabled on the
-- server level. In that case we would not see materialized tiered data and
-- we wouldn't be able to re-materialize the data using a normal refresh call,
-- because it would skip previously materialized ranges, but it should be
-- possible with the `force=>true` parameter. To simulate this use-case we clear
-- the materialization hypertable and forcefully re-materialize it.
SELECT ht.schema_name || '.' || ht.table_name AS mat_ht, mat_hypertable_id FROM _timescaledb_catalog.continuous_agg cagg
JOIN _timescaledb_catalog.hypertable ht ON cagg.mat_hypertable_id = ht.id
WHERE user_view_name = 'daily_temp' \gset
-- Delete the data from the materialization hypertable
DELETE FROM :mat_ht;
SET timezone TO UTC;
-- Run a regular refresh; it should not touch the previously materialized range
CALL refresh_continuous_aggregate('daily_temp', '2020-05-04 00:00 UTC', '2020-05-05 00:00 UTC');
NOTICE: continuous aggregate "daily_temp" is already up-to-date
SELECT * FROM daily_temp
ORDER BY day, device;
day | device | avg_temp
-----+--------+----------
(0 rows)

-- Run it again with force=>true, the data should be rematerialized
CALL refresh_continuous_aggregate('daily_temp', '2020-05-04 00:00 UTC', '2020-05-05 00:00 UTC', force=>true);
SELECT * FROM daily_temp
ORDER BY day, device;
day | device | avg_temp
------------------------------+--------+------------------
Mon May 04 00:00:00 2020 UTC | 0 | 15.7647058823529
Mon May 04 00:00:00 2020 UTC | 1 | 24.3142857142857
Mon May 04 00:00:00 2020 UTC | 2 | 14.8205128205128
Mon May 04 00:00:00 2020 UTC | 3 | 18.1111111111111
Tue May 05 00:00:00 2020 UTC | 0 | 19.3846153846154
Tue May 05 00:00:00 2020 UTC | 1 | 16.5555555555556
Tue May 05 00:00:00 2020 UTC | 2 | 18.5714285714286
Tue May 05 00:00:00 2020 UTC | 3 | 23.5714285714286
(8 rows)

Contributor

It would be good to move those tests to include/cagg_refresh_common.sql so we will be able to test using both the current materialization code (DELETE + INSERT) and the new one (MERGE + DELETE).

Contributor Author

I moved the test to chunk_utils_internal.sql to test it with an fdw chunk. Do you think it is necessary to add a test to cagg_refresh_common? I haven't touched the materialization code, only the part where the invalidation records are combined.

@erimatnor (Contributor)

I think we discussed this previously, and one idea was to invalidate the region covered by the tiered data when the GUC to enable tiered refreshing was changed to "true/on". This would not require changes to our API and is the natural way to make the region refresh with current mechanisms. Was that approach discarded for some reason? If so, can someone explain what disqualified it?
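For context, "adding invalidations" when the GUC flips would conceptually amount to something like the sketch below. This is a rough SQL illustration only, not the proposed implementation (which would live in the C-level GUC callback); the catalog table, its columns, and the psql variables are assumptions about the current catalog layout.

```sql
-- Conceptual sketch: record that the region covered by tiered data is no
-- longer considered materialized, one entry per affected raw hypertable.
-- The range values would use the internal (microsecond) time representation.
INSERT INTO _timescaledb_catalog.continuous_aggs_hypertable_invalidation_log
       (hypertable_id, lowest_modified_value, greatest_modified_value)
VALUES (:raw_hypertable_id, :tiered_range_start, :tiered_range_end);
```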

@zilder (Contributor, Author) commented Dec 11, 2024

@erimatnor

> one idea was to invalidate the region covered by the tiered data when the GUC to enable tiered refreshing was changed to "true/on"

I probably didn't catch that. Do you mean that in the GUC callback we set some sort of a flag that would indicate that we should insert an invalidation record whenever we refresh a range? I see a couple of issues:

  1. A customer decides to change the GUC at the server level. Now, to refresh missing aggregates over tiered data, they would need to turn the GUC "off" and then "on" again.
  2. This would alter the behavior of the refresh procedure for regular caggs (over non-tiered data).

In both cases the UX doesn't feel clear.

@fabriziomello (Contributor)

> I think we discussed this previously, and one idea was to invalidate the region covered by the tiered data when the GUC to enable tiered refreshing was changed to "true/on". This would not require changes to our API and is the natural way to make the region refresh with current mechanisms. Was that approach discarded for some reason? If so, can someone explain what disqualified it?

The problem with this approach is that when the tiered data is already materialized, it will generate unnecessary invalidation log entries.

@erimatnor (Contributor) commented Dec 11, 2024

> I think we discussed this previously, and one idea was to invalidate the region covered by the tiered data when the GUC to enable tiered refreshing was changed to "true/on". This would not require changes to our API and is the natural way to make the region refresh with current mechanisms. Was that approach discarded for some reason? If so, can someone explain what disqualified it?

> The problem with this approach is that when the tiered data is already materialized, it will generate unnecessary invalidation log entries.

> @erimatnor

> one idea was to invalidate the region covered by the tiered data when the GUC to enable tiered refreshing was changed to "true/on"

> I probably didn't catch that. Do you mean that in the GUC callback we set some sort of a flag that would indicate that we should insert an invalidation record whenever we refresh a range? I see a couple of issues:

I meant adding the invalidations in the GUC, not setting a flag. Nothing else needs to change.

> 1. A customer decides to change the GUC at the server level. Now, to refresh missing aggregates over tiered data, they would need to turn the GUC "off" and then "on" again.

Not sure I follow why they'd have to turn it off and on again. I believe GUC callbacks are called when the server config file is parsed as well.

> 2. This would alter the behavior of the refresh procedure for regular caggs (over non-tiered data).

I don't think this would be necessary if the invalidations are added via the GUC callback.

> In both cases the UX doesn't feel clear.

The idea is that this would be completely transparent, with no need for the user to learn about a force parameter, so IMO the UX is better because it "just works" as expected.

With the other approach, the user first has to enable the GUC for tiered caggs to work, then use the force parameter for the first refresh. That seems like a strictly worse UX to me.

@erimatnor (Contributor) commented Dec 11, 2024

> I think we discussed this previously, and one idea was to invalidate the region covered by the tiered data when the GUC to enable tiered refreshing was changed to "true/on". This would not require changes to our API and is the natural way to make the region refresh with current mechanisms. Was that approach discarded for some reason? If so, can someone explain what disqualified it?

> The problem with this approach is that when the tiered data is already materialized, it will generate unnecessary invalidation log entries.

That's just one large invalidation entry per affected cagg. It should not be an issue, right? Also, can't we detect whether it is up-to-date or not and avoid adding the invalidation? If we can't detect it, it would be "correct" to add the invalidation AFAICT.

@fabriziomello (Contributor)

> I think we discussed this previously, and one idea was to invalidate the region covered by the tiered data when the GUC to enable tiered refreshing was changed to "true/on". This would not require changes to our API and is the natural way to make the region refresh with current mechanisms. Was that approach discarded for some reason? If so, can someone explain what disqualified it?

> The problem with this approach is that when the tiered data is already materialized, it will generate unnecessary invalidation log entries.

> That's just one large invalidation entry per affected cagg. It should not be an issue, right?

It is a problem (actually this is exactly the problem I want to avoid) because the current architecture doesn't deal with large ranges very well, and doing that will make things worse for the customer.

> Also, can't we detect whether it is up-to-date or not and avoid adding the invalidation? If we can't detect it, it would be "correct" to add the invalidation AFAICT.

We can check whether some tiered data is already materialized, but that doesn't guarantee that everything is actually materialized, and this is the reason to have this `force` option: to make sure we re-materialize what is needed.
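As a concrete illustration of why such a check is insufficient, a sketch along these lines (the cagg name comes from the test above; the tiering boundary date is hypothetical):

```sql
-- Rows older than the assumed tiering boundary prove that *some* tiered data
-- was materialized, but say nothing about whether the whole range is complete.
SELECT EXISTS (
    SELECT 1
    FROM daily_temp
    WHERE day < '2020-01-01 00:00 UTC'
) AS has_some_materialized_tiered_data;
```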

@erimatnor (Contributor)

> I think we discussed this previously, and one idea was to invalidate the region covered by the tiered data when the GUC to enable tiered refreshing was changed to "true/on". This would not require changes to our API and is the natural way to make the region refresh with current mechanisms. Was that approach discarded for some reason? If so, can someone explain what disqualified it?

> The problem with this approach is that when the tiered data is already materialized, it will generate unnecessary invalidation log entries.

> That's just one large invalidation entry per affected cagg. It should not be an issue, right?

> It is a problem (actually this is exactly the problem I want to avoid) because the current architecture doesn't deal with large ranges very well, and doing that will make things worse for the customer.

Ok, sounds like a bigger problem then. It is just that my alarm bells start to ring when I see that we are trying to build around the current architecture. I'd prefer we fix the underlying problem, but maybe this is not possible in the short term.

> We can check whether some tiered data is already materialized, but that doesn't guarantee that everything is actually materialized, and this is the reason to have this `force` option: to make sure we re-materialize what is needed.

But it will only work for one cagg, so you'd have to manually force refresh all caggs that are connected to a tiered hypertable. If you forget one or more, you are now actually in a bad state where some caggs aren't accurately representing the data even if you refresh normally. In contrast, adding the invalidation will add the proper state to remember that this region is no longer valid and needs a refresh. This is what invalidations are for. But now we are saying we cannot use invalidations because they don't work well. So, now we use them sometimes, but in other times we don't. To me that sounds like a system where you cannot fully trust that a refresh is actually a refresh, which might lead to users always forcing a refresh just to be sure.

Anyway, I am just pointing these things out to make sure we thought this through, but if you are convinced that relying on the user forcing a refresh is the best option right now, then I trust you that there is no better option.

@fabriziomello (Contributor) commented Dec 16, 2024

> I think we discussed this previously, and one idea was to invalidate the region covered by the tiered data when the GUC to enable tiered refreshing was changed to "true/on". This would not require changes to our API and is the natural way to make the region refresh with current mechanisms. Was that approach discarded for some reason? If so, can someone explain what disqualified it?

> The problem with this approach is that when the tiered data is already materialized, it will generate unnecessary invalidation log entries.

> That's just one large invalidation entry per affected cagg. It should not be an issue, right?

> It is a problem (actually this is exactly the problem I want to avoid) because the current architecture doesn't deal with large ranges very well, and doing that will make things worse for the customer.

> Ok, sounds like a bigger problem then. It is just that my alarm bells start to ring when I see that we are trying to build around the current architecture. I'd prefer we fix the underlying problem, but maybe this is not possible in the short term.

Unfortunately no... the short-term solution we're working on is to build some custom jobs that execute the refresh in small ranges instead of one big one. This new option will be used by the OSM team for the situation where tiered reads were disabled, then enabled, and they want to materialize the data.

> But it will only work for one cagg, so you'd have to manually force refresh all caggs that are connected to a tiered hypertable. If you forget one or more, you are now actually in a bad state where some caggs aren't accurately representing the data even if you refresh normally. In contrast, adding the invalidation will add the proper state to remember that this region is no longer valid and needs a refresh. This is what invalidations are for. But now we are saying we cannot use invalidations because they don't work well. So, now we use them sometimes, but in other times we don't. To me that sounds like a system where you cannot fully trust that a refresh is actually a refresh, which might lead to users always forcing a refresh just to be sure.

> Anyway, I am just pointing these things out to make sure we thought this through, but if you are convinced that relying on the user forcing a refresh is the best option right now, then I trust you that there is no better option.

Your points are valid, and many thanks for jumping into the discussion, but this option is not for a wide audience; it is for ourselves, to use in some custom jobs that execute the refresh in an incremental fashion. We don't want to rely on users to make sure that their data is in the correct state.

Anyway, all of this is a short-term solution. Starting in Q1 2025, Mats and I will start working on a new architecture for refreshing caggs to address all the write amplification we have nowadays, but it is not a small project.
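For illustration, the kind of incremental forced refresh such a custom job would run might look like the sketch below (the window size and cagg name are assumptions; each call covers one small range, following the test above):

```sql
-- Sketch only: refresh the previously materialized (tiered) region in small,
-- consecutive windows instead of one large call, forcing re-materialization.
CALL refresh_continuous_aggregate('daily_temp', '2020-05-01 00:00 UTC', '2020-05-02 00:00 UTC', force => true);
CALL refresh_continuous_aggregate('daily_temp', '2020-05-02 00:00 UTC', '2020-05-03 00:00 UTC', force => true);
CALL refresh_continuous_aggregate('daily_temp', '2020-05-03 00:00 UTC', '2020-05-04 00:00 UTC', force => true);
-- ...continue across the rest of the tiered region.
```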

Continuous aggregates that are built on top of tiered hypertables cannot
access and materialize tiered data when the `timescaledb.enable_tiered_reads`
GUC is disabled at the server level, and there is no way for a user to
manually force a refresh with tiered reads enabled. Here we add an
optional `force` parameter to the `refresh_continuous_aggregate`
procedure that allows the user to partially re-materialize a cagg within
a time window that has been previously materialized.
@zilder force-pushed the zilder/force_refresh branch from bde8617 to 40f1986 on December 16, 2024 21:35
@fabriziomello added this to the TimescaleDB 2.18.0 milestone on Dec 18, 2024