[Bug]: Compression Policy Stalling After Upgrade to TimescaleDB 2.16.2 #7502

Open

rdmolony opened this issue Nov 29, 2024 · 7 comments

rdmolony commented Nov 29, 2024

What type of bug is this?

Crash

What subsystems and features are affected?

Compression

What happened?

TL;DR:

My database's compression policy job (CALL _timescaledb_functions.policy_compression()) has not run successfully since I upgraded from version 2.15.3 to 2.16.2, approximately four months ago. It appears to get stuck on specific chunks during compression.

The issue has been resolved by running a SQL query to manually decompress/compress all hypertable chunks - see the SQL query at the bottom for more info.


Recently, I noticed that the periodic job CALL _timescaledb_functions.policy_compression() was running indefinitely.

So I looked at this job's stats via ...

select * from timescaledb_information.job_stats
job_id            | 1014
application_name  | Compression Policy [1014]
schedule_interval | 3 days 12:00:00
max_runtime       | 00:00:00
max_retries       | -1
retry_period      | 01:00:00
proc_schema       | _timescaledb_functions
proc_name         | policy_compression
owner             | postgres
scheduled         | true
fixed_schedule    | false
config            | {"hypertable_id": 61, "compress_after": "7 days"}
next_start        | -infinity
initial_start     |
hypertable_schema | public
hypertable_name   | sensor_reading_generic
check_schema      | _timescaledb_functions
check_name        | policy_compression_check

... and saw that the last time it ran successfully was 4 months ago.
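A narrower query against the same view shows this directly (the column names below are from timescaledb_information.job_stats) ...

select job_id, last_run_status, last_successful_finish, total_failures
  from timescaledb_information.job_stats
 where job_id = 1014;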

This timing corresponds with an upgrade of TimescaleDB from version 2.15.3 to 2.16.2.
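For anyone double-checking the same thing, the installed extension version can be confirmed with a standard catalog query ...

select extversion from pg_extension where extname = 'timescaledb';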

I noticed that the compression policy query CALL _timescaledb_functions.policy_compression() was getting stuck on one particular hypertable chunk by running ...

with query as (
  select pid
    from pg_stat_activity
   where query = 'CALL _timescaledb_functions.policy_compression()'
)
select
    l.relation::regclass as locked_relation,
    (c.chunk_schema || '.' || c.chunk_name)::regclass as locked_chunk,
    c.range_start,
    l.mode,
    l.locktype,
    a.query,
    a.state,
    a.wait_event_type,
    a.wait_event
from
    pg_stat_activity a
    join pg_locks l on l.pid = a.pid
    join query q on q.pid = l.pid
    left join timescaledb_information.chunks c
        on (c.chunk_schema || '.' || c.chunk_name)::regclass = l.relation::regclass

... while it was executing.

Note - the compression policy holds a lock on each hypertable it is compressing. As long as it is stuck on a specific hypertable, backfilling into that hypertable (i.e. importing historic readings) doesn't work either, since the INSERTs are blocked by this lock.

This hasn't been a major issue since backfilled historic data is mostly non-critical.
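For anyone hitting the same lock, a quick way to confirm which session is blocking the backfill INSERTs is pg_blocking_pids() (plain PostgreSQL, nothing Timescale-specific) ...

select pid, pg_blocking_pids(pid) as blocked_by, wait_event_type, wait_event, query
  from pg_stat_activity
 where cardinality(pg_blocking_pids(pid)) > 0;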

I found a similar issue (thanks to ChatGPT 4o) that suggested manually compressing the chunk ...

Timescale Community Forum: In a forum post, a user described that all their compression policy jobs had been failing for approximately six weeks. The failures started after upgrading the Timescale extension. Manually running the jobs succeeded, but scheduled executions failed. The user sought advice on obtaining more detailed error information to diagnose the problem. 

So I tried this via ...

select compress_chunk('_timescaledb_internal.<chunk_name>')

... and found that the query ran indefinitely.

So (dumbly) I tried decompressing the chunk manually via ...

select decompress_chunk('_timescaledb_internal.<chunk_name>')

... and it eventually decompressed after 40 minutes! Then I recompressed it, which this time took less than a minute!

I don't quite understand what went wrong or how this fixed it. ChatGPT 4o suggested that the decompress/compress cycle resolved metadata inconsistencies and bloat, but my understanding isn't good enough to dig into that.

So I re-ran the compression policy, and this time it got stuck on another chunk.
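For reference, the policy can be triggered on demand with run_job, passing the job_id from the jobs output above (1014 in my case) ...

CALL run_job(1014);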

So now I'm running decompress/compress on all chunks via ...

DO $$
DECLARE
    chunk RECORD;
BEGIN
    FOR chunk IN
        SELECT chunk_schema, chunk_name
        FROM timescaledb_information.chunks
        WHERE is_compressed = true
    LOOP
        -- Step 1: Decompress the chunk
        RAISE NOTICE 'Decompressing chunk: %.%', chunk.chunk_schema, chunk.chunk_name;
        EXECUTE format('SELECT decompress_chunk(''%I.%I'')', chunk.chunk_schema, chunk.chunk_name);

        -- Step 2: Compress the chunk
        RAISE NOTICE 'Compressing chunk: %.%', chunk.chunk_schema, chunk.chunk_name;
        EXECUTE format('SELECT compress_chunk(''%I.%I'')', chunk.chunk_schema, chunk.chunk_name);
    END LOOP;
END $$;

Fingers crossed this resolves it

TimescaleDB version affected

2.16.2

PostgreSQL version used

PostgreSQL 14.13, compiled by Visual C++ build 1940, 64-bit

What operating system did you use?

Windows Server 2019 Datacenter 17763.6532

What installation method did you use?

Other

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

This might be hard to reproduce other than by upgrading from version 2.15.3 to 2.16.2 on a Windows server (if I've understood the issue correctly).

rdmolony added the bug label Nov 29, 2024

rdmolony commented Nov 29, 2024

Also note that (if I'm not mistaken) I should be able to set a timeout on the compression policy like ...

SELECT add_compression_policy('<hypertable-name>', INTERVAL '1 week');
WITH job AS (
    SELECT job_id
    FROM timescaledb_information.jobs
    WHERE hypertable_name = '<hypertable-name>'
        AND proc_name = 'policy_compression'
)
SELECT alter_job(
    job_id => job.job_id,
    max_runtime => INTERVAL '1 hour',
    max_retries => 3,
    retry_period => INTERVAL '10 minutes'
)
FROM job;

rdmolony commented Dec 2, 2024

I can confirm that, after running the decompress/compress SQL query above, the compression jobs are now running again without any issues.

erimatnor (Contributor) commented

@rdmolony Thank you for reporting this issue. Do you think this issue is specific to Windows?

rdmolony commented Dec 3, 2024

Thanks @erimatnor, it could well be; I can't be sure.

normalnormie commented

I was also stuck with "CALL _timescaledb_functions.policy_compression()".
I installed TimescaleDB a few days ago and had no issues until today.
TimescaleDB version affected:
timescaledb-2-postgresql-14 (2.17.2~ubuntu22.04)
PostgreSQL version used:
(PostgreSQL) 14.12 (Ubuntu 14.12-0ubuntu0.22.04.1)
How can we reproduce the bug?
I followed https://docs.timescale.com/self-hosted/latest/install/installation-linux/ and inserted data for a few days. The specific chunk that freezes policy_compression belongs to the biggest table and contains 57 million rows. I also hit some out-of-disk-space errors in the past few days, which were resolved before policy_compression started.

I used @rdmolony's script but it got stuck on "SELECT decompress_chunk('_timescaledb_internal._hyper_25_27_chunk')" with a "Lock: relation" wait. Following https://docs.timescale.com/use-timescale/latest/compression/compression-policy/, I paused the compression policy first by finding its job_id and disabling it; the script then finished, and I re-enabled the policy afterwards (sketch below).
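For anyone doing the same, pausing and re-enabling the job is just alter_job with scheduled => false/true (use your own job_id from timescaledb_information.jobs; <job_id> is a placeholder) ...

-- pause the compression policy while the manual decompress/compress runs
SELECT alter_job(<job_id>, scheduled => false);

-- ... run the decompress/compress script ...

-- re-enable the policy afterwards
SELECT alter_job(<job_id>, scheduled => true);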

normalnormie commented

An update: I hit the same problem again, but this time I decided to wait a few days to see if it would resolve on its own, and it did. My app was stopped for a couple of hours, so it was probably new inserts that were blocking the compression job.

antekresic (Contributor) commented

Unfortunately, there are instances where the compression policy can take a significant amount of time to re-compress a chunk, which is what I think people are hitting here. This is one of the things we are actively working on improving. One of the improvements is landing in January: #7482

Meanwhile, decompressing and compressing these problematic chunks manually is your best bet.

There are plans to work on the lock contention here as well, which can be caused by the compression policy and inserts into a chunk that is about to be compressed. In the meantime, I suggest adjusting the policy to only run on chunks that no longer have data inserted into them. For instance, if you are inserting data over the last 3 days, the compression policy should only compress chunks older than that. I realize this cannot be done for all workloads, which is why we plan on addressing this in the near future.
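For example, something along these lines, with the compress_after interval adjusted so it is larger than the window you still insert into (the interval below is illustrative) ...

-- drop the current policy and re-create it so that only chunks
-- older than the active insert window get compressed
SELECT remove_compression_policy('<hypertable-name>');
SELECT add_compression_policy('<hypertable-name>', compress_after => INTERVAL '7 days');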

Hope this makes sense.
