Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple show jobs from CreateMviewProgressTracker #19189

Open
kwannoel opened this issue Oct 30, 2024 · 7 comments
Open

Decouple show jobs from CreateMviewProgressTracker #19189

kwannoel opened this issue Oct 30, 2024 · 7 comments
Assignees
Milestone

Comments

@kwannoel
Copy link
Contributor

We can decouple rw_ddl_progress from meta’s materialized view progress tracker, and maintain it adhoc:

  1. Make internal backfill state tables visible. They contain the backfilled row_count, and whether the backfill is finished or not.
    a. Query all the internal backfill state tables for an MV to fetch row_count and finished status.
    b. Query internal tables of the MV.
    c. Regex the name for backfill tables.
  2. Query the hummock version stats, so we can get the upstream row count.
  3. Calculate the estimated progress.

This will:
Simplify logic of materialized view progress tracker. We no longer need to maintain state for counts.
Allow us to track the backfill progress of created sink jobs.
In general, I think the same pattern can be applied to other forms of backfilling, snapshot backfill and shared source backfill.

@github-actions github-actions bot added this to the release-2.2 milestone Oct 30, 2024
@kwannoel
Copy link
Contributor Author

Implementation-wise, we need to consider querying permissions for the internal table.

@kwannoel kwannoel self-assigned this Oct 30, 2024
@kwannoel
Copy link
Contributor Author

This also has the added benefit where we can display finished jobs.

@kwannoel kwannoel modified the milestones: release-2.2, release-2.3 Dec 27, 2024
@kwannoel
Copy link
Contributor Author

kwannoel commented Jan 7, 2025

I've taken a look at implementing this. It can't really be implemented as a system catalog.

This is because the backfill internal state tables are dynamically queried. We can't statically construct a query to read from all backfilling tables. We can't query them via the meta node, nor can we query them via our batch interface (if it is constructed as a system catalog).

Instead, here's my idea:

  1. We have a handler for show jobs, so it does not call rw_ddl_progress under the hood.
  2. Inside this handler, there are two parts:
  3. First part:
    a. Fetch all backfilling relation ids from meta.
    b. Construct the batch query using this tuple (select row_count, is_finished from <backfill_internal_state_table>)
    c. Run the batch queries to get the progress stored in the state tables
  4. Second part:
    a. Just read the hummock metrics from the frontend for the tables being backfilled from.
  5. From these 2 parts, we can construct the overall progress for all backfilling relations.

We can also improve UX further, by introducing new syntax:
SHOW JOBS WHERE relation_name = ..., db_name = ..., schema_name = ...,
since we don't support general purpose querying.

Shortcomings:

  • There can be some inconsistency between hummock metrics and backfill progress state, since they are not atomically fetched. I think it is acceptable since we are just providing an estimate.

@BugenZhao
Copy link
Member

This is because the backfill internal state tables are dynamically queried.

Theoretically we can add a table function like table(id) to parameterize the tables to select, but I suppose it can be hard under current execution model. 🥵

  1. so it does not call rw_ddl_progress under the hood.

Then what will be it look like? Still querying the meta service for basic information?

@kwannoel
Copy link
Contributor Author

kwannoel commented Jan 8, 2025

This is because the backfill internal state tables are dynamically queried.

Theoretically we can add a table function like table(id) to parameterize the tables to select, but I suppose it can be hard under current execution model. 🥵

  1. so it does not call rw_ddl_progress under the hood.

Then what will be it look like? Still querying the meta service for basic information?

Oh I like the table function idea. I think it might be possible. Inside the binder we can query the catalog and resolve the table function to the actual backfill tables.

We can then add a separate system catalog which queries meta for basic info (specifically the hummock table stats).

Then we add another SQL view system catalog to query and join both of these.

In that way we don't need a special syntax.

@kwannoel
Copy link
Contributor Author

kwannoel commented Jan 10, 2025

I managed to make pretty good headway with this approach on the frontend side. I bound the table function and use an optimizer rule to rewrite it inside the optimizer, where we can construct the actual plan for it.

However, seems like the creating table catalog returned from the meta service is malformed. The reason is that vnode_count is not filled in, as of the time the meta node updates the catalog back to the FE service.

We need to update the table catalog back to FE, only after the catalog has this vnode_count field populated.

  • Can try convert vnode_count::placeholder -> 0.0 progress.

@kwannoel
Copy link
Contributor Author

Continue working on this after #17501

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants