Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TASK: Overhaul CatchUpHook error behaviour; At least once delivery for ERROR projections #5392

Conversation

mhsdesign
Copy link
Member

@mhsdesign mhsdesign commented Dec 3, 2024

Adjustments for #5321

This pr originated after discussing NOT do do central locking #5383 and to simplify the save point creation.

A full list of discussed and here implemented changes can be seen here:
#5321 (comment)

The new catchup hook error behaviour - implemented via this pr - is documented in #4908 (comment) with great detail.

Additional technical information


Introduce batching in subscription engine

The replay uses a batch size of 500 (see ContentRepositoryMaintainer)
The regular catchup does not batch to ensure the catchup is run completely and any errors don't stop it
Batching does not respect the causation identifier, meaning a parent node event might be handled but not its tethered node creation event if the batch stops there - but the current 9.0 state is not different.


Introduce onAfterBatchCompleted hook

Previously the CatchUpHookInterface contained a onBeforeBatchCompleted which was explained as follows:

This hook is called directly before the database lock is RELEASED

It can happen that this method is called multiple times, even without having seen Events in the meantime.

If there exist more events which need to be processed, the database lock is directly acquired again after it is released.

onAfterBatchCompleted behaves similar just that it runs when the lock was released. This allows us to run all the tasks that onAfterCatchUp could also do (using possibly transactions and commiting work)

We dont ensure that onAfterBatchCompleted is only called if there are more events to handle, as this would complicate the code and technically without acquiring a lock there is never a guarantee for that, so hooks might as well encounter the case more often and have to deal with that.


Behaviour for catchup hooks and the same connection

Introduce test to assert behaviour when catchup hooks use the persistence manager (CatchUpHookWithPersistenceTest)

While not pretty, we rely on this working also for onAfterEvent in the redirect handler:

see https://github.com/neos/redirecthandler-neosadapter/blob/8c8f0cc6f63a030201f01cf16a7f70079149314a/Classes/Service/NodeRedirectService.php#L167

Commit the transaction during the catchup hook is evil and will not be respected. In that case the projection will be ahead of its position and an doctrine error will be thrown as no transaction is there to be committed anymore. Only a full replay can solve this if done.

We discussed that it might be sensible to use a dedicated connection for the full content repository - though for that we need some kind of abstract connection concept first.


Upgrade instructions

Review instructions

Checklist

  • Code follows the PSR-2 coding style
  • Tests have been created, run and adjusted as needed
  • The PR is created against the lowest maintained branch
  • Reviewer - PR Title is brief but complete and starts with FEATURE|TASK|BUGFIX
  • Reviewer - The first section explains the change briefly for change-logs
  • Reviewer - Breaking Changes are marked with !!! and have upgrade-instructions

Replaces: neos#5376

We agreed on using the default timeout for `FOR UPDATE` locks in the database. This test only failed previously because of the `NOWAIT`.
The `NOWAIT` was only needed when we had a lot of catchup processes which always did a for update for EACH event which was not a good combination.
Now when getting the lock once the default wait of the database is more sane and easier to implement than fetching the lowest position and always waiting.
CASE: $this->dbal->isTransactionActive() === false
…rsisted

Test provided by 54b24b8

This partially reverts moving `$this->subscriptionStore->transactional` outside again from eb0d792, as we have to do a task afterwards still
This reverts commit dc5ff10.

We discussed that we dont want to ensure exactly once delivery for external projections. Adding transaction logic via traits for the projections increases logic there. And while with alot of effort we could bring exactly once delivery to work for most cases php could still decide to die in the small timeframe where we commit the one transaction and then the second. So there is no gurantee except when using a dedicated store: neos#5377
… not put the projection in ERROR state

... instead a full rollback is done (which might be still changed)

- calls "internals" of `ProjectionSubscriber` directly from the subscription engine
- inline `handleEvent` method
@mhsdesign mhsdesign changed the base branch from 9.0 to feature/4746-rework-catchup-mechanism December 3, 2024 15:30
@mhsdesign mhsdesign changed the base branch from feature/4746-rework-catchup-mechanism to feature/4746-rework-catchup-mechanism-3 December 3, 2024 15:30
updates bfb4655 to not do a full rollback if catchup hooks fail but continue and collect their errors, which will be rethrown later with the advantage that the catchup worked!
Previously the `CatchUpHookInterface` contained a `onBeforeBatchCompleted` which was explained as follows:

This hook is called directly before the database lock is RELEASED

It can happen that this method is called multiple times, even without
having seen Events in the meantime.

If there exist more events which need to be processed, the database lock
is directly acquired again after it is released.

----------

`onAfterBatchCompleted` behaves similar just that it runs when the lock was released. This allows us to run all the tasks that `onAfterCatchUp` could also do (using possibly transactions and commiting work)

We dont ensure that `onAfterBatchCompleted` is only called if there are more events to handle, as this would complicate the code and technically without acquiring a lock there is never a guarantee for that, so hooks might as well encounter the case more often and have to deal with that.
@mhsdesign
Copy link
Member Author

FYI, while we decided against exactly once delivery for external projections, one could still use a decorator like below.

That also allows the original test c47d181 to pass:

    /** @before */
    public function injectExternalFakeProjection(): void
    {
        $entityManager = $this->getObject(EntityManagerInterface::class);

        if (!isset(self::$secondConnection)) {
            self::$secondConnection = DriverManager::getConnection(
                $entityManager->getConnection()->getParams(),
                $entityManager->getConfiguration(),
                $entityManager->getEventManager()
            );
        }

        $this->secondFakeProjection = new DebugEventProjection(
            sprintf('cr_%s_debug_projection', self::$contentRepositoryId->value),
            self::$secondConnection
        );

        FakeProjectionFactory::setProjection(
            'second',
            ExternalDbalProjection::decorateForConnection(
                $this->secondFakeProjection,
                self::$secondConnection
            )
        );
    }
final readonly class ExternalDbalProjection implements ProjectionInterface
{
    public function __construct(
        private ProjectionInterface $decorated,
        private Connection $dbal,
    ) {
    }

    public static function decorateForConnection(
        ProjectionInterface $projection,
        Connection $dbal,
    ): self {
        return new self($projection, $dbal);
    }

    public function setUp(): void
    {
        $this->decorated->setUp();
    }

    public function status(): ProjectionStatus
    {
        return $this->decorated->status();
    }

    public function apply(EventInterface $event, EventEnvelope $eventEnvelope): void
    {
        if ($this->dbal->isTransactionActive()) {
            throw new \RuntimeException(sprintf('Expected no transaction to be active while handling decorated apply for %s', $this->decorated::class), 1733255315);
        }
        $this->dbal->transactional(fn () => $this->decorated->apply($event, $eventEnvelope));
    }

    public function getState(): ProjectionStateInterface
    {
        return $this->decorated->getState();
    }

    public function resetState(): void
    {
        $this->decorated->resetState();
    }
}

but i decided against putting that in the source code - not even for testing. Instead there is an ExternalProjectionErrorTest with the note

Note that this test only documents the current state, there were ideas to guarantee exactly once delivery for external projections
but this would mean additional complexity and is technically never truly possible with two connections.
The most likely path this might work fully is by the introduction of a decentralized subscription store concept,
see #5377, but this is out of scope for now until we find good reasons.

@mhsdesign mhsdesign marked this pull request as ready for review December 7, 2024 20:25
@mhsdesign mhsdesign changed the title TASK: Subscription engine savepoint simplification TASK: Overhaul CatchUpHook error behaviour and simplify subscription engine save-points Dec 9, 2024
@bwaidelich bwaidelich changed the title TASK: Overhaul CatchUpHook error behaviour and simplify subscription engine save-points TASK: Overhaul CatchUpHook error behaviour and simplify SubscriptionEngine save-points Dec 10, 2024
@mhsdesign mhsdesign mentioned this pull request Dec 10, 2024
37 tasks
Similarly to that doctrine migrations (`flow doctrine:migrate`) are also cannot be rollback in case of an error

So we cannot wrap projection->setUp in transaction and errors (like in a migration) will be directly commited. Thus, it must be acted with care.
see neos#5321 (comment)

the save-point will only be used for REAL projection errors now and never rolled back if catchup errors occur.

With that change in code the save-points are less important because a real projection error should better be thrown at the start before any statements, and even if some statements were issued and a full rollback is done its unlikely that a reactivateSubscription helps that case.

Instead, to repair projections you should replay
@mhsdesign mhsdesign changed the title TASK: Overhaul CatchUpHook error behaviour and simplify SubscriptionEngine save-points TASK: Overhaul CatchUpHook error behaviour; At least once delivery for ERROR projections Dec 11, 2024
@mhsdesign mhsdesign merged commit 6423dc6 into neos:feature/4746-rework-catchup-mechanism-3 Dec 11, 2024
2 checks passed
@mhsdesign mhsdesign deleted the task/subscription-engine-savepoint-simplication branch December 11, 2024 08:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant