Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to distinguish Temporary and Permanent Exceptions in SQS Batch Processing #21

Open
aagrawal207 opened this issue Jun 14, 2021 · 13 comments
Labels
all_runtimes Changes that should be applied to all runtimes Batch Batch processing utility Proposed Community submited

Comments

@aagrawal207
Copy link

Is your feature request related to a problem? Please describe.
While processing SQS messages in batch, I want to distinguish between Temporary and Permanent Exceptions so that in case of temporary failures, message return to the main queue and in case of permanent failure, it moves to either DLQ or deleted completely as per the requirement.

Describe the solution you'd like
Simple conversion of exception thrown from SqsMessageHandler implementation from their respective exception to TemporaryFailureException or PermanentFailureException with some defined logic. Then handler logic defined as well for those exceptions.

Describe alternatives you've considered
Writing my own alternative exception handler which I am still figuring out.

Additional context
This seems to me a fairly common use-case and should be a well-understood problem.

@pankajagrawal16
Copy link

Hi @as2d3 Thanks for opening the feature request. I am wondering what would the UX be like? If I understand correctly, Intention here is to predict which of the exceptions thrown while processing an event is temporary or permanent?

Since API don't control what is the logic inside implementation of SqsMessageHandler, it's difficult to predict for the library which ones are permanent/temporary.

Could you please elaborate a bit on this? With some examples?

@aagrawal207
Copy link
Author

Hi @pankajagrawal16, currently SQS Batch processing throws only SQSBatchProcessingException. Basically, I am two new exceptions instead of this old one, something like SQSBatchProcessingTemporaryException and SQSBatchProcessingPermanentException.

So, currently I have implemented my own exceptions like these:

public class NotificationValidationException extends SQSBatchProcessingException {
    public NotificationValidationException(final String message, final Throwable cause) {
        super(message, cause);
    }
}

and

public class MyServiceRetryableException extends SQSBatchProcessingException {
    public NotificationValidationException(final String message, final Throwable cause) {
        super(message, cause);
    }
}

My SQS processor throws NotificationValidationException and MyServiceRetryableException instead of SQSBatchProcessingException. My batch processing Lambda Handler catches SQSBatchProcessingException like this:

public String handleRequest(final SQSEvent sqsEvent, final Context context) {
        try {
            LOGGER.debug("Processing SQS message batch with SQSEvent: {}", sqsEvent);
            awsLambdaPowertoolsSqsUtilsWrapper.batchProcessor(sqsEvent, notificationProcessor);
        } catch (SQSBatchProcessingException sqsBatchProcessingException) {
            final List<SQSEvent.SQSMessage> failures = sqsBatchProcessingException.getFailures();
            LOGGER.error("(Partial) batch of SQS messages failed: {}", failures);
            throw sqsBatchProcessingException;
        }

        return "{\"statusCode\": 200}";
}

The issue is that after the handler processes the batch, all the failed messages move back to main queue. Instead the logic could be like this:

public String handleRequest(final SQSEvent sqsEvent, final Context context) {
        try {
            LOGGER.debug("Processing SQS message batch with SQSEvent: {}", sqsEvent);
            awsLambdaPowertoolsSqsUtilsWrapper.batchProcessor(sqsEvent, notificationProcessor);
        } catch (SQSBatchProcessingTemporaryException sqsBatchProcessingTemporaryException) {
            final List<SQSEvent.SQSMessage> failures = sqsBatchProcessingTemporaryException.getFailures();
            LOGGER.error("(Partial) batch of SQS messages failed, will be moved back to main queue: {}", failures);
            throw sqsBatchProcessingTemporaryException;
        } catch (SQSBatchProcessingPermanentException sqsBatchProcessingPermanentException) {
            final List<SQSEvent.SQSMessage> failures = sqsBatchProcessingPermanentException.getFailures();
            LOGGER.error("(Partial) batch of SQS messages failed, will be moved to DLQ directly: {}", failures);
            throw sqsBatchProcessingPermanentException;
        }

        return "{\"statusCode\": 200}";
}

Here, the ask is basically, if the message schema validation type exception is thrown, then there is no reason to move it back to the main queue and process it again later. We would like to directly move these to DLQ or delete it.

@pankajagrawal16
Copy link

Thanks for the response @as2d3. I understand what is the intent of the feature request now.

That being said, it will be too much assumption on behalf of library to classify which of the errors/exceptions should be retried and which one should not be. That's why we provide generic implementation with SQSBatchProcessingException wrapping all the details of exception which is thrown by user provided implemenation of SqsMessageHandler.

So in your use case, the SQSBatchProcessingException still has all the underlying exception for you to decide what should be done with the SqsEvent next. It might be, you will like to move some to DLQ or want them to be retried.

I guess one way to support this use case better will be to accept from user which exception types are classified as "Move to DLQ exception" and then the utility can take care of moving those messages to DLQ instead of letting them being retried.

@aagrawal207
Copy link
Author

I just wanted to present an example, design could be something else. What I really want from the library is to provide an easy way to move the failed message directly to DLQ instead of moving it back to the main queue. I think this is a very useful use-case and (imho) is a must feature of an SQS consumer library.

Library does not need to define temporary and permanent exceptions I would say, but there should be a simple utility to move the message to DLQ or delete it completely instead which can be used in exception handling at implementation level.

@pankajagrawal16
Copy link

I just wanted to present an example, design could be something else. What I really want from the library is to provide an easy way to move the failed message directly to DLQ instead of moving it back to the main queue. I think this is a very useful use-case and (imho) is a must feature of an SQS consumer library.

Library does not need to define temporary and permanent exceptions I would say, but there should be a simple utility to move the message to DLQ or delete it completely instead which can be used in exception handling at implementation level.

Rite, I definitely agree with that. It will be an excellent feature to be supported as part of this utility.

@pankajagrawal16 pankajagrawal16 transferred this issue from aws-powertools/powertools-lambda-java Jul 6, 2021
@pankajagrawal16
Copy link

pankajagrawal16 commented Jul 6, 2021

@heitorlessa I brought this issue to roadmap now, Since i believe this might be an interesting use case to support in both python and java version as both has SQS batch processor utility.

Idea here is to provide an easy way where user can configure utility to mark certain failures while processing a SQS message in the batch to be moved to a DLQ.

Initial thought was maybe we could have api accept a list of user supplied exception, which when thrown while processing the message, utility can move those messages to DLQ instead.

Let me know your thoughts?

@heitorlessa
Copy link
Contributor

Just moved to backlog as we should do it. @cakepietoast and I discussed in the original RFC but didn't come around to do it.

That said, @pankajagrawal16 do you wanna create a RFC so we can discuss some implementation details? We can link to this issue, then tackle that until TS and C# work on their core implementations.

This will help @saragerion @sliedig @t1agob to chime in as Exceptions don't translate well in TS, for example.

@pankajagrawal16
Copy link

@heitorlessa Sounds good. I will do that sometime today. I have some ideas on how this can look from UX perspective in Java.

@heitorlessa
Copy link
Contributor

heitorlessa commented Jul 27, 2021 via email

@pankajagrawal16
Copy link

@heitorlessa Here you go #29

@heitorlessa heitorlessa added all_runtimes Changes that should be applied to all runtimes and removed Java Python labels Aug 11, 2021
@heitorlessa
Copy link
Contributor

cc @saragerion @cakepietoast as I added all_runtimes tag

@pankajagrawal16
Copy link

@aagrawal207
Copy link
Author

aagrawal207 commented Jan 10, 2022

Hi @pankajagrawal16,

Thanks for implementing the feature request so quickly. I tried the change recently and have few pointers:

  1. When an SQS message comes under the nonRetryableException list, the exception for it is consumed without any trace for it even when suppressException is false. I think there should be a way to know which of the messages moved to DLQ and why. I understand the intention is to safely exit the function if the messages are moved to DLQ or deleted but then we might want to give that data to upstream which is currently not possible.
  2. In Java implementation, use of generic and varargs combination (Class<? extends Exception>... nonRetryableExceptions) is generally not preferred. I understand that this might be done to keep methods backward compatible but a better way would have been to introduce a list of exceptions instead.
  3. The list of arguments and return type might keep increasing as more and more feature requests like this will come in future. Instead of including more and more arguments, preferred way would be to introduce a single request and response object and put the attributes in that. Currently, there are too many boolean requests which make the method very ugly and difficult to use. Also, if you see the new 2.0 version AwsJavaSdk, they also make a single request and response object instead of take all the attributes of the request individually.

My usecase: I am invoking my Lambda function directly from my integration tests for both positive happy paths and negative paths as well. For the negative paths, I configured the messages to be completely deleted which is happening as per the requirement. This issue I am facing is that the exception for which the message is deleted is not reported back to the user, and so I am unable to validate the same in my integration tests. The only way the user can validate it is by checking logs manually for the deleted message.

I would love to know your thoughts on this.

@heitorlessa heitorlessa removed enhancement New feature or request Proposed Community submited Batch Batch processing utility labels Apr 28, 2022
@heitorlessa heitorlessa added Proposed Community submited Batch Batch processing utility all_runtimes Changes that should be applied to all runtimes and removed all_runtimes Changes that should be applied to all runtimes labels Apr 28, 2022
@sthulb sthulb moved this from Triage to Backlog in Powertools for AWS Lambda (Python) Jun 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
all_runtimes Changes that should be applied to all runtimes Batch Batch processing utility Proposed Community submited
Projects
Status: Backlog
Development

No branches or pull requests

3 participants