Get instance id for desired control-queue(s) #1069

Open
wants to merge 11 commits into
base: main
@pasaini-microsoft pasaini-microsoft commented Apr 19, 2024

Motivation

#1079

Issue: No way of targeting an orchestrator instance to a desired control-queue.

  • We have been seeing DTF orchestrations get stuck at random. Because customer load in our service is irregular, it was hard to tell upfront whether a given orchestration would be processed or would get stuck.
  • More often than not, customers reached out with incidents complaining that their requests had not completed for a long time.
  • This is why we need a way to observe the health of each queue by targeting one orchestration instance at each desired control-queue.

Motivation:

  • The motivation was to reduce the TTD for finding out whether an orchestration is stuck or waiting forever in a control-queue, irrespective of the cause.

Issue: No way to direct load to lightly loaded control-queues.

  • We have faced a few situations where some control-queues were overwhelmed with orchestration instances while others were processing almost nothing.

Motivation:

  • The motivation was to target new orchestration instances at a set of control-queues that are not heavily loaded.

Proposal

API to generate instance id for a set of control-queues.

  • This API receives a set of control-queues and a prefix for the instance id.
  • Implementation detail: allow a special instance id format with an unsigned integer suffix after the delimiter '!', and explicitly use that value to route to a control-queue (say, suffixNumber % partitionCount). If this pattern is not used, routing falls back to the default (current) behavior, which is hash(instance-id) % partition-count.
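
As a rough illustration of the proposal above, the following sketch generates an instance id for one of a set of desired control-queue numbers and then resolves the partition from that id. The names `InstanceIdSketch`, `CreateInstanceId`, `GetPartitionIndex`, and `FallbackHash` are hypothetical and are not taken from this PR. In particular, the FNV-1a hash is only a stand-in for whatever hash the service actually uses, and picking the queue at random is an assumption.

```csharp
using System;
using System.Globalization;

public static class InstanceIdSketch
{
    // Hypothetical helper: builds "<prefix>!<n>", where n is one of the desired queue numbers.
    public static string CreateInstanceId(string prefix, int[] desiredQueues, Random random)
    {
        if (desiredQueues == null || desiredQueues.Length == 0)
        {
            throw new ArgumentException("At least one control-queue number is required.", nameof(desiredQueues));
        }

        int queue = desiredQueues[random.Next(desiredQueues.Length)];
        if (queue < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(desiredQueues), "Queue numbers must be non-negative.");
        }

        return prefix + "!" + queue.ToString(CultureInfo.InvariantCulture);
    }

    // Resolves the partition: explicit suffix when enabled and parseable, otherwise a hash of the id.
    public static int GetPartitionIndex(string instanceId, int partitionCount, bool explicitPlacementEnabled)
    {
        if (partitionCount <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(partitionCount));
        }

        if (explicitPlacementEnabled)
        {
            int separator = instanceId.LastIndexOf('!');

            // NumberStyles.None accepts digits only (no sign, no whitespace).
            if (separator >= 0 &&
                uint.TryParse(instanceId.Substring(separator + 1), NumberStyles.None,
                              CultureInfo.InvariantCulture, out uint suffix))
            {
                return (int)(suffix % (uint)partitionCount);
            }
        }

        return (int)(FallbackHash(instanceId) % (uint)partitionCount);
    }

    // Placeholder FNV-1a hash over UTF-16 code units; NOT the service's real hash function.
    private static uint FallbackHash(string value)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in value)
            {
                hash ^= c;
                hash *= 16777619;
            }

            return hash;
        }
    }
}
```

With four partitions, an id such as `order!7` would resolve to partition 3 under this sketch, while an id without a numeric suffix would take the hash path.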

Collaborator

@davidmrdavid davidmrdavid left a comment

Some thoughts

Comment on lines 62 to 69
controlQueueNumberToNameMap = new Dictionary<string, int>();

for (int i = 0; i < partitionCount; i++)
{
    var controlQueueName = AzureStorageOrchestrationService.GetControlQueueName(settings.TaskHubName, i);
    controlQueueNumberToNameMap[controlQueueName] = i;
}
}
Collaborator

are we still using this in the new tests? No, right?

Collaborator

@davidmrdavid davidmrdavid left a comment

Have we tested this for external events as well?

Comment on lines 322 to 326
/// <summary>
/// Whether to allow instanceIDs to use special syntax to land on a specific partition.
/// If enabled, when an instanceID ends with suffix '!nnn', where 'nnn' is an unsigned number, the instance will land on the partition/queue for to that number.
/// </summary>
public bool EnableExplicitPartitionPlacement { get; set; } = false;
Collaborator

Something to consider - it is not safe to change this from false to true (or vice-versa) while an orchestrator with the special syntax is in-flight. If we do that, any pre-existing messages for that orchestrator may now be considered to be "in the wrong queue".

Let's call this out in the intellisense
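
To make the hazard concrete, here is a minimal sketch of how the same instance id can resolve to different queues depending on the flag. `FlagFlipSketch`, `Resolve`, and `TargetQueueChangesOnFlip` are hypothetical names, and the character-sum hash is only a placeholder for the library's real routing hash.

```csharp
using System;
using System.Globalization;

public static class FlagFlipSketch
{
    // Hypothetical resolver: explicit placement when the flag is on and the suffix parses,
    // otherwise a placeholder hash of the whole instance id.
    public static int Resolve(string instanceId, int partitionCount, bool enableExplicitPartitionPlacement)
    {
        if (partitionCount <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(partitionCount));
        }

        int separator = instanceId.LastIndexOf('!');
        if (enableExplicitPartitionPlacement
            && separator >= 0
            && uint.TryParse(instanceId.Substring(separator + 1), NumberStyles.None,
                             CultureInfo.InvariantCulture, out uint suffix))
        {
            return (int)(suffix % (uint)partitionCount);
        }

        // Placeholder hash (sum of UTF-16 code units); NOT the library's real hash.
        uint sum = 0;
        foreach (char c in instanceId)
        {
            sum += c;
        }

        return (int)(sum % (uint)partitionCount);
    }

    // True when toggling the flag would change the queue this instance id maps to.
    public static bool TargetQueueChangesOnFlip(string instanceId, int partitionCount)
    {
        return Resolve(instanceId, partitionCount, true) != Resolve(instanceId, partitionCount, false);
    }
}
```

If `TargetQueueChangesOnFlip` returns true for an in-flight instance, messages enqueued under the old setting would sit in a different queue from the one the new setting expects.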


int placementSeparatorPosition = instanceId.LastIndexOf('!');

// if the instance id ends with !nnn, where nnn is an unsigned number, it indicates explicit partition placement
Collaborator

Can we add a test that documents the behavior if the customer uses an instanceID with multiple ! in it? Say, instanceID "A!1!B!3" should probably map to partition "3", right?

Collaborator

Let's also add a test that checks that instanceID myinstanceID!NotANumber does not trigger any errors / that it correctly ignores the explicit placement logic.
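
The edge cases raised in this thread could be pinned down with checks along these lines. This is only a sketch: `SuffixParsingSketch`, `TryGetExplicitPartition`, `RunChecks`, and `Check` are hypothetical names, and the parsing mirrors the `LastIndexOf('!')` approach quoted above rather than the PR's actual implementation or test framework.

```csharp
using System;
using System.Globalization;

public static class SuffixParsingSketch
{
    // Hypothetical helper: returns true only when the text after the last '!' is an unsigned number.
    public static bool TryGetExplicitPartition(string instanceId, out uint partitionNumber)
    {
        partitionNumber = 0;
        int separator = instanceId.LastIndexOf('!');
        if (separator < 0)
        {
            return false;
        }

        string suffix = instanceId.Substring(separator + 1);

        // NumberStyles.None accepts digits only, so "NotANumber" and "" both fail without throwing.
        return uint.TryParse(suffix, NumberStyles.None, CultureInfo.InvariantCulture, out partitionNumber);
    }

    // Plain checks standing in for unit tests; throws on the first failed expectation.
    public static void RunChecks()
    {
        Check(TryGetExplicitPartition("A!1!B!3", out uint p) && p == 3, "only the last suffix is used");
        Check(!TryGetExplicitPartition("myinstanceID!NotANumber", out _), "non-numeric suffix is ignored");
        Check(!TryGetExplicitPartition("myinstanceID!", out _), "empty suffix is ignored");
        Check(!TryGetExplicitPartition("myinstanceID", out _), "id without '!' is ignored");
    }

    private static void Check(bool condition, string description)
    {
        if (!condition)
        {
            throw new InvalidOperationException("Check failed: " + description);
        }
    }
}
```

When the helper returns false, the caller would be expected to fall back to the default hash-based routing instead of raising an error.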

Collaborator

Thank you for adding the first test, I think we're just missing the very last one:

Let's also add a test that checks that instanceID myinstanceID!NotANumber does not trigger any errors / that it correctly ignores the explicit placement logic.

Collaborator

@davidmrdavid davidmrdavid left a comment

This looks almost ready to me, but please note there are some outstanding suggestions (and a missing test request from my last review).

FYI @gillum, whose approval we should seek before merging this.


Comment on lines 322 to 326
/// <summary>
/// Whether to allow instanceIDs to use special syntax to land on a specific partition.
/// If enabled, when an instanceID ends with suffix '!nnn', where 'nnn' is an unsigned number, the instance will land on the partition/queue for to that number.
/// ** DO NOT CHANGE THIS FLAG FOR PRE-EXISTING MESSAGES AS IT MAY BE CONSIDERED IN THE WRONG QUEUE **
/// </summary>
Collaborator

nit - I believe remarks is the right docs tag for this sort of thing. Let's also not use all-caps for comments. I realize it has a useful effect, but it's not our usual style for customer-facing docs.

Suggested change
/// <summary>
/// Whether to allow instanceIDs to use special syntax to land on a specific partition.
/// If enabled, when an instanceID ends with suffix '!nnn', where 'nnn' is an unsigned number, the instance will land on the partition/queue for to that number.
/// ** DO NOT CHANGE THIS FLAG FOR PRE-EXISTING MESSAGES AS IT MAY BE CONSIDERED IN THE WRONG QUEUE **
/// </summary>
/// <summary>
/// Whether to allow instanceIDs to use special syntax to land on a specific partition.
/// If enabled, when an instanceID ends with suffix '!nnn', where 'nnn' is an unsigned number, the instance will land on the partition/queue for to that number.
/// ** DO NOT CHANGE THIS FLAG FOR PRE-EXISTING MESSAGES AS IT MAY BE CONSIDERED IN THE WRONG QUEUE **
/// </summary>
/// <remarks>
/// It is not generally safe to change to this flag for pre-existing TaskHubs, as it may change the expected target queue for an instanceID.
/// </remarks>

Collaborator

@davidmrdavid davidmrdavid left a comment

FYI - I'll take the liberty of deleting the comment about adding a test at the beginning of TaskHubClient, since this is public documentation and it does not belong there. After that, I'll leave an 'LGTM'.

src/DurableTask.Core/TaskHubClient.cs Outdated
@@ -707,6 +707,7 @@ void CreateAndTrackDependencyTelemetry(TraceContextBase requestTraceContext)
}

/// <summary>
/// Add test for this.
Collaborator

Suggested change
/// Add test for this.

Collaborator

@davidmrdavid davidmrdavid left a comment

LGTM. Please get an approval from a current repo owner (@cgillum or @jviau, for example) before merging. This seems good to me, and any remaining tests we could easily add on your behalf.


4 participants