We run Orleans in Docker Swarm. When a datacenter fails, Swarm moves the silos to the other datacenter. The Redis instance in the failed datacenter goes down as well, while the second instance in the Redis cluster keeps working.
Problems that occur: 1. The Redis membership implementation does not detect the change in Redis; it keeps trying to connect to the downed instance and times out.
2. Newly started silo instances do connect to Redis, but because they see active silos in the membership table, they try to ping them and fail: some silos are marked active but are actually dead, and the old instances cannot update their entries because they have no access to the membership table. The new silos then declare themselves dead. We tried setting IAmAliveTablePublishTimeout to 1 minute. Our expectation was that within 2 minutes (allowing for retries) the new silos would still start up and kill the inaccessible silos. For example, there are 9 silos in total; on datacenter shutdown 4 silos died and are restarted by Swarm in the other datacenter. Those 4 silos should start within 2 minutes and kill the remaining 5 silos.
In reality nothing starts up and the whole cluster is down, with the following messages:
warn: Orleans.Runtime.Metadata.ClusterManifestProvider[0]
Error retrieving silo manifest for silo S10.224.1.52:11111:90056967
System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'IServiceProvider'.
at Microsoft.Extensions.DependencyInjection.ServiceLookup.ThrowHelper.ThrowObjectDisposedException()
at Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope.GetService(Type serviceType)
at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetRequiredService(IServiceProvider provider, Type serviceType)
at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetRequiredService[T](IServiceProvider provider)
at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 151
Noticed that silo S10.224.0.229:11111:90056367 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:39:38, now is 11/08/2024 07:58:08, no update for 00:18:30.2544584, which is more than 00:01:20.
Noticed that silo S10.224.0.231:11111:90056456 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:41:07, now is 11/08/2024 07:58:08, no update for 00:17:01.1361120, which is more than 00:01:20.
Noticed that silo S10.224.0.245:11111:90056587 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:43:57, now is 11/08/2024 07:58:08, no update for 00:14:10.6124054, which is more than 00:01:20.
Noticed that silo S10.224.0.247:11111:90056616 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:43:51, now is 11/08/2024 07:58:08, no update for 00:14:16.8296069, which is more than 00:01:20.
Noticed that silo S10.224.0.228:11111:90056363 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:39:34, now is 11/08/2024 07:58:08, no update for 00:18:33.7728804, which is more than 00:01:20.
Noticed that silo S10.224.0.232:11111:90056493 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:42:28, now is 11/08/2024 07:58:08, no update for 00:15:40.0318086, which is more than 00:01:20.
Noticed that silo S10.224.1.2:11111:90056742 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:45:53, now is 11/08/2024 07:58:08, no update for 00:12:15.2249331, which is more than 00:01:20.
Noticed that silo S10.224.1.52:11111:90056967 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:49:31, now is 11/08/2024 07:58:08, no update for 00:08:36.9655455, which is more than 00:01:20.
Noticed that silo S10.224.0.229:11111:90056367 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:39:38, now is 11/08/2024 07:58:08, no update for 00:18:30.2683603, which is more than 00:01:20.
Noticed that silo S10.224.0.231:11111:90056456 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:41:07, now is 11/08/2024 07:58:08, no update for 00:17:01.1500076, which is more than 00:01:20.
Noticed that silo S10.224.0.245:11111:90056587 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:43:57, now is 11/08/2024 07:58:08, no update for 00:14:10.6263153, which is more than 00:01:20.
Noticed that silo S10.224.0.247:11111:90056616 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:43:51, now is 11/08/2024 07:58:08, no update for 00:14:16.8435249, which is more than 00:01:20.
Noticed that silo S10.224.0.228:11111:90056363 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:39:34, now is 11/08/2024 07:58:08, no update for 00:18:33.7867953, which is more than 00:01:20.
Noticed that silo S10.224.0.232:11111:90056493 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:42:28, now is 11/08/2024 07:58:08, no update for 00:15:40.0457486, which is more than 00:01:20.
Noticed that silo S10.224.1.2:11111:90056742 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:45:53, now is 11/08/2024 07:58:08, no update for 00:12:15.2388805, which is more than 00:01:20.
Noticed that silo S10.224.1.52:11111:90056967 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:49:31, now is 11/08/2024 07:58:08, no update for 00:08:36.9794955, which is more than 00:01:20.
warn: Orleans.Runtime.Silo[100418]
Silo shutdown completed (non-graceful)!
Unhandled exception. Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.224.0.229:11111:90056367, will retry after 953.288ms
at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99
at Orleans.Runtime.Messaging.MessageCenter.g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 236
at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 90
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 739
at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29
at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 368
at Orleans.Runtime.Messaging.MessageCenter.g__SendMessageAsync|40_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 487
at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync(GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 98
at Program.<>c.<<<Main>$>b__0_2>d.MoveNext() in /home/vsts/work/1/s/Slots/Ardados.Slots.SiloHost/Program.cs:line 38
--- End of stack trace from previous location ---
at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct) in /_/src/Orleans.Runtime/Lifecycle/SiloLifecycleSubject.cs:line 134
at Orleans.LifecycleSubject.OnStart(CancellationToken cancellationToken) in /_/src/Orleans.Core/Lifecycle/LifecycleSubject.cs:line 118
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute() in /_/src/Orleans.Runtime/Scheduler/ClosureWorkItem.cs:line 33
at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Silo/Silo.cs:line 192
at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Hosting/SiloHostedService.cs:line 28
at Microsoft.Extensions.Hosting.Internal.Host.b__15_1(IHostedService service, CancellationToken token)
at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Program.<Main>$(String[] args) in /home/vsts/work/1/s/Slots/Ardados.Slots.SiloHost/Program.cs:line 44
at Program.<Main>(String[] args)
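To make the staleness warnings in the log above easier to compare, they can be tabulated with a short script. The following is a minimal Python sketch, not part of Orleans: the regular expression is written against the exact message text shown above, the helper names (`parse_span`, `stale_silos`) are hypothetical, and the date is assumed to be month/day/year (only the time of day affects the result here, since both timestamps share a date).

```python
import re
from datetime import datetime, timedelta

# Matches the "has not updated it's IAmAliveTime" warning as it appears in the log above.
STALE_RE = re.compile(
    r"Noticed that silo (?P<silo>S[\d.]+:\d+:\d+) has not updated .*?"
    r"Last update was at (?P<last>[\d/]+ [\d:]+), "
    r"now is (?P<now>[\d/]+ [\d:]+), "
    r"no update for (?P<gap>[\d:.]+), "
    r"which is more than (?P<limit>[\d:.]+)\.$"
)

# Assumption: month/day/year ordering; the difference is the same either way
# because both timestamps on a line carry the same date.
TS_FORMAT = "%m/%d/%Y %H:%M:%S"


def parse_span(text):
    """Parse a .NET TimeSpan like 00:18:30.2544584 (no day component) into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def stale_silos(log_text):
    """Return {silo: (seconds since last update, allowed seconds)}, keeping the last line seen per silo."""
    result = {}
    for line in log_text.splitlines():
        match = STALE_RE.search(line.strip())
        if not match:
            continue
        last = datetime.strptime(match["last"], TS_FORMAT)
        now = datetime.strptime(match["now"], TS_FORMAT)
        allowed = parse_span(match["limit"]).total_seconds()
        result[match["silo"]] = ((now - last).total_seconds(), allowed)
    return result


sample = """\
Noticed that silo S10.224.0.229:11111:90056367 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:39:38, now is 11/08/2024 07:58:08, no update for 00:18:30.2544584, which is more than 00:01:20.
Noticed that silo S10.224.1.52:11111:90056967 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:49:31, now is 11/08/2024 07:58:08, no update for 00:08:36.9655455, which is more than 00:01:20.
"""

# Print silos from freshest to stalest entry.
for silo, (age, allowed) in sorted(stale_silos(sample).items(), key=lambda kv: kv[1][0]):
    print(f"{silo}: stale for {age:.0f}s (allowed {allowed:.0f}s)")
```

Run against the full log, this lists all eight silo entries as stale. Even the most recent one (S10.224.1.52, last updated 07:49:31) is more than eight minutes past its last update, so at 07:58:08 the starting silo saw no entry in the table that had been refreshed within the allowed period.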
In the Redis membership table, the number of newly inserted silo records grows constantly.
I guess silos start up, write their state into the table, then try to check the other silos, see that not all of them are pingable, and kill themselves. This is a never-ending loop.
What is wrong with our assumption, and why do we receive this error on startup?
From my investigation, I also see that when the silo wants to report an update to the membership table, the Redis connection seems to be disposed while the silo is still active.