[Bug] The client receives an application offline change notification, but channel.close is not triggered 100% of the time #14938

Open
4 tasks done
TerryLam2010 opened this issue Nov 25, 2024 · 7 comments
Labels
component/need-triage (Need maintainers to triage), type/need-triage (Need maintainers to triage)

Comments

@TerryLam2010

TerryLam2010 commented Nov 25, 2024

Pre-check

  • I am sure that all the content I provide is in English.

Search before asking

  • I had searched in the issues and found no similar issues.

Apache Dubbo Component

Java SDK (apache/dubbo)

Dubbo Version

Dubbo Java: 3.2.3
JDK8
Use K8s

Steps to reproduce this issue

Provider and Consumer config:
<dubbo:application name="${spring.application.name}" logger="slf4j" register-mode="all" metadata-type="remote" >
<dubbo:parameter key="qos.enable" value="false" />
</dubbo:application>

Other config properties:
dubbo.registry.parameters.register-consumer-url : true
dubbo.application.serialize-check-status : DISABLE
dubbo.application.enable-empty-protection : true
dubbo.provider.prefer-serialization : hessian2
dubbo.provider.serialization : hessian2

I upgraded from Dubbo 2.7.8 to 3.2.3 and found that when a provider goes offline, the consumer occasionally triggers a reconnect from org.apache.dubbo.remoting.exchange.support.header.ReconnectTimerTask (only once), because the provider itself closed the channel. From reading the code, when Nacos delivers the offline notification, the client should destroy the old DubboInvoker, which closes the channel and clears the ReconnectTimerTask's task.
In my case, the logs show that org.apache.dubbo.registry.client.ServiceDiscoveryRegistryDirectory was entered and the corresponding "destroy invoker" log was printed, but no "Close netty channel" log appears. So although the application offline notification arrived, the channel was not closed.
We are rolling out the upgrade gradually, so all the logs I provide are from the test environment, which runs 3.2.3 throughout. The production environment is mostly on 2.7.8, with only one application on 3.2.3.

What you expected to happen

I suspect a problem with the counter in org.apache.dubbo.rpc.protocol.dubbo.ReferenceCountExchangeClient that prevents the channel from being closed. Going online and offline works normally on my local machine; the problem only reproduces in the test and production environments, and the logs alone do not reveal the cause.

Anything else

2024-11-22T14:53:36,348+0800 [DUBBO] destroy invoker[DefaultServiceInstance{serviceName='order', host='172.29.89.45', port=11060, enabled=true, healthy=true, metadata={dubbo.endpoints=[{"port":11060,"protocol":"dubbo"}], dubbo.metadata.revision=e9aa0faa8181f81a787a5ce1362ce1c7, dubbo.metadata.storage-type=remote, timestamp=1732256698869}}, service{name='com.xxx.xxx.client.rpc.OrderRuleCheckRpcService',group='uat',version='null',protocol='dubbo',port='11060',params={side=provider, heartbeat=30000, release=3.2.3, methods=check,checkAndThrowEx, logger=slf4j, deprecated=false, dubbo=2.0.2, threads=20, interface=com.xxxx.xxxx.client.rpc.OrderRuleCheckRpcService, service-name-mapping=true, threadpool=cached, timeout=1000, generic=false, revision=1.0.0, serialize.check.status=DISABLE, serialization=hessian2, retries=0, metadata-type=remote, application=order, prefer.serialization=hessian2, dynamic=true, enable-empty-protection=true, group=uat},}] success. , dubbo version: 3.2.3, current host: 172.29.119.195

2024-11-22T14:53:36,348+0800 [DUBBO] 1 deprecated invokers deleted., dubbo version: 3.2.3, current host: 172.29.119.195

2024-11-22T14:53:36,349+0800 [DUBBO] serviceKey:uat/com.xxxx.xxxx.client.rpc.OrderRuleCheckRpcService Instance address size 2, interface address size 3, threshold 0.0, dubbo version: 3.2.3, current host: 172.29.119.195

2024-11-22T14:53:36,349+0800 [DUBBO] Received invokers changed event from registry. Registry type: instance. Service Key: uat/com.xxxx.xxxx.client.rpc.OrderRuleCheckRpcService. Urls Size : 2. Invokers Size : 2. Available Size: 2. Available Invokers : 172.29.92.255:11060,172.29.89.9:11060, dubbo version: 3.2.3, current host: 172.29.119.195

These logs were printed, but "Close netty channel" never appeared, which does not make sense.

Are you willing to submit a pull request to fix on your own?

  • Yes I am willing to submit a pull request on my own!

Code of Conduct

TerryLam2010 added the component/need-triage and type/need-triage labels on Nov 25, 2024
@laywin
Contributor

laywin commented Nov 26, 2024

Normally the consumer is also notified through RegistryDirectory, because this is dual registration (interface & instance). When both directories have finished destroying their invokers, the count of ReferenceCountExchangeClient drops to zero, so the client is destroyed.
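
To make the counting argument concrete, here is a minimal sketch of a reference-counted connection. SharedClient, retain and release are hypothetical names chosen for illustration; this is not Dubbo's ReferenceCountExchangeClient, only the same idea under the assumption that each directory's invoker holds one reference to a shared connection.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a shared, reference-counted connection.
// It is NOT Dubbo's ReferenceCountExchangeClient, only the same idea.
final class SharedClient {
    private final AtomicInteger refCount = new AtomicInteger(0);
    private volatile boolean closed = false;

    // Each invoker that reuses the connection takes one reference.
    void retain() {
        refCount.incrementAndGet();
    }

    // Called when an invoker is destroyed; closes only on the last release.
    void release() {
        if (refCount.decrementAndGet() <= 0) {
            closed = true;
        }
    }

    boolean isClosed() {
        return closed;
    }

    int references() {
        return refCount.get();
    }
}

final class RefCountDemo {
    public static void main(String[] args) {
        SharedClient client = new SharedClient();
        client.retain(); // invoker held by the instance-level directory
        client.retain(); // invoker held by the interface-level directory

        client.release(); // only the instance-level invoker is destroyed
        System.out.println("closed=" + client.isClosed() + ", refs=" + client.references());

        client.release(); // the interface-level invoker is destroyed as well
        System.out.println("closed=" + client.isClosed() + ", refs=" + client.references());
    }
}
```

If only one of the two releases happens, the sketch stays open with one reference left, which matches the symptom of a "destroy invoker" log without a "Close netty channel" log.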

@TerryLam2010
Author

@laywin So the destroy in ServiceDiscoveryRegistryDirectory was triggered, but destroyUnusedInvokers in RegistryDirectory did not print a destroy log. That means not all invokers were destroyed, so the counter of ReferenceCountExchangeClient is not 0 and the Netty client is not closed. How should I investigate why the destruction was incomplete? Normally both ServiceDiscoveryRegistryDirectory and RegistryDirectory should be triggered, right? Or is there an existing case I can use as a reference?

@laywin
Contributor

laywin commented Nov 26, 2024

When the provider shuts down, you can watch the network connections on the consumer side (netstat -apn | grep 20880).

@TerryLam2010
Author

TerryLam2010 commented Nov 27, 2024

@laywin This problem is hard to reproduce; it may occur only once in 10 provider shutdowns. Normally the "destroy invoker" log is printed for both ServiceDiscoveryRegistryDirectory and RegistryDirectory. Sometimes one of the two is missing, and then the Netty channel cannot be closed. Do you have any suggestions for reproducing it?
Watching the network port is one way to check, but searching the logs for "Close netty channel" shows the same thing.
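
The log search described above could be automated roughly as follows. This is only a sketch: LogMarkerCheck and its methods are hypothetical, the two marker strings are the ones quoted in this thread, and the exact format of the surrounding log lines is not assumed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: looks for the symptom discussed in this thread,
// a "destroy invoker" log line with no "Close netty channel" line.
final class LogMarkerCheck {
    // Counts the lines that contain the given marker text.
    static long count(List<String> lines, String marker) {
        return lines.stream().filter(line -> line.contains(marker)).count();
    }

    // True when invokers were destroyed but no channel close was logged.
    static boolean destroyedButNotClosed(List<String> lines) {
        return count(lines, "destroy invoker") > 0
                && count(lines, "Close netty channel") == 0;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a consumer log slice covering one provider shutdown.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("destroy invoker: " + count(lines, "destroy invoker"));
        System.out.println("Close netty channel: " + count(lines, "Close netty channel"));
        System.out.println("suspicious: " + destroyedButNotClosed(lines));
    }
}
```

The check is per file rather than per connection, so it is most useful on a log slice that covers a single provider shutdown; a close logged for a different provider in the same slice would hide the symptom.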

@AlbumenJ
Member

@TerryLam2010 Can you test on the latest Dubbo 3.2 version? We have fixed several concurrency issues in 3.2.

@TerryLam2010
Author

TerryLam2010 commented Nov 29, 2024

@AlbumenJ I am using Dubbo 3.2.3 in the test environment. The issue is that the instance-level offline notification for ServiceDiscoveryRegistryDirectory arrived, but the interface-level notification for RegistryDirectory did not, so the counter of ReferenceCountExchangeClient is not zero. Could you tell me how refreshInvoker in RegistryDirectory gets triggered? Is there an ordering between ServiceDiscoveryRegistryDirectory and RegistryDirectory?
Because of some class-removal issues, I cannot upgrade beyond 3.2.3, so 3.2.3 is the highest version I can use for now. Can you tell me what code was modified to fix this problem? Regarding the provider going offline, I am not directly executing "ApplicationModel.defaultModel().destroy()". Instead, I first execute "Offline offline = new Offline(ApplicationModel.defaultModel().getFrameworkModel()); offline.execute(null, null);", and after waiting for 10 seconds, I then execute "ApplicationModel.defaultModel().destroy()".
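
The shutdown order described in this comment can be summarized as a two-phase sequence. The sketch below is an illustration only: TwoPhaseShutdown is a hypothetical helper, and the two Runnable arguments stand in for the quoted calls (offline.execute(null, null) first, then ApplicationModel.defaultModel().destroy()), which are not invoked here.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper; the two Runnables stand in for the real Dubbo calls.
final class TwoPhaseShutdown {
    static void shutdown(Runnable unregister, Runnable destroy, long waitMillis) {
        unregister.run(); // phase 1: take the provider out of the registry
        try {
            // Give consumers time to receive the offline notification.
            TimeUnit.MILLISECONDS.sleep(waitMillis);
        } catch (InterruptedException e) {
            // Keep the interrupt flag, but still run the destroy step.
            Thread.currentThread().interrupt();
        }
        destroy.run(); // phase 2: release connections and other resources
    }

    public static void main(String[] args) {
        shutdown(
                () -> System.out.println("offline"),
                () -> System.out.println("destroy"),
                100L); // the comment above waits 10 seconds; shortened for the demo
    }
}
```

With this order, a consumer that processes both the instance-level and the interface-level notification during the wait would be expected to close its connection itself before the provider closes the channel.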

@AlbumenJ
Member

AlbumenJ commented Dec 6, 2024

We have fixed a couple of connection-related bugs in 3.2.x, and it is hard to tell which one applies here. :)

Projects
Status: Todo
Development

No branches or pull requests

3 participants