[apache/dubbo] In 3.1.0, the proxyless mode cannot detect service changes or service instance changes

Dubbo version: 3.1.0
Operating System version: xxx
Java version: 1.8
Istio version: 1.13.2

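Running the Istio + proxyless demo, the consumer can call the provider normally after startup.
After scaling the provider to 2 replicas, the consumer does not detect the new provider instance, and the new instance receives no requests from the consumer.
After deleting the provider's k8s Service (deregistering the service), the consumer does not detect the change and can still call the provider.
Only after restoring the provider Service and restarting the consumer instance does the consumer discover both provider instances.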

zhaoli2333

Scaling out should work normally. Can you check whether the endpoints of the kube Service have been updated?

AlbumenJ

The endpoints show as updated, but the demo program cannot discover the new instance.

wucheng1997

Update: after rebuilding the image, service changes are detected normally. It may have been a local environment issue.

wucheng1997

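The problem has reproduced. Testing shows that an error occurs after the program has run for a while, and from then on it no longer receives any xDS notifications, which leads to the behavior described in the title.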

[25/08/22 02:30:44:044 UTC] grpc-default-executor-10 ERROR protocol.AbstractProtocol: [DUBBO] xDS Client received error message! detail:, dubbo version: 1.0-SNAPSHOT, current host: x.x.x.x
io.grpc.StatusRuntimeException: UNAVAILABLE: Connection closed after GOAWAY. HTTP/2 error code: NO_ERROR
    at io.grpc.Status.asRuntimeException(Status.java:535)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:479)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

wucheng1997

It looks like automatic reconnection logic needs to be added. Could you help submit a change?

AlbumenJ

pls assign me

MentosL

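The behavior above reproduces reliably after the program has run for about 30 minutes, so I suspect it is related to gRPC's idleTimeout for idle connections.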

From the gRPC documentation: In idle mode the channel shuts down all connections, the NameResolver and the LoadBalancer. A new RPC would take the channel out of idle mode. A channel starts in idle mode. Defaults to 30 minutes.

I tried two approaches: (1) adding a keepAliveTime setting, and (2) setting the idleTimeout parameter. In local testing, neither resolved the problem.

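It also looks like the error leaves the thread in a bad state, which in turn breaks the observe functionality. Reconnecting alone may not solve the problem; I'm not sure whether anyone has a good suggestion for a fix.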

wucheng1997

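I submitted a PR to fix this. The investigation showed that the main cause is that once the observer has received onError it can no longer process onNext; the program gets stuck there and cannot handle service changes.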

In local testing, after receiving the Connection closed after GOAWAY error, the client continues to work normally and detects service changes: https://github.com/apache/dubbo/pull/10544
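The idea of rebuilding the observer after onError can be sketched without any gRPC dependency. This is a minimal simulation, not the code from the PR: Observer, FakeServer and Client are hypothetical stand-ins for the gRPC StreamObserver, istiod, and the Dubbo xDS client, and the update strings are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: replace a dead stream observer after onError. All names are hypothetical. */
public class ResubscribeSketch {

    /** Minimal stand-in for a gRPC-style stream observer. */
    public interface Observer {
        void onNext(String update);
        void onError(Throwable t);
    }

    /** Fake discovery server that keeps at most one live stream, like a single ADS RPC. */
    public static class FakeServer {
        private Observer current;
        public int streamsOpened = 0;

        public void openStream(Observer o) {
            current = o;
            streamsOpened++;
        }

        /** Delivers an update to the live stream, if there is one. */
        public void push(String update) {
            if (current != null) {
                current.onNext(update);
            }
        }

        /** Simulates the server closing the connection, e.g. after a maximum connection age. */
        public void closeConnection() {
            Observer dead = current;
            current = null; // the old stream never delivers anything again
            if (dead != null) {
                dead.onError(new RuntimeException("UNAVAILABLE: Connection closed after GOAWAY"));
            }
        }
    }

    /** Client that opens a fresh stream whenever the current one fails. */
    public static class Client {
        private final FakeServer server;
        public final List<String> received = new ArrayList<>();

        public Client(FakeServer server) {
            this.server = server;
            subscribe();
        }

        private void subscribe() {
            server.openStream(new Observer() {
                @Override
                public void onNext(String update) {
                    received.add(update);
                }

                @Override
                public void onError(Throwable t) {
                    // The failed stream is terminal: start a new one instead of reusing it.
                    subscribe();
                }
            });
        }
    }

    public static void main(String[] args) {
        FakeServer server = new FakeServer();
        Client client = new Client(server);
        server.push("endpoints=1");
        server.closeConnection();   // triggers onError, which resubscribes
        server.push("endpoints=2"); // delivered on the second stream
        System.out.println(client.received + " streams=" + server.streamsOpened);
        // prints "[endpoints=1, endpoints=2] streams=2"
    }
}
```

A real client would presumably also resend its subscription request on the new stream and add a delay or backoff before resubscribing, so that a server that keeps failing does not cause a tight retry loop.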

wucheng1997

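@wucheng1997 Hello, I also followed the steps above to reproduce this locally. Scaling up the replica count worked fine at first (see Figure 1), but after running for a few minutes the error appeared (see Figure 2). I also searched for the corresponding message, UNAVAILABLE: HTTP/2 error code: NO_ERROR Received Goaway. The results all describe Netty-related JIRA issues that point in roughly the same direction as this one, namely retrying, but the problem still reproduces.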

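Finally, after scaling down and then back up, I found another problem: with replicas=3, calls only ever reach two instances and the third is never called. I'm still looking into it.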

MentosL

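For me it reproduces reliably at 30 minutes; I don't know what your timing is. Are you using stock Istio? This message may be sent by the Istio server side. Also, the problem is not in the communication itself: the xdsChannel state is normal. The problem is in the observer, where the code gets stuck at observer.onNext, so I rebuild the observer after the error. When you scaled down and back up and could not reach the third instance, the observer had probably already failed and could no longer process service change messages. Normal service change handling prints logs, so you can watch for those. I haven't hit the problem you describe; every instance can be called normally on my side. You could also try my change to see whether it solves your problem; it tests fine here.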

wucheng1997

Thanks for the reply. On my side the problem appeared about 5 minutes after scaling up. I'll try your change.

MentosL

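Some more detail on the problem: istiod has a keepaliveMaxServerConnectionAge parameter that defaults to 30m.

30 minutes after the connection is established, istiod logs the following: ads ADS: "" dubbo-samples-xds-consumer- terminated rpc error: code = Canceled desc = context canceled

At that point the client receives: Connection closed after GOAWAY. HTTP/2 error code: NO_ERROR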

wucheng1997

How do Envoy and istio-agent handle this problem?

AlbumenJ

Judging from the istiod logs, in Envoy sidecar mode a new connection is created after the canceled message is received. The relevant logs are:

2022-09-07T06:38:59.014611Z info ads ADS: "" demo-test-6769 terminated rpc error: code = Canceled desc = context canceled
2022-09-07T06:38:59.495091Z info ads ADS: new connection for node:demo-test-6813

wucheng1997

I found Envoy's solution:

Makes the HTTP health checker handle GOAWAY properly. When the NO_ERROR code is received, any in flight request will be allowed to complete, at which time the connection will be closed and a new connection created on the next interval.

https://github.com/envoyproxy/envoy/pull/13599

Here is the reply from the gRPC community (https://github.com/grpc/grpc-java/issues/9522):

Everything seems to be working like we'd expect. When you receive onError on the StreamObserver that RPC is dead. You can create a new RPC to replace it.

The NO_ERROR itself is not a concern. That is just saying the connection was closed routinely; no implementation did something wrong. From the HTTP/2 spec:

NO_ERROR (0x00): The associated condition is not a result of an error. For example, a GOAWAY might include this code to indicate graceful shutdown of a connection.
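To make the distinction in the quoted replies concrete, here is a small sketch of treating a GOAWAY with NO_ERROR as a routine close rather than a failure. It is only an illustration: the class, the describe policy and its messages are hypothetical, and only a subset of the HTTP/2 error codes from the spec is listed.

```java
/** Sketch: decide how to treat a GOAWAY based on its HTTP/2 error code. Names are hypothetical. */
public class GoAwaySketch {

    /** A subset of the HTTP/2 error codes defined in RFC 7540, section 7. */
    public enum Http2ErrorCode {
        NO_ERROR(0x0),
        PROTOCOL_ERROR(0x1),
        INTERNAL_ERROR(0x2),
        ENHANCE_YOUR_CALM(0xb);

        public final int code;

        Http2ErrorCode(int code) {
            this.code = code;
        }
    }

    /** NO_ERROR signals a graceful shutdown, so nothing went wrong on either side. */
    public static boolean isGracefulShutdown(Http2ErrorCode code) {
        return code == Http2ErrorCode.NO_ERROR;
    }

    /** Hypothetical policy: open a new stream in both cases, but only flag real errors. */
    public static String describe(Http2ErrorCode code) {
        return isGracefulShutdown(code)
                ? "routine close, open a new stream"
                : "abnormal close (" + code + "), log it and open a new stream";
    }

    public static void main(String[] args) {
        System.out.println(describe(Http2ErrorCode.NO_ERROR));
        System.out.println(describe(Http2ErrorCode.INTERNAL_ERROR));
    }
}
```

Either way the old RPC is dead once onError has fired, so the client still has to create a new one; the error code only affects how loudly the event should be reported.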

wucheng1997
