[alibaba/nacos] nacos 1.4.2 deployed as a cluster in k8s via StatefulSet: after one nacos node failed its health check, the remaining two nacos nodes failed the raft election, service discovery for the microservices broke, and the business was affected.

2023-12-20 987 views
5

Describe the bug In a k8s cluster, nacos 1.4.2 is deployed as a cluster via a StatefulSet, with podAntiAffinity configured so that nacos-0/nacos-1/nacos-2 are scheduled onto 3 different k8s nodes. Recently, in the customer environment, one of the nacos containers has intermittently restarted after failing its health check, and while that container was restarting, users of the application saw "no usable microservice found in the load balancer: xxxx". The nacos log file naming-raft.log shows the following:

2023-02-01 02:01:08,633 INFO Raft group naming_persistent_service has leader nacos-0.nacos-headless.default.svc.cluster.local.:7848
2023-02-01 02:01:15,003 WARN [IS LEADER] no leader is available now!
2023-02-01 02:01:15,657 INFO leader timeout, start voting,leader: null, term: 0
2023-02-01 02:01:15,667 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:01:15,935 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:01:30,000 WARN [IS LEADER] no leader is available now!
2023-02-01 02:01:32,657 INFO leader timeout, start voting,leader: null, term: 1
2023-02-01 02:01:32,661 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:01:32,661 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:01:45,001 WARN [IS LEADER] no leader is available now!
2023-02-01 02:01:48,157 INFO leader timeout, start voting,leader: null, term: 2
2023-02-01 02:01:48,160 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:01:48,401 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:00,000 WARN [IS LEADER] no leader is available now!
2023-02-01 02:02:08,157 INFO leader timeout, start voting,leader: null, term: 3
2023-02-01 02:02:08,162 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:08,163 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:15,000 WARN [IS LEADER] no leader is available now!
2023-02-01 02:02:24,157 INFO leader timeout, start voting,leader: null, term: 4
2023-02-01 02:02:24,159 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:24,161 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:30,000 WARN [IS LEADER] no leader is available now!
2023-02-01 02:02:44,157 INFO leader timeout, start voting,leader: null, term: 5
2023-02-01 02:02:44,159 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:44,160 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:45,001 WARN [IS LEADER] no leader is available now!
2023-02-01 02:03:00,000 WARN [IS LEADER] no leader is available now!
2023-02-01 02:03:01,157 INFO leader timeout, start voting,leader: null, term: 6
2023-02-01 02:03:01,160 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:03:01,160 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:03:15,001 WARN [IS LEADER] no leader is available now!
2023-02-01 02:03:15,121 WARN start to close old raft protocol!!!
2023-02-01 02:03:15,121 WARN stop old raft protocol task for notifier
2023-02-01 02:03:15,121 WARN stop old raft protocol task for master task
2023-02-01 02:03:15,121 WARN stop old raft protocol task for heartbeat task
2023-02-01 02:03:15,121 WARN clean old cache datum for old raft
2023-02-01 02:03:15,121 WARN start to close old raft protocol!!!
2023-02-01 02:03:15,121 WARN stop old raft protocol task for notifier
2023-02-01 02:03:15,121 WARN stop old raft protocol task for master task
2023-02-01 02:03:15,121 WARN stop old raft protocol task for heartbeat task
2023-02-01 02:03:15,121 WARN clean old cache datum for old raft
2023-02-01 02:03:15,121 WARN start to move old raft protocol metadata
2023-02-01 02:03:15,121 WARN start to move old raft protocol metadata

From the log, my guess is that after one of the three nacos nodes restarted, the remaining two nodes failed the raft election, leaving the nacos cluster without a usable leader. Shouldn't the raft protocol include some mechanism so that a leader can still be elected when only 2 nodes remain? How can this be avoided on version 1.4.2?
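For reference, the podAntiAffinity rule described at the top looks roughly like this (a minimal, hypothetical sketch of the StatefulSet pod template; the label name is illustrative and this is not the exact manifest from the customer environment):

```yaml
# Illustrative excerpt of the nacos StatefulSet pod template:
# spread nacos-0/nacos-1/nacos-2 across different k8s nodes by hostname.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: nacos            # assumed pod label
              topologyKey: kubernetes.io/hostname
```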

Expected behavior A clear and concise description of what you expected to happen.

Actual behavior A clear and concise description of what actually happened.

How to Reproduce Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See errors

Desktop (please complete the following information):

  • OS: [e.g. Centos]
  • Version [e.g. nacos-server 1.3.1, nacos-client 1.3.1]
  • Module [e.g. naming/config]
  • SDK [e.g. original, spring-cloud-alibaba-nacos, dubbo]

Additional context Add any other context about the problem here.

Answers

7

If we upgrade nacos, which version would completely fix this problem, 1.4.4 or 2.2.0? The network at the customer site fluctuates; that is something we cannot control or fully eliminate. When one node of the nacos cluster is unreachable, the cluster as a whole should still serve requests normally.

3

You have clearly misunderstood how this works:

  1. The log shown is from the restarted node. The other nodes have already switched to jraft and the old raft has been shut down, so the vote requests naturally fail; the end of the log also shows the old raft being closed, which means the restarted node has switched to jraft as well.
  2. There is no evidence that the other 2 nodes stopped serving.
  3. The business failing to find service providers is most likely because it read not-yet-synchronized data from the freshly restarted node. Ephemeral instances in nacos are kept eventually consistent (the AP protocol), so right after a node starts, before synchronization completes, the data can be briefly inconsistent, and the load balancer's own cache can drastically widen that window. You can enable the client-side push-empty protection to avoid it, or disable the load balancer's cache and let the nacos-client cache take over (see the config sketch after this list).
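A rough illustration of the two mitigations in point 3 (a sketch only; whether push-empty protection can be set through spring-cloud-alibaba configuration depends on the client version, so treat the exact property placement as an assumption to verify):

```yaml
# application.yml of a consumer service (hypothetical, for illustration)
spring:
  cloud:
    loadbalancer:
      cache:
        enabled: false   # turn off Spring Cloud LoadBalancer's instance-list cache
                         # so a stale or empty list is not amplified; rely on the nacos-client cache instead
```

The Nacos Java client additionally has a push-empty protection switch, namingPushEmptyProtection=true, which makes the client ignore an empty instance list pushed by a server that has not finished synchronizing after a restart; depending on the spring-cloud-alibaba version it may have to be set directly on the Nacos client Properties rather than in application.yml.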
2

First, here are the relevant nacos logs from when nacos-1 restarted:

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled at io.grpc.Status.asRuntimeException(Status.java:524) at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:341) at com.alipay.sofa.jraft.rpc.impl.GrpcServer$1.sendResponse(GrpcServer.java:153) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:55) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35) at com.alipay.sofa.jraft.rpc.impl.GrpcServer.lambda$null$1(GrpcServer.java:194) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
2023-02-01 02:00:28,793 INFO onStopFollowing: LeaderChangeContext [leaderId=nacos-0.nacos-headless.default.svc.cluster.local.:7848, term=19, status=Status[ERAFTTIMEDOUT<10001>: Lost connection from leader nacos-0.nacos-headless.default.svc.cluster.local.:7848.]].
2023-02-01 02:00:28,792 INFO Node <naming_persistent_service/nacos-1.nacos-headless.default.svc.cluster.local.:7848> term 19 start preVote.
2023-02-01 02:00:28,839 WARN [GRPC] failed to send response: {}.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled at io.grpc.Status.asRuntimeException(Status.java:524) at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:341) at com.alipay.sofa.jraft.rpc.impl.GrpcServer$1.sendResponse(GrpcServer.java:153) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:55) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35) at com.alipay.sofa.jraft.rpc.impl.GrpcServer.lambda$null$1(GrpcServer.java:194) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
2023-02-01 02:00:28,839 WARN [GRPC] failed to send response: {}.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled at io.grpc.Status.asRuntimeException(Status.java:524) at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:341) at com.alipay.sofa.jraft.rpc.impl.GrpcServer$1.sendResponse(GrpcServer.java:153) at com.alipay.sofa.jraft.rpc.impl.core.AppendEntriesRequestProcessor.processRequest0(AppendEntriesRequestProcessor.java:464) at com.alipay.sofa.jraft.rpc.impl.core.AppendEntriesRequestProcessor.processRequest0(AppendEntriesRequestProcessor.java:53) at com.alipay.sofa.jraft.rpc.impl.core.NodeRequestProcessor.processRequest(NodeRequestProcessor.java:60) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:53) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35) at com.alipay.sofa.jraft.rpc.impl.GrpcServer.lambda$null$1(GrpcServer.java:194) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor$Worker.runTask(MpscSingleThreadExecutor.java:352) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor$Worker.run(MpscSingleThreadExecutor.java:336) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor.lambda$doStartWorker$3(MpscSingleThreadExecutor.java:263) at java.lang.Thread.run(Thread.java:748)
2023-02-01 02:00:28,839 WARN [GRPC] failed to send response: {}.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled at io.grpc.Status.asRuntimeException(Status.java:524) at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:341) at com.alipay.sofa.jraft.rpc.impl.GrpcServer$1.sendResponse(GrpcServer.java:153) at com.alipay.sofa.jraft.rpc.impl.core.AppendEntriesRequestProcessor.processRequest0(AppendEntriesRequestProcessor.java:464) at com.alipay.sofa.jraft.rpc.impl.core.AppendEntriesRequestProcessor.processRequest0(AppendEntriesRequestProcessor.java:53) at com.alipay.sofa.jraft.rpc.impl.core.NodeRequestProcessor.processRequest(NodeRequestProcessor.java:60) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:53) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35) at com.alipay.sofa.jraft.rpc.impl.GrpcServer.lambda$null$1(GrpcServer.java:194) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor$Worker.runTask(MpscSingleThreadExecutor.java:352) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor$Worker.run(MpscSingleThreadExecutor.java:336) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor.lambda$doStartWorker$3(MpscSingleThreadExecutor.java:263) at java.lang.Thread.run(Thread.java:748)
2023-02-01 02:00:28,839 WARN [GRPC] failed to send response: {}.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled at io.grpc.Status.asRuntimeException(Status.java:524) at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:341) at com.alipay.sofa.jraft.rpc.impl.GrpcServer$1.sendResponse(GrpcServer.java:153) at com.alipay.sofa.jraft.rpc.impl.core.AppendEntriesRequestProcessor.processRequest0(AppendEntriesRequestProcessor.java:464) at com.alipay.sofa.jraft.rpc.impl.core.AppendEntriesRequestProcessor.processRequest0(AppendEntriesRequestProcessor.java:53) at com.alipay.sofa.jraft.rpc.impl.core.NodeRequestProcessor.processRequest(NodeRequestProcessor.java:60) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:53) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35) at com.alipay.sofa.jraft.rpc.impl.GrpcServer.lambda$null$1(GrpcServer.java:194) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor$Worker.runTask(MpscSingleThreadExecutor.java:352) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor$Worker.run(MpscSingleThreadExecutor.java:336) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor.lambda$doStartWorker$3(MpscSingleThreadExecutor.java:263) at java.lang.Thread.run(Thread.java:748)
2023-02-01 02:00:28,839 WARN [GRPC] failed to send response: {}.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled at io.grpc.Status.asRuntimeException(Status.java:524) at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:341) at com.alipay.sofa.jraft.rpc.impl.GrpcServer$1.sendResponse(GrpcServer.java:153) at com.alipay.sofa.jraft.rpc.impl.core.AppendEntriesRequestProcessor.processRequest0(AppendEntriesRequestProcessor.java:464) at com.alipay.sofa.jraft.rpc.impl.core.AppendEntriesRequestProcessor.processRequest0(AppendEntriesRequestProcessor.java:53) at com.alipay.sofa.jraft.rpc.impl.core.NodeRequestProcessor.processRequest(NodeRequestProcessor.java:60) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:53) at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35) at com.alipay.sofa.jraft.rpc.impl.GrpcServer.lambda$null$1(GrpcServer.java:194) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor$Worker.runTask(MpscSingleThreadExecutor.java:352) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor$Worker.run(MpscSingleThreadExecutor.java:336) at com.alipay.sofa.jraft.util.concurrent.MpscSingleThreadExecutor.lambda$doStartWorker$3(MpscSingleThreadExecutor.java:263) at java.lang.Thread.run(Thread.java:748)
2023-02-01 02:00:28,881 INFO Node <naming_persistent_service/nacos-1.nacos-headless.default.svc.cluster.local.:7848> received PreVoteResponse from nacos-0.nacos-headless.default.svc.cluster.local.:7848, term=19, granted=false.
2023-02-01 02:00:28,905 INFO Node <naming_persistent_service/nacos-1.nacos-headless.default.svc.cluster.local.:7848> received PreVoteResponse from nacos-2.nacos-headless.default.svc.cluster.local.:7848, term=19, granted=false.
2023-02-01 02:01:05,267 INFO SPI service [com.alipay.sofa.jraft.JRaftServiceFactory - com.alipay.sofa.jraft.core.DefaultJRaftServiceFactory] loading.
2023-02-01 02:01:05,348 INFO SPI service [com.alipay.sofa.jraft.rpc.RaftRpcFactory - com.alipay.sofa.jraft.rpc.impl.GrpcRaftRpcFactory] loading.
2023-02-01 02:01:05,810 INFO SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.NodeDescribeSignalHandler] loading.
2023-02-01 02:01:05,810 INFO SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.NodeMetricsSignalHandler] loading.
2023-02-01 02:01:05,811 INFO SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.ThreadPoolMetricsSignalHandler] loading.
2023-02-01 02:01:05,814 INFO SPI service [com.alipay.sofa.jraft.util.timer.RaftTimerFactory - com.alipay.sofa.jraft.util.timer.DefaultRaftTimerFactory] loading.
2023-02-01 02:01:05,820 INFO The number of active nodes increment to 1.
2023-02-01 02:01:07,003 INFO Starts FSMCaller successfully.
2023-02-01 02:01:07,018 INFO Loading snapshot, meta=last_included_index: 1045 last_included_term: 19 peers: "nacos-1.nacos-headless.default.svc.cluster.local.:7848" peers: "nacos-0.nacos-headless.default.svc.cluster.local.:7848" peers: "nacos-2.nacos-headless.default.svc.cluster.local.:7848" .
2023-02-01 02:01:07,112 INFO Node <naming_persistent_service/nacos-1.nacos-headless.default.svc.cluster.local.:7848> onSnapshotLoadDone, last_included_index: 1045 last_included_term: 19 peers: "nacos-1.nacos-headless.default.svc.cluster.local.:7848" peers: "nacos-0.nacos-headless.default.svc.cluster.local.:7848" peers: "nacos-2.nacos-headless.default.svc.cluster.local.:7848"
2023-02-01 02:01:07,112 INFO Node <naming_persistent_service/nacos-1.nacos-headless.default.svc.cluster.local.:7848> target priority value has changed from: 0, to: -1.
2023-02-01 02:01:07,133 INFO Node <naming_persistent_service/nacos-1.nacos-headless.default.svc.cluster.local.:7848> init, term=19, lastLogId=LogId [index=1045, term=19], conf=nacos-1.nacos-headless.default.svc.cluster.local.:7848,nacos-0.nacos-headless.default.svc.cluster.local.:7848,nacos-2.nacos-headless.default.svc.cluster.local.:7848, oldConf=.
2023-02-01 02:01:07,135 WARN RPC server is not started in RaftGroupService.
2023-02-01 02:01:07,135 INFO Start the RaftGroupService successfully.
2023-02-01 02:01:07,217 WARN Channel in TRANSIENT_FAILURE state: nacos-1.nacos-headless.default.svc.cluster.local.:7848.
2023-02-01 02:01:07,218 WARN Channel in SHUTDOWN state: nacos-1.nacos-headless.default.svc.cluster.local.:7848.
2023-02-01 02:01:07,496 WARN Channel in TRANSIENT_FAILURE state: nacos-0.nacos-headless.default.svc.cluster.local.:7848.
2023-02-01 02:01:07,496 WARN Channel in SHUTDOWN state: nacos-0.nacos-headless.default.svc.cluster.local.:7848.

> The following is from nacos-cluster.log:

2023-02-01 02:01:05,068 INFO Current addressing mode selection : FileConfigMemberLookup
2023-02-01 02:01:05,073 WARN [serverlist] updated to : [Member{ip='nacos-2.nacos-headless.default.svc.cluster.local.', port=8848, state=DOWN, extendInfo={raftPort=7848}}, Member{ip='nacos-0.nacos-headless.default.svc.cluster.local.', port=8848, state=DOWN, extendInfo={raftPort=7848}}, Member{ip='nacos-1.nacos-headless.default.svc.cluster.local.', port=8848, state=UP, extendInfo={raftPort=7848}}]
2023-02-01 02:01:05,080 WARN [serverlist] updated to : [Member{ip='nacos-2.nacos-headless.default.svc.cluster.local.', port=8848, state=DOWN, extendInfo={raftPort=7848}}, Member{ip='nacos-0.nacos-headless.default.svc.cluster.local.', port=8848, state=DOWN, extendInfo={raftPort=7848}}, Member{ip='nacos-1.nacos-headless.default.svc.cluster.local.', port=8848, state=UP, extendInfo={raftPort=7848}}]
2023-02-01 02:01:09,237 INFO This node is ready to provide external services
2023-02-01 03:59:39,000 ERROR failed to report new info to target node : nacos-2.nacos-headless.default.svc.cluster.local.:8848, error : caused: 2,500 milliseconds timeout on connection http-outgoing-115 [ACTIVE];
2023-02-01 14:28:22,140 ERROR failed to report new info to target node : nacos-2.nacos-headless.default.svc.cluster.local.:8848, error : caused: 2,500 milliseconds timeout on connection http-outgoing-615 [ACTIVE];

8

This is a network environment problem. It is all UnknownHostException and cancelled requests, which means the nodes themselves are running normally but requests cannot get through.

2

The error in the microservice is: com.alibaba.nacos.api.exception.NacosException: java.net.UnknownHostException: nacos-1.nacos-headless.default.svc.cluster.local. With nacos-1 down, the resolution failure itself is expected. I have one remaining question about the nacos cluster: the cluster address used by the clients is nacos.default.svc:8848, yet while nacos-1 was abnormal the microservices kept requesting nacos-1.nacos-headless.default.svc.cluster.local., which looks wrong. In the customer environment the incidents always happen in the early morning at unpredictable times, and by the time the customer reports them things have mostly recovered on their own. In theory, while nacos-1 is down, the endpoints of nacos.default.svc should contain only nacos-0 and nacos-2, so why do the microservices still request nacos-1.nacos-headless.default.svc.cluster.local. instead of sending requests to nacos-0.nacos-headless.default.svc.cluster.local. and nacos-2.nacos-headless.default.svc.cluster.local.?
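For context, the client-side addressing described here would typically look like this (a hypothetical sketch of a Spring Cloud Alibaba consumer, not the customer's actual configuration):

```yaml
# application.yml of a microservice (illustrative)
spring:
  cloud:
    nacos:
      discovery:
        # the cluster Service address; the per-pod headless names
        # (nacos-X.nacos-headless.default.svc.cluster.local.) are not configured on the client
        server-addr: nacos.default.svc:8848
```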

7

That is a problem in your environment; you need to investigate yourself why your k8s environment is still sending requests to that node. If the node is in the DOWN state, neither client requests nor server-side forwarding should be routed to it.

Since you are still getting that error, the other nodes consider it UP, or the svc considers it UP.