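Describe the bug A Nacos 1.4.2 cluster is deployed as a StatefulSet in a k8s cluster, with podAntiAffinity configured in the StatefulSet so that nacos-0/nacos-1/nacos-2 are scheduled onto three different k8s nodes. Recently, in a customer environment, one of the Nacos containers has been restarting at irregular intervals because its health check fails. While that container is restarting, customers using the application see the error "负载均衡中找不到可以使用的微服务:xxxx" (no usable microservice found in the load balancer: xxxx). The Nacos log file naming-raft.log contains the following: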
2023-02-01 02:01:08,633 INFO Raft group naming_persistent_service has leader nacos-0.nacos-headless.default.svc.cluster.local.:7848
2023-02-01 02:01:15,003 WARN [IS LEADER] no leader is available now!
2023-02-01 02:01:15,657 INFO leader timeout, start voting,leader: null, term: 0
2023-02-01 02:01:15,667 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:01:15,935 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:01:30,000 WARN [IS LEADER] no leader is available now!
2023-02-01 02:01:32,657 INFO leader timeout, start voting,leader: null, term: 1
2023-02-01 02:01:32,661 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:01:32,661 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:01:45,001 WARN [IS LEADER] no leader is available now!
2023-02-01 02:01:48,157 INFO leader timeout, start voting,leader: null, term: 2
2023-02-01 02:01:48,160 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:01:48,401 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:00,000 WARN [IS LEADER] no leader is available now!
2023-02-01 02:02:08,157 INFO leader timeout, start voting,leader: null, term: 3
2023-02-01 02:02:08,162 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:08,163 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:15,000 WARN [IS LEADER] no leader is available now!
2023-02-01 02:02:24,157 INFO leader timeout, start voting,leader: null, term: 4
2023-02-01 02:02:24,159 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:24,161 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:30,000 WARN [IS LEADER] no leader is available now!
2023-02-01 02:02:44,157 INFO leader timeout, start voting,leader: null, term: 5
2023-02-01 02:02:44,159 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:44,160 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:02:45,001 WARN [IS LEADER] no leader is available now!
2023-02-01 02:03:00,000 WARN [IS LEADER] no leader is available now!
2023-02-01 02:03:01,157 INFO leader timeout, start voting,leader: null, term: 6
2023-02-01 02:03:01,160 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-0.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:03:01,160 ERROR NACOS-RAFT vote failed: 500, url: http://nacos-2.nacos-headless.default.svc.cluster.local.:8848/nacos/v1/ns/raft/vote
2023-02-01 02:03:15,001 WARN [IS LEADER] no leader is available now!
2023-02-01 02:03:15,121 WARN start to close old raft protocol!!!
2023-02-01 02:03:15,121 WARN stop old raft protocol task for notifier
2023-02-01 02:03:15,121 WARN stop old raft protocol task for master task
2023-02-01 02:03:15,121 WARN stop old raft protocol task for heartbeat task
2023-02-01 02:03:15,121 WARN clean old cache datum for old raft
2023-02-01 02:03:15,121 WARN start to close old raft protocol!!!
2023-02-01 02:03:15,121 WARN stop old raft protocol task for notifier
2023-02-01 02:03:15,121 WARN stop old raft protocol task for master task
2023-02-01 02:03:15,121 WARN stop old raft protocol task for heartbeat task
2023-02-01 02:03:15,121 WARN clean old cache datum for old raft
2023-02-01 02:03:15,121 WARN start to move old raft protocol metadata
2023-02-01 02:03:15,121 WARN start to move old raft protocol metadata
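Judging from the logs, my guess is that after one of the three Nacos nodes restarted, the Raft election among the remaining two nodes failed, leaving the Nacos cluster with no available leader. Shouldn't the Raft protocol have some mechanism that guarantees a leader can still be elected when only two nodes remain? How can this be prevented from happening again on version 1.4.2?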
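For reference, here is the quorum arithmetic behind that question, as a minimal sketch. This is not Nacos code; `quorum` and `can_elect_leader` are hypothetical helper names, and the sketch only assumes the standard Raft rule that a candidate needs votes from a strict majority of the configured cluster, counting its own vote.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of votes that forms a strict majority of the cluster."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def can_elect_leader(cluster_size: int, granted_votes: int) -> bool:
    """True if the votes granted (candidate's own vote included) reach a majority."""
    return granted_votes >= quorum(cluster_size)


if __name__ == "__main__":
    # Three configured nodes, two of them answering vote requests: majority reached.
    print(can_elect_leader(3, 2))
    # Three configured nodes, only the candidate's own vote: no majority.
    print(can_elect_leader(3, 1))
```

Under that rule two healthy nodes out of three should be enough to elect a leader. In the log above, however, the vote requests to both nacos-0 and nacos-2 return 500, so the node writing this log (apparently nacos-1) seems to collect only its own vote in every term, which would explain why no leader is elected.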