[alibaba/nacos]"naming_persistent_service_v2" nacos raft 集群一直报错(并且无法干预)

2024-07-18 136 views
7
Deployment environment

k8s cluster
Version: 2.0.1

Errors in the logs

2021-09-14 14:48:12,637 ERROR Fail to refresh route configuration for group : naming_persistent_service_v2, status is : Status[UNKNOWN<-1>: handleRequest internal error]

2021-09-14 14:48:22,030 ERROR Fail to refresh leader for group : naming_persistent_service_v2, status is : Status[UNKNOWN<-1>: Unknown leader, Unknown leader, Unknown leader]

2021-09-14 14:48:22,031 ERROR Fail to refresh route configuration for group : naming_persistent_service_v2, status is : Status[UNKNOWN<-1>: handleRequest internal error]

2021-09-14 14:48:31,435 ERROR Fail to refresh leader for group : naming_persistent_service_v2, status is : Status[UNKNOWN<-1>: Unknown leader, Unknown leader, Unknown leader]

2021-09-14 14:48:31,437 ERROR Fail to refresh route configuration for group : naming_persistent_service_v2, status is : Status[UNKNOWN<-1>: handleRequest internal error]

2021-09-14 14:48:40,839 ERROR Fail to refresh leader for group : naming_persistent_service_v2, status is : Status[UNKNOWN<-1>: Unknown leader, Unknown leader, Unknown leader]

Errors in the console
            "errMsg": "Error [type=ERROR_TYPE_STATE_MACHINE, status=Status[ESTATEMACHINE<10002>: StateMachine meet critical error when applying one or more tasks since index=25, Status[ESTATEMACHINE<10002>: StateMachine meet critical error: java.lang.NullPointerException\n\tat com.alibaba.nacos.naming.core.v2.service.impl.PersistentClientOperationServiceImpl.onInstanceDeregister(PersistentClientOperationServiceImpl.java:181)\n\tat com.alibaba.nacos.naming.core.v2.service.impl.PersistentClientOperationServiceImpl.onApply(PersistentClientOperationServiceImpl.java:157)\n\tat com.alibaba.nacos.core.distributed.raft.NacosStateMachine.onApply(NacosStateMachine.java:115)\n\tat com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:539)\n\tat com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:508)\n\tat com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:440)\n\tat com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)\n\tat com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)\n\tat com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)\n\tat com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)\n\tat java.lang.Thread.run(Thread.java:748)\n.]]]",
Resolution

We have already restarted the cluster. Since k8s does not mount a shared disk, every restart is effectively a brand-new instance, yet the cluster still cannot recover and new persistent instances cannot be registered in production!

Answers

7

I tried to reproduce this locally. The cause is two deregister requests being sent back to back. Nacos does have a check on the server side, but it clearly does not hold under concurrency; see line 108 of InstanceOperatorClientImpl below (a reproduction sketch follows the snippet):

public void removeInstance(String namespaceId, String serviceName, Instance instance) {
    boolean ephemeral = instance.isEphemeral();
    String clientId = IpPortBasedClient.getClientId(instance.toInetAddr(), ephemeral);
    if (!clientManager.contains(clientId)) { // under concurrency, both deregister requests can pass this check
        Loggers.SRV_LOG.warn("remove instance from non-exist client: {}", clientId);
        return;
    }
    Service service = getService(namespaceId, serviceName, ephemeral);
    clientOperationService.deregisterInstance(service, instance, clientId);
}
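
For reference, a rough local reproduction sketch (not code from this issue): it assumes a nacos-client dependency and a server at 127.0.0.1:8848, and the service name, IP and port are made up. It registers one persistent instance and then fires two deregister requests for it at nearly the same time, so that both can slip past the contains(clientId) check above.

import java.util.concurrent.CountDownLatch;

import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;

public class DoubleDeregisterRepro {

    public static void main(String[] args) throws Exception {
        NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");

        Instance instance = new Instance();
        instance.setIp("10.0.0.1");
        instance.setPort(8080);
        instance.setEphemeral(false); // persistent instance, handled by the raft-backed v2 service

        naming.registerInstance("demo-service", instance);
        Thread.sleep(1000); // give the registration a moment to land

        // release two deregister calls at (almost) the same instant
        CountDownLatch start = new CountDownLatch(1);
        Runnable deregister = () -> {
            try {
                start.await();
                naming.deregisterInstance("demo-service", instance);
            } catch (Exception e) {
                e.printStackTrace();
            }
        };
        Thread t1 = new Thread(deregister);
        Thread t2 = new Thread(deregister);
        t1.start();
        t2.start();
        start.countDown();
        t1.join();
        t2.join();
        naming.shutDown();
    }
}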

Both messages are then replicated by raft, and each one eventually invokes PersistentClientOperationServiceImpl#onInstanceDeregister.

The first request removes the client from clientManager; the second request then hits a NullPointerException:

private void onInstanceDeregister(Service service, String clientId) {
    Service singleton = ServiceManager.getInstance().getSingleton(service);
    // returns null once the first deregister request has already removed the client
    Client client = clientManager.getClient(clientId);
    // first dereference of the null client throws the NullPointerException seen in the console error
    client.removeServiceInstance(singleton);
    client.setLastUpdatedTime();
    if (client.getAllPublishedService().isEmpty()) {
        clientManager.clientDisconnected(clientId);
    }
    NotifyCenter.publishEvent(new ClientOperationEvent.ClientDeregisterServiceEvent(singleton, clientId));
}

Once this happens, the cluster can no longer accept new requests. Is there any way to recover it?

5

This situation requires changing the logic in the state machine so that it can always apply the raft log normally; adding some check logic to the raft-log apply path avoids it.
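
To make that idea concrete, here is a generic sofa-jraft style sketch of such a guard, assuming only the public jraft API; this is not Nacos's actual NacosStateMachine, and the applyEntry helper is hypothetical. Known, deterministic "nothing left to remove" cases are treated as no-ops so every replica skips them identically, while genuinely unexpected failures are still reported so that jraft halts the node instead of letting it diverge silently.

import java.nio.ByteBuffer;

import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;
import com.alipay.sofa.jraft.error.RaftError;

public class GuardedStateMachine extends StateMachineAdapter {

    @Override
    public void onApply(Iterator iter) {
        while (iter.hasNext()) {
            try {
                applyEntry(iter.getData()); // hypothetical business apply, e.g. the deregister handling
            } catch (IllegalStateException alreadyGone) {
                // a known, deterministic "already removed" case: every replica skips it the same way, safe to continue
            } catch (Throwable unexpected) {
                // an unknown failure: stop applying and report it, so jraft halts this node rather than letting it diverge
                iter.setErrorAndRollback(1, new Status(RaftError.ESTATEMACHINE, "apply failed: %s", unexpected.getMessage()));
                return;
            }
            iter.next();
        }
    }

    private void applyEntry(ByteBuffer data) {
        // deserialize the request and mutate local state; throw IllegalStateException only for known no-op cases
    }
}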

3

@Cczzzz could you open a PR to fix this by adding a check in the onInstanceDeregister method?

0

Under raft, a single log entry failing to apply brings down the whole cluster so that it can no longer accept new requests, which is frightening. In my view, once a message is confirmed as successfully written, applying it must also succeed.

9

This is a limitation of sofa-jraft. Once the state machine runs into a problem, raft cannot be allowed to keep working; otherwise strong consistency of the raft data across the whole cluster can no longer be guaranteed.

6

Every raft implementation requires this. If the state machine's apply hits something unexpected, it must not continue applying raft log entries; if it ignored the unexpected situation and kept applying anyway, that state machine's data would already have diverged from the other nodes.
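
As a toy illustration of that divergence (plain Java, independent of Nacos and jraft): two replicas apply the same committed log, but one of them swallows a local apply failure and keeps going, so the two state machines end up holding different data.

import java.util.List;

public class SkipDivergenceDemo {

    // apply the committed log; failAt == -1 means no failure, otherwise the failure is swallowed and the entry skipped
    static int apply(List<Integer> log, int failAt) {
        int state = 0;
        for (int i = 0; i < log.size(); i++) {
            try {
                if (i == failAt) {
                    throw new IllegalStateException("simulated apply failure on this replica");
                }
                state += log.get(i);
            } catch (IllegalStateException e) {
                // "just keep going": the committed entry is silently dropped on this replica only
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<Integer> committedLog = List.of(10, 20, 30);
        int replicaA = apply(committedLog, -1); // applies everything: 60
        int replicaB = apply(committedLog, 1);  // skips the second entry: 40
        System.out.println("replica A = " + replicaA + ", replica B = " + replicaB);
    }
}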

6

Besides, the raft protocol guarantees log consistency, not data consistency; it is a consensus protocol, not a consistency protocol.

0

So does the erroneous log data need to be rolled back? For the state machine error I hit, ERROR_TYPE_STATE_MACHINE, it is unrecoverable; the only option is to clean up the raft log and restart the service.

3

Any progress on this issue? Our services have also gone down several times. Each time we have to restart every node to recover, and after running for a while it goes down again.

6

Changing the PersistentClientOperationServiceImpl onInstanceDeregister method as follows avoids the problem above:

private void onInstanceDeregister(Service service, String clientId) {
    Service singleton = ServiceManager.getInstance().getSingleton(service);
    Client client = clientManager.getClient(clientId);
    if (client == null) {
        Loggers.RAFT.warn("client not exist onInstanceDeregister,clientId : {} ", clientId);
        return;
    }
    client.removeServiceInstance(singleton);
    client.setLastUpdatedTime();
    if (client.getAllPublishedService().isEmpty()) {
        clientManager.clientDisconnected(clientId);
    }
    NotifyCenter.publishEvent(new ClientOperationEvent.ClientDeregisterServiceEvent(singleton, clientId));
}
9

2.1.0-beta still does not have this null check; you have to add it yourself.

6

Could the maintainers merge this as a CR? I have recently been merging in the code from the official master branch.