[alibaba/nacos]Naming Redo 任务存在影响最终一致性的问题

2023-12-19 460 views
5

Describe the bug

com.alibaba.nacos.client.naming.remote.gprc.redo.RedoScheduledTask 任务在反注册后马上注册时,对反注册任务 redo 会导致注册的实例失效。

Expected behavior

按照指令提交的时序保证最终一致性。

Actually behavior

redo 任务导致后提交的指令失效

How to Reproduce Steps to reproduce the behavior:

  1. 注册实例A
  2. 反注册实例A
  3. 注册实例B

实例 A 和 B 的 serviceName、ip、port 信息一致

在反注册实例 A 成功之前 com.alibaba.nacos.client.naming.remote.gprc.redo.RedoScheduledTask 扫描到对应任务执行 redo,但是在执行 redo 成功之前,主线程先注册了实例 B。最终 redo 任务的结果会覆盖注册实例 B 的信息。

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version nacos-server 2.1.2, nacos-client 2.2.1-RC
  • Module naming
  • SDK original

Additional context

对应 Dubbo 的测试任务:https://pipelines.actions.githubusercontent.com/serviceHosts/7adf70df-6981-4a3e-aa1b-09d8698255ce/_apis/pipelines/1/runs/4020/signedlogcontent/23?urlExpires=2023-01-08T13%3A05%3A15.4375164Z&urlSigningMethod=HMACV1&urlSignature=Pzbubkz7lP9ljHqs6QlzKZm36icBGDVb5toAcDx5%2FPs%3D

关键日志如下:

Provider:
[08/01/23 10:05:45:045 UTC] Dubbo-framework-shared-scheduler-thread-7  INFO metadata.MetadataInfo:  [DUBBO] metadata revision changed: fcbd87cacae4b7833a71908ed11a73c9 -> 8baad1a3026019171d25f19813f857fd, app: App4, services: 3, dubbo version: 3.1.5-SNAPSHOT, current host: 172.20.0.6
[08/01/23 10:05:45:045 UTC] Dubbo-framework-shared-scheduler-thread-7  INFO client.AbstractServiceDiscovery:  [DUBBO] Metadata of instance changed, updating instance with revision 8baad1a3026019171d25f19813f857fd., dubbo version: 3.1.5-SNAPSHOT, current host: 172.20.0.6
[08/01/23 10:05:45:045 UTC] Dubbo-framework-shared-scheduler-thread-7  INFO client.naming: [DEREGISTER-SERVICE] public deregistering service App4 with instance: Instance{instanceId='null', ip='172.20.0.6', port=20880, weight=1.0, healthy=true, enabled=true, ephemeral=true, clusterName='null', serviceName='App4', metadata={dubbo.endpoints=[{"port":20880,"protocol":"dubbo"}], dubbo.metadata-service.url-params={"connections":"1","version":"1.0.0","dubbo":"2.0.2","release":"3.1.5-SNAPSHOT","side":"provider","port":"20880","protocol":"dubbo"}, dubbo.metadata.revision=fcbd87cacae4b7833a71908ed11a73c9, dubbo.metadata.storage-type=local, timestamp=1673172310370}}
[08/01/23 10:05:45:045 UTC] DubboServerHandler-172.20.0.6:20880-thread-200  INFO client.naming: [DEREGISTER-SERVICE] public deregistering service providers:org.apache.dubbo.samples.api.DemoService3:: with instance: Instance{instanceId='null', ip='172.20.0.6', port=20880, weight=1.0, healthy=true, enabled=true, ephemeral=true, clusterName='DEFAULT', serviceName='null', metadata={}}
[08/01/23 10:05:45:045 UTC] Dubbo-framework-shared-scheduler-thread-7  INFO client.naming: [REGISTER-SERVICE] public registering service App4 with instance Instance{instanceId='null', ip='172.20.0.6', port=20880, weight=1.0, healthy=true, enabled=true, ephemeral=true, clusterName='null', serviceName='App4', metadata={dubbo.endpoints=[{"port":20880,"protocol":"dubbo"}], dubbo.metadata-service.url-params={"connections":"1","version":"1.0.0","dubbo":"2.0.2","release":"3.1.5-SNAPSHOT","side":"provider","port":"20880","protocol":"dubbo"}, dubbo.metadata.revision=8baad1a3026019171d25f19813f857fd, dubbo.metadata.storage-type=local, timestamp=1673172310370}}
[08/01/23 10:05:45:045 UTC] Dubbo-framework-shared-scheduler-thread-7  INFO metadata.MetadataInfo:  [DUBBO] metadata revision changed: 8baad1a3026019171d25f19813f857fd -> 87d04346ba5686acfd02e97748b86896, app: App4, services: 2, dubbo version: 3.1.5-SNAPSHOT, current host: 172.20.0.6
[08/01/23 10:05:45:045 UTC] Dubbo-framework-shared-scheduler-thread-7  INFO client.AbstractServiceDiscovery:  [DUBBO] Metadata of instance changed, updating instance with revision 87d04346ba5686acfd02e97748b86896., dubbo version: 3.1.5-SNAPSHOT, current host: 172.20.0.6
[08/01/23 10:05:45:045 UTC] Dubbo-framework-shared-scheduler-thread-7  INFO client.naming: [DEREGISTER-SERVICE] public deregistering service App4 with instance: Instance{instanceId='null', ip='172.20.0.6', port=20880, weight=1.0, healthy=true, enabled=true, ephemeral=true, clusterName='null', serviceName='App4', metadata={dubbo.endpoints=[{"port":20880,"protocol":"dubbo"}], dubbo.metadata-service.url-params={"connections":"1","version":"1.0.0","dubbo":"2.0.2","release":"3.1.5-SNAPSHOT","side":"provider","port":"20880","protocol":"dubbo"}, dubbo.metadata.revision=8baad1a3026019171d25f19813f857fd, dubbo.metadata.storage-type=local, timestamp=1673172310370}}
[08/01/23 10:05:45:045 UTC] com.alibaba.nacos.client.naming.grpc.redo.0  INFO client.naming: Redo instance operation UNREGISTER for isolated@@App4
[08/01/23 10:05:45:045 UTC] Dubbo-framework-shared-scheduler-thread-7  INFO client.naming: [REGISTER-SERVICE] public registering service App4 with instance Instance{instanceId='null', ip='172.20.0.6', port=20880, weight=1.0, healthy=true, enabled=true, ephemeral=true, clusterName='null', serviceName='App4', metadata={dubbo.endpoints=[{"port":20880,"protocol":"dubbo"}], dubbo.metadata-service.url-params={"connections":"1","version":"1.0.0","dubbo":"2.0.2","release":"3.1.5-SNAPSHOT","side":"provider","port":"20880","protocol":"dubbo"}, dubbo.metadata.revision=87d04346ba5686acfd02e97748b86896, dubbo.metadata.storage-type=local, timestamp=1673172310370}}

Consumer:
[08/01/23 10:05:46:046 UTC] nacos-grpc-client-executor-nacos-3  INFO client.naming: removed ips(1) service: isolated@@App4 -> [{"ip":"172.20.0.6","port":20880,"weight":1.0,"healthy":true,"enabled":true,"ephemeral":true,"clusterName":"DEFAULT","serviceName":"isolated@@App4","metadata":{"dubbo.metadata-service.url-params":"{\"connections\":\"1\",\"version\":\"1.0.0\",\"dubbo\":\"2.0.2\",\"release\":\"3.1.5-SNAPSHOT\",\"side\":\"provider\",\"port\":\"20880\",\"protocol\":\"dubbo\"}","dubbo.endpoints":"[{\"port\":20880,\"protocol\":\"dubbo\"}]","dubbo.metadata.revision":"fcbd87cacae4b7833a71908ed11a73c9","dubbo.metadata.storage-type":"local","timestamp":"1673172310370"},"instanceHeartBeatTimeOut":15000,"ipDeleteTimeout":30000,"instanceHeartBeatInterval":5000}]
[08/01/23 10:05:46:046 UTC] nacos-grpc-client-executor-nacos-3  INFO client.naming: current ips:(0) service: isolated@@App4 -> []

以下是测试反注册时使用的 instance metadata 不一致是否会成功反注册:

        String serverAddr = System.getProperty("nacos.address", "mse-*******-p.nacos-ans.mse.aliyuncs.com");
        Properties properties = new Properties();
        properties.put("username", System.getProperty("username", "nacos"));
        properties.put("password", System.getProperty("password", "nacos"));
        properties.put(PropertyKeyConst.SERVER_ADDR, serverAddr);
        NamingService namingService = NacosFactory.createNamingService(properties);

        Instance instance1 = new Instance();
        instance1.setServiceName("TestService");
        instance1.setIp(NetUtils.getLocalHost());
        instance1.setPort(20880);
        Map<String, String> metadata = new HashMap<>();
        metadata.put("key1", "val1");
        metadata.put("key2", "val2");
        metadata.put("key3", "val3");
        instance1.setMetadata(metadata);
        instance1.setEnabled(true);
        instance1.setHealthy(true);

        namingService.registerInstance("TestService", "isolated", instance1);

        Instance instance2 = new Instance();
        instance2.setServiceName("TestService");
        instance2.setIp(NetUtils.getLocalHost());
        instance2.setPort(20880);
        Map<String, String> metadata2 = new HashMap<>();
        metadata2.put("key1", "val1");
        metadata2.put("key2", "val3");
        metadata2.put("key3", "val3");
        instance2.setMetadata(metadata2);
        instance2.setEnabled(true);
        instance2.setHealthy(true);

        namingService.deregisterInstance("TestService", "isolated", instance2);

        List<Instance> testService = namingService.getAllInstances("TestService");
        System.out.println(testService);

回答

5

Redo 任务提交的相关代码: image

image

image

8

代码里面可以分析出 redo 任务的执行是周期性的,并不是由于连接断开等异常情况触发的。因此,在 redo 任务执行的过程中如果有其他指令正在正常执行,会导致回放的问题。如果其他指令执行之后马上又对状态进行修改,会被回放的指令所覆盖。

7

可能的解决方案:

  1. 最简单:namingservice 加读写锁,获取 redo 任务的时候加写锁进行排他,保证没有在途请求,这样就不会导致回放正常指令的问题,而且锁住后也能保证没有其他线程在写入
  2. 复杂版本:请求加版本号,判断是否版本已经过期了,如果过期了就直接忽略请求
3

redo是修正断线重连后状态的任务,断链只会修改状态,redo再根据状态判断是否执行重新注册或订阅的任务。

执行注册和注销应该最后也会修改这个状态,具体问题可能要排查一下,不能简单进行加锁。

另外版本号并不能解决问题,版本号和现在的状态是一样的,并发修改会导致执行前后顺序问题。

5

实例 A 和 B 的 serviceName、ip、port 信息一致,我理解在客户端的视角看来这就是同一个实例吧,那这个问题的本质其实是两个线程同时对一个实例进行操作,一个注册,一个反注册,最终状态由后完成的线程决定,我理解这里应该没啥问题。

不考虑批量注册的时候,redoData的key就是serviceName+groupName,nacos-client的注册和反注册接口也没有对instance进行判重,参数传进去就忠实操作。同一个 namingService 对象操作同一个instance,自然是哪个操作在后面生效就以哪个操作为终态。

2

最终一致性没问题,但是指令有序性在某个瞬间会有问题的。如果A和B的服务信息,ip,port都一样,那在A已经注册的情况下,又对A做反注册,然后又并发对B做注册,那B注册的结果可能会被A给覆盖,在这一时候查询拿到的结果就是错的。但是因为A反注册的没有Redo任务,而B的注册redo一直在,B会被重新注册上去。所以从整体来看,最终一致性没问题。但是,在A刚反注册B刚注册那会,可能会有问题

7

这是来自QQ邮箱的假期自动回复邮件。   您好,我最近正在休假中,无法亲自回复您的邮件。我将在假期结束后,尽快给您回复。

8

这里两个问题 1、A的反注册也是有redo任务的,并且A的redo和B的redo是同一个任务。如果B注册的cache动作在前,A的反注册完成动作在后,A反注册完成之后,redo任务被remove掉了,终态是注销完成。 2、A和B的服务信息、ip、port都一样,在namingService看来它们就是同一个实例。这种情况下我理解应该由使用方来保证指令的有序性符合使用方的预期。

7

又反复看了几遍代码,我觉得在这样一个场景下还真的可能存在问题: 按时间顺序描述一下: 1、对于已经注册某实例A,执行反注册操作,执行过程中,连接断开 2、onDisconnect()方法执行完毕,实例A的redoData状态为 : registerd = false, unregistering = true 3、连接恢复,redo task 周期任务开始执行,判断实例A的redoType 为 Remove; 4、redoType判断操作执行完毕之后,remove动作开始之前,有另外一个线程开始执行注册动作,该线程先执行 cacheInstanceForRedo方法,锁住 registerdInstances 这个map,更新实例A的redoData 5、redo task 在步骤4的锁释放之后,执行实例A的redoData remove 动作,直接将实例A的redoData清除 6、步骤4完成实例A的注册,但此时内存里面已经没有实例A的redoData了,再出现连接的断开-恢复,实例A并不能自动完成注册,这条实例信息丢失了。

@KomachiSion 大佬有空看看这个场景吗?