[apache/dubbo]连接Zookeeper , ConnLoss 导致OOM

2024-07-02 796 views
8

依赖版本 zookeeper: 3.4.9

        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-spring-boot-starter</artifactId>
            <version>2.7.8</version>
        </dependency>

        <dependency>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-recipes</artifactId>
            <version>2.10.0</version>
        </dependency>

dubbo配置

dubbo.protocol.lazy=true
dubbo.protocol.heartbeat=3000
dubbo.registry.address=
dubbo.registry.register=false
dubbo.consumer.check=false
dubbo.application.qos.enable=false
dubbo.registry.file=./dubboregistry/dubbo-registry.properties
dubbo.provider.timeout=10000
dubbo.provider.threadpool=fixed
dubbo.provider.threads=100
dubbo.provider.protocol=dubbo
dubbo.provider.cluster=failfast
dubbo.provider.loadbalance=roundrobin
dubbo.provider.server=netty
dubbo.consumer.lazy=true
dubbo.consumer.timeout=10000
dubbo.application.logger=slf4j

遇到的错误日志如下:

2021-11-11 09:49:36 | siac | * | main-SendThread(10.135.101.101:2181) | INFO  | org.apache.zookeeper.ClientCnxn |  | ClientCnxn.java:1096 | Client session timed out, have not heard from server in 26676ms for sessionid 0x17a78d8154236ec, closing socket connection and attempting reconnect
2021-11-11 09:49:36 | siac | * | main-EventThread | INFO  | o.a.curator.framework.state.ConnectionStateManager |  | ConnectionStateManager.java:228 | State change: SUSPENDED
2021-11-11 09:49:37 | siac | * | main-SendThread(10.135.101.101:2181) | INFO  | org.apache.zookeeper.ClientCnxn |  | ClientCnxn.java:975 | Opening socket connection to server 10.135.101.101/10.135.101.101:2181. Will not attempt to authenticate using SASL (unknown error)
2021-11-11 09:49:37 | siac | * | main-SendThread(10.135.101.101:2181) | INFO  | org.apache.zookeeper.ClientCnxn |  | ClientCnxn.java:852 | Socket connection established to 10.135.101.101/10.135.101.101:2181, initiating session
2021-11-11 09:49:37 | siac | * | main-SendThread(10.135.101.101:2181) | INFO  | org.apache.zookeeper.ClientCnxn |  | ClientCnxn.java:1235 | Session establishment complete on server 10.135.101.101/10.135.101.101:2181, sessionid = 0x17a78d8154236ec, negotiated timeout = 40000
2021-11-11 09:49:37 | siac | * | main-EventThread | INFO  | o.a.curator.framework.state.ConnectionStateManager |  | ConnectionStateManager.java:228 | State change: RECONNECTED
2021-11-11 09:49:37 | siac | * | main-SendThread(10.135.101.101:2181) | INFO  | org.apache.zookeeper.ClientCnxn |  | ClientCnxn.java:1098 | Unable to read additional data from server sessionid 0x17a78d8154236ec, likely server has closed socket, closing socket connection and attempting reconnect
2021-11-11 09:49:38 | siac | * | Curator-ConnectionStateManager-0 | INFO  | o.a.dubbo.registry.zookeeper.ZookeeperRegistry |  | ZookeeperRegistry.java:304 |  [DUBBO] Fetching the latest urls of [consumer://10.135.101.104/com.authority.service.IAuthorityService?application=*&category=providers,configurators,routers&check=false&dubbo=2.0.2&init=false&interface=com.authority.service.IAuthorityService&lazy=true&logger=slf4j&metadata-type=remote&methods=getRoleInfoByRoleEnName,updateRole,addResources,getAuthorithByUsername,deleteResource,getOperateLogList,addLog,disableRole,getUsefulRoles,getResourcesDetail,batchInsertAuthUserRole,getRoles,powerList,getResourcesTree,getUserByUsername,getResourcesList,getRole,getAuthTreeByRoleId,selectBdList,getUserInfoByRoleId,getRoleByUsername,modifyResource,getUserInfoByRoleEnName,powerRedis,userRole,addRole,getRolesByResKey&pid=948&qos.enable=false&release=2.7.8&side=consumer&sticky=false&timeout=10000&timestamp=1636446226827, consumer://10.135.101.104/com.rpc.goods.service.IGoodsService?application=*&category=providers,configurators,routers&check=false&dubbo=2.0.2&init=false&interface=com.rpc.goods.service.IGoodsService&lazy=true&logger=slf4j&metadata-type=remote&methods=queryGoodsLogPage,queryGoodsPage,getCategoryList,queryGoodsList,getCategoryTree&pid=948&qos.enable=false&release=2.7.8&revision=1.1.0&side=consumer&sticky=false&timeout=10000&timestamp=1636446227578, consumer://10.135.101.104/com.rpc.goods.service.ICategoryService?application=*&category=providers,configurators,routers&check=false&dubbo=2.0.2&init=false&interface=com.rpc.goods.service.ICategoryService&lazy=true&logger=slf4j&metadata-type=remote&methods=listSelect,deleteById,searchFirstCategoryDetail,list,searchSecondCategoryDetail,saveCategoryInfo,queryAttr,queryAttrById,saveSecondCategory,updateById&pid=948&qos.enable=false&release=2.7.8&revision=1.1.0&side=consumer&sticky=false&timeout=10000&timestamp=1636446227716, consumer://10.135.101.104/com.rpc.goods.service.GoodsManagerService?application=*&category=providers,configurators,routers&check=false&dubbo=2.0.2&init=false&interface=com.rpc.goods.service.GoodsManagerService&lazy=true&logger=slf4j&metadata-type=remote&methods=updateGoods,pullOffShelves,reject,deleteGoods,approve,addGoods,queryGoods,executeGoodsJob,applyReviewGoods,getGoodsDetail,putOnShelves&pid=948&qos.enable=false&release=2.7.8&revision=1.1.0&side=consumer&sticky=false&timeout=10000&timestamp=1636446227537, consumer://10.135.101.104/com.rpc.coupon.service.ICouponModelServ?application=*&category=providers,configurators,routers&check=false&dubbo=2.0.2&init=false&interface=com.rpc.coupon.service.ICouponModelServ&lazy=true&logger=slf4j&metadata-type=remote&methods=getCouponModelListByCouponSeq,addCouponModel,getMatchedCouponModel,getCouponModelByCouponSeq,getAllCouponModelsByCouponSeq,getAllCouponModels&pid=948&qos.enable=false&release=2.7.8&side=consumer&sticky=false&timeout=10000&timestamp=1636446227674, consumer://10.135.101.104/com.sso.service.IOrgSrv?application=*&category=providers,configurators,routers&check=false&dubbo=2.0.2&init=false&interface=com.sso.service.IOrgSrv&lazy=true&logger=slf4j&metadata-type=remote&methods=orgTreeFilter,getOrgInfoById,updateOrgInfo,getOrgInfo,getAllProvince,orgTree,getAreaByCity,getCityByProvince,selectCityByOrg,addOrgInfo,getAllOrg,getOrgInfoByCode&pid=948&qos.enable=false&release=2.7.8&side=consumer&sticky=false&timeout=10000&timestamp=1636446227482, consumer://10.135.101.104/com.activity.dubboService.IActivityService?application=*&category=providers,configurators,routers&check=false&dubbo=2.0.2&init=false&interface=com.activity.dubboService.IActivityService&lazy=true&logger=slf4j&metadata-type=remote&methods=getMmpCouponList,updateAllUpComingActivityStatus,getAllUpComingActivityByActivityType,getOrderShareActivityByOrderAmount,offerByThirdCoupons,getAllForthComingActivityByActivityType,offerByThirdCoupon,queryShopActivitySettingById,smsActivityBrandCodeToUser,queryShopActivitySettingDetailByCondition,getActivityCouponList,getRedEnvelopeActivity,updateShopActivitySetting,getUserRewardsDetail,queryActivityListByCondition,queryShopActivitySettingByCondition,updateAllForthComingActivityStatus,getOrderShareActivityDetail&pid=948&qos.enable=false&release=2.7.8&side=consumer&sticky=false&timeout=10000&timestamp=1636446227636, consumer://10.135.101.104/com.authority.service.UserSynService?application=*&category=providers,configurators,routers&check=false&dubbo=2.0.2&init=false&interface=com.authority.service.UserSynService&lazy=true&logger=slf4j&metadata-type=remote&methods=queryLoginUserInfo,queryUserInfoDetail,queryUserInfoList,synUserInfo,getSynUserInfo&pid=948&qos.enable=false&release=2.7.8&side=consumer&sticky=false&timeout=10000&timestamp=1636446227130, consumer://10.135.101.104/com.rpc.goods.service.GoodsUserService?application=*&category=providers,configurators,routers&check=false&dubbo=2.0.2&init=false&interface=com.rpc.goods.service.GoodsUserService&lazy=true&logger=slf4j&metadata-type=remote&methods=selectByUserName,queryUserInfoDetail,save,selectById,updateByIdSelective&pid=948&qos.enable=false&release=2.7.8&revision=1.1.0&side=consumer&sticky=false&timeout=10000&timestamp=1636446227230], dubbo version: 2.7.8, current host: 10.135.101.104
2021-11-11 09:49:40 | siac | * | Curator-Framework-0 | INFO  | o.a.curator.framework.state.ConnectionStateManager |  | ConnectionStateManager.java:228 | State change: LOST
2021-11-11 09:49:41 | siac | * | Curator-Framework-0 | INFO  | o.apache.curator.framework.recipes.cache.TreeCache |  | TreeCache.java:469 | Unknown event CuratorEventImpl{type=WATCHED, resultCode=-4, path='null', name='null', children=null, context=null, stat=null, data=null, watchedEvent=WatchedEvent state:Disconnected type:None path:null, aclList=null}
2021-11-11 09:49:41 | siac | * | Curator-Framework-0 | INFO  | o.apache.curator.framework.recipes.cache.TreeCache |  | TreeCache.java:469 | Unknown event CuratorEventImpl{type=WATCHED, resultCode=-4, path='null', name='null', children=null, context=null, stat=null, data=null, watchedEvent=WatchedEvent state:Disconnected type:None path:null, aclList=null}
2021-11-11 09:49:48 | siac | * | DubboRegistryRetryTimer-thread-1 | INFO  | o.apache.dubbo.registry.retry.FailedSubscribedTask |  | AbstractRetryTask.java:121 |  [DUBBO] retry subscribe : consumer://10.135.101.104/com.authority.service.IAuthorityService?application=*&category=providers,configurators,routers&check=false&dubbo=2.0.2&init=false&interface=com.authority.service.IAuthorityService&lazy=true&logger=slf4j&metadata-type=remote&methods=getRoleInfoByRoleEnName,updateRole,addResources,getAuthorithByUsername,deleteResource,getOperateLogList,addLog,disableRole,getUsefulRoles,getResourcesDetail,batchInsertAuthUserRole,getRoles,powerList,getResourcesTree,getUserByUsername,getResourcesList,getRole,getAuthTreeByRoleId,selectBdList,getUserInfoByRoleId,getRoleByUsername,modifyResource,getUserInfoByRoleEnName,powerRedis,userRole,addRole,getRolesByResKey&pid=948&qos.enable=false&release=2.7.8&side=consumer&sticky=false&timeout=10000&timestamp=1636446226827, dubbo version: 2.7.8, current host: 10.135.101.104
2021-11-11 09:49:48 | siac | * | Curator-Framework-0 | INFO  | o.apache.curator.framework.recipes.cache.TreeCache |  | TreeCache.java:469 | Unknown event CuratorEventImpl{type=WATCHED, resultCode=-4, path='null', name='null', children=null, context=null, stat=null, data=null, watchedEvent=WatchedEvent state:Disconnected type:None path:null, aclList=null}
2021-11-11 09:49:47 | siac | * | main-SendThread(10.135.101.101:2181) | WARN  | org.apache.zookeeper.ClientCnxn |  | ClientCnxn.java:1102 | Session 0x17a78d8154236ec for server 10.135.101.101/10.135.101.101:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2021-11-11 09:49:47 | siac | * | Curator-ConnectionStateManager-0 | WARN  | o.a.d.r.zookeeper.curator.CuratorZookeeperClient |  | CuratorZookeeperClient.java:376 |  [DUBBO] Curator zookeeper connection recovered from connection lose, reuse the old session 17a78d8154236ec, dubbo version: 2.7.8, current host: 10.135.101.104
2021-11-11 09:49:47 | siac | * | Curator-ConnectionStateManager-0 | WARN  | o.a.dubbo.registry.zookeeper.ZookeeperRegistry |  | ZookeeperRegistry.java:89 |  [DUBBO] Trying to fetch the latest urls, in case there're provider changes during connection loss.
 Since ephemeral ZNode will not get deleted for a connection lose, there's no need to re-register url of this instance., dubbo version: 2.7.8, current host: 10.135.101.104
2021-11-11 10:23:46 | siac | * | main-EventThread | ERROR | o.a.curator.framework.imps.CuratorFrameworkImpl |  | CuratorFrameworkImpl.java:557 | Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:516)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:166)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:593)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

内存溢出的直接原因在: image

绝大多数的 ClientCnxn$Packet 的 AsyncCallback cb 为 GetChildrenBuilderImpl。

存活对象: image

遇到过两次这种问题。

回答

7

这个应该是curator低版本的缺陷吧,目前master分支使用的curator-recipes版本已经升到4.2.0了

8

这个应该是curator低版本的缺陷吧,目前master分支使用的curator-recipes版本已经升到4.2.0了

之前使用的是 alibaba dubbo , 使用的curator的版本也是2.x版本, 没见过这种错误, 最近使用了 apache dubbo , 也是这两天才发现这个问题的, 我明天再试试 4.2.0 的curator , 看还会不会碰到这种问题。

7

看图片你的Xmx貌似还不到400M,而且存活的最大对象也不到40M,内存泄露的可能性不大。 如果你没有配置zk连接断开重连最大次数,默认就是无限次重连,如果网络不稳定,不间断地重连会消耗很多内存。

8

看图片你的Xmx貌似还不到400M,而且存活的最大对象也不到40M,内存泄露的可能性不大。 如果你没有配置zk连接断开重连最大次数,默认就是无限次重连,如果网络不稳定,不间断地重连会消耗很多内存。

好的, 我处理一下断开重连和dubbo版本的问题, 这次碰到的不断重连的问题, 我无法重现, 请问有什么手段复现吗?

堆的一些信息 image image

9

将应用正常启动后, 使用 tc 命令模拟网络异常, 将zk服务端的包延时设置为了5秒钟, 复现了之前的情况。 将日志级别调整为debug , 发现在重连阶段,产生了大量的 WatchEvent 事件。 使用 arthas 的 watch 命令发现, ZooKeeper#getChildWatches 有3500个, ZooKeeper#getDataWatches 也有3500个。 但是查看zk中的dubbo接口,总共只有280个左右,而本应用依赖的接口只有6个(本应用为consumer)。

2021-11-12 21:26:12 | siac |  | main-EventThread | DEBUG | o.apache.curator.framework.recipes.cache.TreeCache |  | TreeCache.java:341 | process: WatchedEvent state:Disconnected type:None path:null

Event事件打印完毕之后就是打印ConnLoss错误日志

2021-11-12 21:26:13 | siac | evcard-goods-back | Curator-Framework-0 | ERROR | o.a.curator.framework.imps.CuratorFrameworkImpl |  | CuratorFrameworkImpl.java:557 | Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

上述两个步骤不断的执行,如果 EventThread 中的 waitingEvents队列的消费速度跟不上生产速度,最终导致了OOM。

将dubbo版本调整到 2.7.9 之后,发现这种日志少了很多。 具体的原因没找出来 (=_=)