[alibaba/nacos]2.0.2版本的异常,nacos报错后有部分应用从注册中心掉线,重启后才正常注册成功

2024-07-18 926 views
5

nacos的版本是2.0.2,k8s部署的集群,3个节点

2021-11-20 13:15:51--2021-11-20 13:55:50 这个时间段 nacos三个节点都报错,config-fatal.log的报错如下:

2021-11-20 13:15:07,724 ERROR [db-error] org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 3000ms.

org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 3000ms. at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:82) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:612) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:669) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:700) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:712) at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:783) at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:804) at com.alibaba.nacos.config.server.service.repository.extrnal.ExternalStoragePaginationHelperImpl.fetchPage(ExternalStoragePaginationHelperImpl.java:68) at com.alibaba.nacos.config.server.service.repository.extrnal.ExternalStoragePaginationHelperImpl.fetchPage(ExternalStoragePaginationHelperImpl.java:57) at com.alibaba.nacos.config.server.auth.ExternalPermissionPersistServiceImpl.getPermissions(ExternalPermissionPersistServiceImpl.java:74) at com.alibaba.nacos.console.security.nacos.roles.NacosRoleServiceImpl.getPermissionsFromDatabase(NacosRoleServiceImpl.java:220) at com.alibaba.nacos.console.security.nacos.roles.NacosRoleServiceImpl.getPermissions(NacosRoleServiceImpl.java:181) at com.alibaba.nacos.console.security.nacos.roles.NacosRoleServiceImpl.hasPermission(NacosRoleServiceImpl.java:143) at com.alibaba.nacos.console.security.nacos.NacosAuthManager.auth(NacosAuthManager.java:144) at com.alibaba.nacos.core.auth.RemoteRequestAuthFilter.filter(RemoteRequestAuthFilter.java:79) at com.alibaba.nacos.core.remote.RequestHandler.handleRequest(RequestHandler.java:49) at com.alibaba.nacos.core.remote.grpc.GrpcRequestAcceptor.request(GrpcRequestAcceptor.java:168) at com.alibaba.nacos.core.remote.grpc.BaseGrpcServer.lambda$addServices$0(BaseGrpcServer.java:177) at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172) at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35) at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23) at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40) at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86) at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814) at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 3000ms. at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:689) at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:196) at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:161) at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128) at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:158) at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:116) at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:79) ... 29 common frames omitted

2021-11-20 13:15:10,920 ERROR [db-error] org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 3000ms.

org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 3000ms. at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:82) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:612) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:669) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:700) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:712) at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:783) at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:804) at com.alibaba.nacos.config.server.service.repository.extrnal.ExternalStoragePaginationHelperImpl.fetchPage(ExternalStoragePaginationHelperImpl.java:68) at com.alibaba.nacos.config.server.service.repository.extrnal.ExternalStoragePaginationHelperImpl.fetchPage(ExternalStoragePaginationHelperImpl.java:57) at com.alibaba.nacos.config.server.auth.ExternalRolePersistServiceImpl.getRolesByUserName(ExternalRolePersistServiceImpl.java:103) at com.alibaba.nacos.console.security.nacos.roles.NacosRoleServiceImpl.getRolesFromDatabase(NacosRoleServiceImpl.java:171) at com.alibaba.nacos.console.security.nacos.roles.NacosRoleServiceImpl.getRoles(NacosRoleServiceImpl.java:162) at com.alibaba.nacos.console.security.nacos.NacosAuthManager.loginRemote(NacosAuthManager.java:126) at com.alibaba.nacos.core.auth.RemoteRequestAuthFilter.filter(RemoteRequestAuthFilter.java:79) at com.alibaba.nacos.core.remote.RequestHandler.handleRequest(RequestHandler.java:49) at com.alibaba.nacos.core.remote.grpc.GrpcRequestAcceptor.request(GrpcRequestAcceptor.java:168) at com.alibaba.nacos.core.remote.grpc.BaseGrpcServer.lambda$addServices$0(BaseGrpcServer.java:177) at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172) at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35) at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23) at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40) at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86) at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814) at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 3000ms. at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:689) at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:196) at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:161) at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128) at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:158) at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:116) at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:79) ... 28 common frames omitted 2021-11-20 13:15:11,038 ERROR [db-error] master db down.

nacos.log的信息如下: 2021-11-20 13:06:47,659 INFO [capacityManagement] end correct usage, cost: 0s

2021-11-20 13:15:07,470 WARN HikariPool-1 - Connection com.mysql.cj.jdbc.ConnectionImpl@15ca2acf marked as broken because of SQLSTATE(null), ErrorCode(0)

com.mysql.cj.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:113) at com.mysql.cj.jdbc.StatementImpl.checkCancelTimeout(StatementImpl.java:2191) at com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:1022) at com.mysql.cj.NativeSession.execSQL(NativeSession.java:1075) at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:930) at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery(ClientPreparedStatement.java:1003) at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52) at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java) at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(JdbcTemplate.java:678) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:617) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:669) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:700) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:712) at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:783) at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:804) at com.alibaba.nacos.config.server.service.repository.extrnal.ExternalStoragePaginationHelperImpl.fetchPage(ExternalStoragePaginationHelperImpl.java:68) at com.alibaba.nacos.config.server.service.repository.extrnal.ExternalStoragePaginationHelperImpl.fetchPage(ExternalStoragePaginationHelperImpl.java:57) at com.alibaba.nacos.config.server.auth.ExternalRolePersistServiceImpl.getRolesByUserName(ExternalRolePersistServiceImpl.java:103) at com.alibaba.nacos.console.security.nacos.roles.NacosRoleServiceImpl.getRolesFromDatabase(NacosRoleServiceImpl.java:171) at com.alibaba.nacos.console.security.nacos.roles.NacosRoleServiceImpl.getRoles(NacosRoleServiceImpl.java:162) at com.alibaba.nacos.console.security.nacos.NacosAuthManager.loginRemote(NacosAuthManager.java:126) at com.alibaba.nacos.core.auth.RemoteRequestAuthFilter.filter(RemoteRequestAuthFilter.java:79) at com.alibaba.nacos.core.remote.RequestHandler.handleRequest(RequestHandler.java:49) at com.alibaba.nacos.core.remote.grpc.GrpcRequestAcceptor.request(GrpcRequestAcceptor.java:168) at com.alibaba.nacos.core.remote.grpc.BaseGrpcServer.lambda$addServices$0(BaseGrpcServer.java:177) at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172) at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35) at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23) at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40) at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86) at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814) at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2021-11-20 13:15:07,647 WARN HikariPool-1 - Connection com.mysql.cj.jdbc.ConnectionImpl@1927dc0f marked as broken because of SQLSTATE(08S01), ErrorCode(0)

2021-11-20 13:55:50 后 config-fatal.log和nacos.log 没有再报错这些了, 期间也没有重启nacos,但是有一部分应用在nacos的注册中心上找不到了,netstat查看和nacos的9948端口还有正常的连接,而且抓包发现也有通信,这些业务应用都已经运行了一段时间,有些运行了几个月。最后是重启这些注册失败的应用,然后在nacos的控制台上查看注册成功了。

现在有几个疑问: 1、这些日志报错看提示是从hikaripool拿不到有效的连接?为什么会突然出现这种报错呢? 2、nacos我们只用了注册中心功能,没有使用配置中心功能,为什么这个报错后会导致有部分应用从nacos掉线?正常来讲不管是消费者应用还是生产者应用,和nacos建立连接后,即使nacos和mysql的交互有问题也不影响应用的注册和发现功能吧? 3、prometheus暴露的指标没有hikaripool相关的信息,这些后面是否考虑暴露出来好监控查看 ?

  • OS: centos
  • Version: nacos-server 2.0.2, nacos-client 有几个
  • Module: naming

回答

1

我在2.0.3也出现这种问题,我的应用其实是已经注册在上面了,因为配置文件内容也读成功了,但是在注册中心面板上没有显示出来。

5

1、如果能复现这个场景,建议找一下客户端日志,看客户端是否失去连接 2、如果能复现这个场景,建议用客户端调一下服务发现的方法,排除一下是控制台显示错误还是真的服务掉线了

3

1、这个是客户端日志 [] grpc connection disconnect,mark to redo [] mark to redo completed [] grpc connection connect 2、服务是真的掉线,因为请求方从nacos获取不到可用的目标实例 3、nacos三个节点和一部分的业务应用的pod在同一个node机器的,所以排除网络原因, 感觉就是那一段时间nacos集群和mysql之间的连接断了,然后应用和nacos之间也断开重连了。 4、部分使用了client 2.0.1的有概率出现重新注册失败后就不注册了,client 2.0.3的都重新注册成功,这个是client 2.0.1的bug吧?

4

client 2.0.2 版本之前的自动重注册是有bug的,2.0.3版本修复了。

这个暂时给不了结论,有空的时候本地试一下,直接杀掉mysql进程,看会不会影响已注册的服务

7
"这个暂时给不了结论,有空的时候本地试一下,直接杀掉mysql进程,看会不会影响已注册的服务"

这个我测试过了,nacos和mysql不通的时候,已经运行中的应用还是可以正常调用目标应用 所以nacos和mysql不通, 应用从nacos掉线后重连,是两回事,并不是nacos和mysql不通影响了应用从nacos掉线重新注册

5

bingo,所以并不是数据库异常导致服务掉线,服务为什么会掉线还是需要进一步找原因, grpc connection disconnect,mark to redo 这个日志说明连接是真的被断开了,本地测试的时候杀掉mysql进程应该也不会导致客户端连接被断开吧?

5

不会,我当时是通过故障注入方式让nacos无法访问mysql的,然后观察config-fatal.log,发现报错和生产环境的基本上一样 而且,从日志上看,nacos和mysql连接报错,和应用从nacos掉线后重连,基本上是同一时间告警的。但是我观察nacos集群的三个节点之间的通信并没有报错日志 所以猜测,nacos集群和外界在那一段时间的通信有问题,但是在这段时间2021-11-20 13:15:51--2021-11-20 13:55:50 是断断续续的出现了几次问题,在13:55后就不再报错

0

nacos 无法访问mysql 一定会导致 config-fatal.log里面的 db-error 告警,因为无论注册中心还是配置中心都需要访问db获取用户、角色和权限配置。 从你贴出来的调用堆栈来看,你是不是把权限的 auth.caching.enabled 置为false了,导致每一笔请求都要去数据库查询校验权限,然后应用掉线了。

4

我看了下,默认就是false的,但实际模拟的时候,我让nacos和mysql不通的时候,请求方还是可以正常调用目标方的哦

0
“从你贴出来的调用堆栈来看,你是不是把权限的 auth.caching.enabled 置为false了,导致每一笔请求都要去数据库查询校验权限,然后应用掉线了。”

你说的这个每一笔请求,是仅仅指应用注册的请求吗?

8

image nacos 默认是 true的吧

服务发现的请求也会。不过nacos client 默认的服务发现策略是订阅一次,之后依赖server推送。应用注册也是注册一次,之后维护长连接就好。 所以我认为导致上述日志出现的原因是: client grpc 连接断开 -> client 重注册 -> server 访问数据库获取权限信息 -> server db error 日志 -> client 重注册失败

但是还是没有找到为什么 grpc 连接会被断开的原因。你可以多试几次,是不是注入 mysql 异常一定导致客户端 grpc connection disconnect,mark to redo

9

我是k8s部署的,有什么接口可以查询当前nacos.core.auth.caching.enabled是什么状态吗? 我登录pod内查看application.properties的配置如下:

Turn on/off caching of auth information. By turning on this switch, the update of auth information would have a 15 seconds delay.

nacos.core.auth.caching.enabled=${NACOS_AUTH_CACHE_ENABLE:false}

8

我刚才模拟了以下步骤 1、故障注入让nacos无法访问mysql,然后应用重启,无法注册成功 2、删除故障注入任务,让nacos重新正常访问mysql 3、在k8s的nacos.yaml文件添加以下变量

  • name: NACOS_AUTH_CACHE_ENABLE value: "true" 然后重启nacos,接着先让应用重新注册上来 接下来继续故障注入让nacos无法访问mysql,然后让应用重新注册,结果是注册成功了 4、得出结论如下: 4.1 nacos.core.auth.caching.enabled 在k8s的部署下,默认是false的 4.2 开启nacos.core.auth.caching.enabled为true,可以保证在nacos无法访问mysql期间,可以让应用正常注册
9

上面的步骤和结论都没有问题,符合推理 但还是不能说明为什么注入mysql故障会导致客户端grpc连接被断开。 从逻辑上说,说这两者根本就没有必然关系,是不是你看到的mysql异常,然后客户端连接被断开了,只是碰巧发生在一起的偶然。

5

嗯,从排查的结果来看,nacos和mysql异常,应用从nacos掉线重新注册,这两个是两回事,但是发生的时间是一起的,nacos内部三个节点那段时间通信是没问题的,因为有问题会有日志产生,但是并没有相关的日志 也排除了网络原因,因为nacos集群和业务应用有部署在一起,有些还是在同一台node机器上。 所以就不确定是不是nacos集群和关联方的通信是不是哪里有问题或者bug

6

可能只是集群或者网络抖动了一下吧。只要重注册OK,我觉得问题都不大。

4

嗯,是的,只要保证可以重新注册成功就还好,谢谢