[alibaba/nacos]nacos1.4.2网络异常处理机制存在缺陷,没有设置连接池请求超时,长时间压力场景运行持续内存泄露导致不间断FullGC停止服务

2024-07-18 146 views
9

Describe the bug 网络异常时nacos.log异常日志片段:

2021-12-07 17:50:12,573 ERROR CONSOLE

org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:351)
    at org.apache.catalina.connector.OutputBuffer.appendByteArray(OutputBuffer.java:746)
    at org.apache.catalina.connector.OutputBuffer.append(OutputBuffer.java:675)
    at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:386)
    at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:364)
    at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:96)
    at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:89)
    at org.springframework.util.StreamUtils.copy(StreamUtils.java:106)
    at org.springframework.http.converter.ByteArrayHttpMessageConverter.writeInternal(ByteArrayHttpMessageConverter.java:70)
    at org.springframework.http.converter.ByteArrayHttpMessageConverter.writeInternal(ByteArrayHttpMessageConverter.java:39)
    at org.springframework.http.converter.AbstractHttpMessageConverter.write(AbstractHttpMessageConverter.java:227)
    at org.springframework.web.servlet.mvc.method.annotation.AbstractMessageConverterMethodProcessor.writeWithMessageConverters(AbstractMessageConverterMethodProcessor.java:284)
    at org.springframework.web.servlet.mvc.method.annotation.HttpEntityMethodProcessor.handleReturnValue(HttpEntityMethodProcessor.java:226)
    at org.springframework.web.method.support.HandlerMethodReturnValueHandlerComposite.handleReturnValue(HandlerMethodReturnValueHandlerComposite.java:82)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:123)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:892)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:797)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1040)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:626)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:733)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.CorsFilter.doFilterInternal(CorsFilter.java:97)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.boot.actuate.web.trace.servlet.HttpTraceFilter.doFilterInternal(HttpTraceFilter.java:88)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at com.alibaba.nacos.naming.web.DistroFilter.doFilter(DistroFilter.java:157)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at com.alibaba.nacos.core.auth.AuthFilter.doFilter(AuthFilter.java:69)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at com.alibaba.nacos.naming.web.TrafficReviseFilter.doFilter(TrafficReviseFilter.java:75)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.security.web.FilterChainProxy.doFilterInternal(FilterChainProxy.java:209)
    at org.springframework.security.web.FilterChainProxy.doFilter(FilterChainProxy.java:178)
    at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:358)
    at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:271)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:94)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.filterAndRecordMetrics(WebMvcMetricsFilter.java:114)
    at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:104)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:143)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:374)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1590)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
    at org.apache.tomcat.util.net.NioChannel.write(NioChannel.java:138)
    at org.apache.tomcat.util.net.NioBlockingSelector.write(NioBlockingSelector.java:101)
    at org.apache.tomcat.util.net.NioSelectorPool.write(NioSelectorPool.java:152)
    at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite(NioEndpoint.java:1253)
    at org.apache.tomcat.util.net.SocketWrapperBase.doWrite(SocketWrapperBase.java:764)
    at org.apache.tomcat.util.net.SocketWrapperBase.writeBlocking(SocketWrapperBase.java:584)
    at org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:528)
    at org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.doWrite(Http11OutputBuffer.java:546)
    at org.apache.coyote.http11.filters.IdentityOutputFilter.doWrite(IdentityOutputFilter.java:73)
    at org.apache.coyote.http11.Http11OutputBuffer.doWrite(Http11OutputBuffer.java:193)
    at org.apache.coyote.Response.doWrite(Response.java:601)
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:339)
    ... 92 common frames omitted
2021-12-07 17:50:12,606 WARN Failure in @ExceptionHandler private org.springframework.http.ResponseEntity<java.lang.String> com.alibaba.nacos.console.exception.ConsoleExceptionHandler.handleException(java.lang.Exception)

最终FullGC自动堆转储,MAT分析可疑内存泄露对象org.apache.http.impl.nio.conn.CPool占了7G内存: nacos-memory-leak

回答

8

nacos-memory-leak-1

5
@ControllerAdvice
public class ConsoleExceptionHandler {
...
    @ExceptionHandler(Exception.class)
    private ResponseEntity<String> handleException(Exception e) {
        LOGGER.error("CONSOLE", e);
        <=== ClientAbortException时应返回null
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(ExceptionUtil.getAllExceptionMsg(e));
    }
...
4

还有一个问题是nacos使用的sofa组件所依赖的netty版本有Received DATA frame for an unknown stream缺陷, 详见: https://github.com/grpc/grpc-java/issues/3548 https://github.com/netty/netty/pull/9798 grpc-java的1.28版本做了修复。 下面的告警信息是和前面所说的ClientAbortException一起出现:

2021-12-08 02:53:55,908 WARN Stream Error

io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception$StreamException: Received DATA frame for an unknown stream 13
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:147)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.shouldIgnoreHeadersOrDataFrame(DefaultHttp2ConnectionDecoder.java:591)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onDataRead(DefaultHttp2ConnectionDecoder.java:239)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onDataRead(Http2InboundFrameLogger.java:48)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readDataFrame(DefaultHttp2FrameReader.java:422)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:251)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:160)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:174)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:378)
    at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438)
    at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
    at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
    at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
    at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421)
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
    at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:794)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:424)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:326)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
    at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

问题是nacos用的sofa组件rpc-grpc-impl现在发布的版本都是依赖的grpc-java 1.17.0,不知道哪个会兼容grpc-java 1.28.0+, 试了几种版本组合,都存在不同的问题,比如: nacos-NoClassDefFoundError

nacos-ExceptionInInitializerError

nacos-ConsistencyException

最后试到了rpc-grpc-impl1.3.9 + grpc-java 1.28.0,运行过程中不再提示上述异常了,期间有不到1%的心跳包一直返回:

server is DOWNnow, detailed error message: Optional[The raft peer is in error: ERROR_TYPE_LOG]

后来查看集群节点元数据,发现有节点元数据里的naming_persistent_service有Error: [type=ERROR_TYPE_LOG, status=[EIO<1014>: Failed operation in LogStorage]],停掉对应节点的服务,删除节点上的data/raft/naming_persistent_service目录,然后重启服务就恢复正常了。

1

ConsoleExceptionHandler加了ClientAbortException的handler之后,还是Full GC,但是可疑对象从之前的CPool变成了LinkedList$Node : nacos-LinkedList-memory-leak nacos-LinkedList-memory-leak-1

8

一旦开始Full GC就无休止Full GC了,客户端全部停止了,nacos服务还是不间断地做Full GC,但是做也是白做,已经占用的内存一点也没有释放,控制台页面打不开,tcp请求没响应,和死了没有区别。

8

在创建Cluster时增加了Synchronized (service)之后还是会Full GC,但是堆的样子变了,原来的单个大陀分成了4份,其中最大的一份还是CPool里的leasingRequests: nacos-leasingRequests-memory-leak nacos-leasingRequests-memory-leak-1

增加同步块是压测时发现心跳请求积压时,会同时收到多个相同service的心跳,然后去创建了多个Cluster对象。

3

查了堆里残留的几个leasingRequest,deadline都是LONG.MAX_VALUE,不会被自动清理,只进不出把内存撑爆了。 nacos-leasingRequests-deadline-MAX PR #7376 做了针对性修订,初步压测通过,这个问题应该搞定了。

2

这个问题的根源应该是节点之间转发临时实例心跳包的连接池太小了,每个节点的最大连接数只有cpu核数,临时实例数量多的应用场景这么小的连接数根本就不够用,需要调整HttpClientManager的ApacheSyncHttpClientFactory的buildHttpClientConfig方法,调大连接池。

6

so,这个问题的根本原因是什么?可以怎么最小化解决?

2

如果你指我们压测为什么出现OOM,那么是因为持续不断地从连接池获取连接的海量请求多数都得不到满足,获取连接没有设置超时控制,这些得不到满足的请求堆积起来最终把内存耗尽。 如果你指为什么获取不了连接,那么是因为nacos连接池只有2倍的cpu核数,我们的注册中心各节点时16c,那么就是32个连接,nacos各节点之间需要转发临时实例心跳包到责任节点,我们压测的临时实例数量在30万-100万之间,节点之间建立连接的数量持续在500以上,32个连接远远不够用,所以这些获取连接的请求会迅速堆积。

7

请问7376的fix里只修改了timeout的设置,包不包含增加连接池?

6

目前我们用的是自己的修订版,基于版本1.4.2自己修改的,按照社区管理团队答复说,现在只修复版本2,除了重大缺陷外,版本1不会再做维护,因为nacos版本2支持不了我们的应用场景,所以下一步要换其他注册中心的软件了。 这个issue当时按照我们的修改内容拆了很多PR出来,其中哪些PR合并了、哪些没有合并我们没有再关注了,需要你自己查一下了。