Using a 2.0.0 raft-mode cluster, the server keeps reporting Decode frame error after the client's RM and TM connect successfully #6532

Open
1 task done
zacharias1989 opened this issue May 12, 2024 · 9 comments · May be fixed by #6551
Labels
good first issue Good for newcomers type: bug Category issues or prs related to bug.

Comments

@zacharias1989

zacharias1989 commented May 12, 2024

  • I have searched the issues of this repository and believe that this is not a duplicate.

Ⅰ. Issue Description

In a Kubernetes environment I created a raft-mode cluster from the official 2.0.0-slim image. The business-system client connects normally, the seata server log shows that both RM and TM register success, and client and server are both version 2.0.0. The server then keeps reporting "Decode frame error, cause: Adjusted frame length exceeds 8388608: 1411395437 - discarded", and the client correspondingly reports read timed out.

Ⅱ. Describe what happened

If there is an exception, please attach the exception trace:
11:44:24.357 INFO --- [rverHandlerThread_1_1_500] [rocessor.server.RegRmProcessor] [ onRegRmMessage] [] : RM register success,message:RegisterRMRequest{resourceIds='jdbc:mysql://192.168.0.162:3306/test', version='2.0.0', applicationId='test-service', transactionServiceGroup='default_tx_group', extraData='null'},channel:[id: 0x3fd908a9, L:/10.0.0.40:8091 - R:/10.0.0.41:55964],client version:2.0.0
11:44:51.374 ERROR --- [ettyServerNIOWorker_1_1_2] [rpc.netty.v1.ProtocolV1Decoder] [ decode] [] : Decode frame error, cause: Adjusted frame length exceeds 8388608: 1411395437 - discarded
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [ userEventTriggered] [] : channel:[id: 0xc8556689, L:/10.0.0.40:8091 - R:/10.0.0.41:56324] read idle.
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [ handleDisconnect] [] : 10.0.0.41:56324 to server channel inactive.
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [ handleDisconnect] [] : remove unused channel:[id: 0xc8556689, L:/10.0.0.40:8091 - R:/10.0.0.41:56324]
11:45:06.378 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [hannelHandlerContext] [] : closeChannelHandlerContext channel:[id: 0xc8556689, L:/10.0.0.40:8091 - R:/10.0.0.41:56324]
11:45:06.379 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [ handleDisconnect] [] : 10.0.0.41:56324 to server channel inactive.
11:45:06.380 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [ handleDisconnect] [] : remove unused channel:[id: 0xc8556689, L:/10.0.0.40:8091 ! R:/10.0.0.41:56324]
11:45:07.383 ERROR --- [ettyServerNIOWorker_1_2_2] [rpc.netty.v1.ProtocolV1Decoder] [ decode] [] : Decode frame error, cause: Adjusted frame length exceeds 8388608: 539979109 - discarded
11:45:08.384 INFO --- [ettyServerNIOWorker_1_2_2] [ty.AbstractNettyRemotingServer] [ handleDisconnect] [] : 10.0.0.41:56516 to server channel inactive.
11:45:08.384 INFO --- [ettyServerNIOWorker_1_2_2] [ty.AbstractNettyRemotingServer] [ handleDisconnect] [] : remove unused channel:[id: 0xf3946fc7, L:0.0.0.0/0.0.0.0:8091 ! R:/10.0.0.41:56516]
11:45:24.398 ERROR --- [ettyServerNIOWorker_1_1_2] [rpc.netty.v1.ProtocolV1Decoder] [ decode] [] : Decode frame error, cause: Adjusted frame length exceeds 8388608: 1411395437 - discarded
11:45:39.399 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [ userEventTriggered] [] : channel:[id: 0x86a2d5a8, L:/10.0.0.40:8091 - R:/10.0.0.41:56708] read idle.
11:45:39.400 INFO --- [ettyServerNIOWorker_1_1_2] [ty.AbstractNettyRemotingServer] [ handleDisconnect] [] : 10.0.0.41:56708 to server channel inactive.
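
A note on the two frame lengths in this log: they decode to the ASCII bytes of HTTP request lines. Seata's ProtocolV1Decoder is a length-field-based frame decoder that reads a 4-byte big-endian length at offset 3 of the frame (after the 2-byte magic and 1-byte version), so when a plain HTTP request reaches the Netty port, bytes 3-6 of the request line are read as the frame length. A minimal sketch (the request paths are illustrative assumptions):

```java
import java.nio.charset.StandardCharsets;

public class FrameLengthDemo {
    // Read 4 bytes at `offset` as a big-endian int, the way a
    // LengthFieldBasedFrameDecoder(lengthFieldOffset=3, lengthFieldLength=4) would.
    static int lengthField(String requestLine, int offset) {
        byte[] b = requestLine.getBytes(StandardCharsets.US_ASCII);
        return ((b[offset] & 0xFF) << 24) | ((b[offset + 1] & 0xFF) << 16)
                | ((b[offset + 2] & 0xFF) << 8) | (b[offset + 3] & 0xFF);
    }

    public static void main(String[] args) {
        // "GET /me..." -> bytes 3..6 are ' ', '/', 'm', 'e' -> 0x202F6D65
        System.out.println(lengthField("GET /metadata/v1/cluster", 3));  // 539979109
        // "POST /m..." -> bytes 3..6 are 'T', ' ', '/', 'm' -> 0x54202F6D
        System.out.println(lengthField("POST /metadata/v1/watch", 3));   // 1411395437
    }
}
```

Both values exceed the 8388608-byte frame limit, so the frame is discarded and the HTTP client never receives a response, which matches the read timeouts reported below.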

Ⅲ. Describe what you expected to happen

Since the TM and RM have already registered successfully, the configuration should be fine. Nothing else was done, so these errors should not be happening.

Ⅳ. How to reproduce it (as minimally and precisely as possible)

  1. xxx
  2. xxx
  3. xxx

Minimal yet complete reproducer code (or URL to code):

Ⅴ. Anything else we need to know?

Ⅵ. Environment:

  • JDK version(e.g. java -version):
  • Seata client/server version:
  • Database version:
  • OS(e.g. uname -a):
  • Others:
@slievrly
Member

Is a health check configured for port 8091? Is data consistency normal?

@zacharias1989
Author

Is a health check configured for port 8091? Is data consistency normal?

What health check? According to the official docs, doesn't Seata only send an empty heartbeat packet? Why would there be a health check?

@zacharias1989
Author

Is a health check configured for port 8091? Is data consistency normal?

I have not configured any additional health check. From the client-side logs, this appears to be a watch request initiated by the client, but the frame was too large and was discarded by the server, so the client reports a timeout.
2024-05-10 19:19:25.817 ERROR 1 --- [eshMetadata_1_1] i.s.d.r.raft.RaftRegistryServiceImpl : watch cluster node: 10.0.0.146:8091, fail: 10.0.0.146:8091 failed to respond
2024-05-10 19:19:41.823 ERROR 1 --- [eshMetadata_1_1] i.s.d.r.raft.RaftRegistryServiceImpl : watch cluster node: 10.0.0.145:8091, fail: 10.0.0.145:8091 failed to respond
2024-05-10 19:19:43.827 ERROR 1 --- [eshMetadata_1_1] i.s.d.r.raft.RaftRegistryServiceImpl : Read timed out

java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.8.0_212]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[na:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[na:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[na:1.8.0_212]
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[httpcore-4.4.15.jar!/:4.4.15]
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[httpcore-4.4.15.jar!/:4.4.15]
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[httpcore-4.4.15.jar!/:4.4.15]
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.15.jar!/:4.4.15]
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.15.jar!/:4.4.15]
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.15.jar!/:4.4.15]
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.15.jar!/:4.4.15]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar!/:4.5.13]
at io.seata.common.util.HttpClientUtil.doGet(HttpClientUtil.java:120) ~[seata-all-2.0.0.jar!/:2.0.0]
at io.seata.discovery.registry.raft.RaftRegistryServiceImpl.acquireClusterMetaData(RaftRegistryServiceImpl.java:294) [seata-all-2.0.0.jar!/:2.0.0]
at io.seata.discovery.registry.raft.RaftRegistryServiceImpl.lambda$null$0(RaftRegistryServiceImpl.java:153) [seata-all-2.0.0.jar!/:2.0.0]
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[na:1.8.0_212]
at java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3527) ~[na:1.8.0_212]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[na:1.8.0_212]
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) ~[na:1.8.0_212]
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) ~[na:1.8.0_212]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[na:1.8.0_212]
at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) ~[na:1.8.0_212]
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) ~[na:1.8.0_212]
at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) ~[na:1.8.0_212]
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174) ~[na:1.8.0_212]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) ~[na:1.8.0_212]
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[na:1.8.0_212]
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) ~[na:1.8.0_212]
at io.seata.discovery.registry.raft.RaftRegistryServiceImpl.lambda$startQueryMetadata$1(RaftRegistryServiceImpl.java:151) [seata-all-2.0.0.jar!/:2.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.77.Final.jar!/:4.1.77.Final]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]

@zacharias1989
Author

rpc.netty.v1.ProtocolV1Decoder

I also see the following WARN in the seata-server raft cluster log:
WARN --- [ main] [ay.sofa.jraft.RaftGroupService] [ start] [] : RPC server is not started in RaftGroupService.
Not sure whether it has any effect.

@zacharias1989
Author

The server side uses the following Kubernetes StatefulSet configuration:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: seata-raft
  namespace: seata
spec:
  podManagementPolicy: Parallel
  serviceName: seata-headless
  replicas: 3
  template:
    metadata:
      labels:
        app: seata-raft-svc
        env: prod
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
    spec:
      affinity:
        podAntiAffinity:
          ## prefer not to schedule replicas onto the same node
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100 # highest priority
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: "app"
                      operator: In
                      values:
                        - seata-raft-svc
                topologyKey: "kubernetes.io/hostname"
      volumes:
        - name: host-time
          hostPath:
            path: /etc/localtime
            type: ''
        - name: seata-raft-cm
          configMap:
            name: seata-raft-cm
            items:
              - key: application.yml
                path: application.yml
            defaultMode: 420
      containers:
        - name: seata-raft
          imagePullPolicy: IfNotPresent
          image: docker.io/seataio/seata-server:2.0.0-slim
          resources: {}
          ports:
            - name: server
              containerPort: 7091
              protocol: TCP
            - name: cluster
              containerPort: 8091
              protocol: TCP
          env:
            # maps to seata.server.raft.server-addr
            - name: SEATA_SERVER_RAFT_SERVER_ADDR
              value: seata-raft-0.seata-headless.seata.svc.cluster.local:9091,seata-raft-1.seata-headless.seata.svc.cluster.local:9091,seata-raft-2.seata-headless.seata.svc.cluster.local:9091
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: SEATA_IP
              value: $(POD_NAME).seata-headless.seata.svc.cluster.local
          volumeMounts:
            - name: host-time
              mountPath: /etc/localtime
              readOnly: true
            - name: seata-raft-cm
              readOnly: true
              mountPath: /seata-server/resources/application.yml
              subPath: application.yml
            - name: data
              mountPath: /seata-server/sessionStore
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
  volumeClaimTemplates:
    - metadata:
        name: data
        namespace: seata
      spec:
        accessModes: [ "ReadWriteMany" ]
        storageClassName: "sfsturbo-subpath-sc"
        resources:
          requests:
            storage: 10Gi
        selector:
          matchLabels:
            app: seata-raft-svc
            env: prod
```

```yaml
apiVersion: v1
kind: Service
metadata:
  name: seata-headless
  namespace: seata
spec:
  publishNotReadyAddresses: true
  ports:
    - port: 7091
      name: server
      targetPort: 7091
    - port: 8091
      name: cluster
      targetPort: 8091
  clusterIP: None
  selector:
    app: seata-raft-svc
    env: prod
```
The configuration file contents are as follows:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: seata-raft-cm
  namespace: seata
data:
  application.yml: |
    server:
      port: 7091

    spring:
      application:
        name: seata-server

    logging:
      config: classpath:logback-spring.xml
      file:
        path: ${log.home:${user.home}/logs/seata}
      extend:
        logstash-appender:
          destination: 127.0.0.1:4560
        kafka-appender:
          bootstrap-servers: 127.0.0.1:9092
          topic: logback_to_logstash

    console:
      user:
        username: seata
        password: seata
    seata:
      server:
        raft:
          group: default # the group of this raft cluster; the client's transaction group must map to it
          server-addr: ${SEATA_SERVER_RAFT_SERVER_ADDR} # actual value: seata-raft-0.seata-headless.seata.svc.cluster.local:9091,seata-raft-1.seata-headless.seata.svc.cluster.local:9091,seata-raft-2.seata-headless.seata.svc.cluster.local:9091
          snapshot-interval: 600 # snapshot every 600s so the raft log rolls quickly; with lots of in-memory transaction data each snapshot causes an RT jitter every 600s, but it makes failure recovery friendlier and node restarts faster. Can be raised to 30 minutes or 1 hour depending on the business; load-test for jitter and find a balance between RT jitter and recovery time
          apply-batch: 32 # commit to the raft log after at most 32 batched actions
          max-append-bufferSize: 262144 # maximum size of the log-store write buffer, default 256K
          max-replicator-inflight-msgs: 256 # maximum number of in-flight requests when pipelining is enabled, default 256
          disruptor-buffer-size: 16384 # size of the internal disruptor buffer; raise it for write-heavy workloads, default 16384
          election-timeout-ms: 1000 # how long without a leader heartbeat before a re-election starts
          reporter-enabled: false # whether raft's own metrics are enabled
          reporter-initial-delay: 60 # metrics reporting interval
          serialization: jackson # serialization format, do not change
          compressor: none # raft-log compression, e.g. gzip, zstd
          sync: true # raft-log flush mode, synchronous by default
      config:
        # support: nacos, consul, apollo, zk, etcd3
        type: file
      registry:
        # support: nacos, eureka, redis, zk, consul, etcd3, sofa
        type: file
      store:
        # support: file, db, redis, raft
        mode: raft
        file:
          dir: sessionStore
      #  server:
      #    service-port: 8091 # If not configured, the default is '${server.port} + 1000'
      security:
        secretKey: xxxx
        tokenValidityInMilliseconds: 1800000
        ignore:
          urls: /,/**/*.css,/**/*.js,/**/*.html,/**/*.map,/**/*.svg,/**/*.png,/**/*.jpeg,/**/*.ico,/api/v1/auth/login,/metadata/v1/**
```
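
A note on the port layout this configuration implies: server.port is 7091; per the commented default above, the Netty transaction port defaults to server.port + 1000 (8091); and the :9091 raft peer addresses are consistent with Seata's documented raft convention of Netty port + 1000 (treat that +1000 convention as an assumption here). A sketch of the arithmetic:

```java
public class SeataPortLayout {
    public static void main(String[] args) {
        int httpPort = 7091;               // server.port: HTTP console/metadata port
        int nettyPort = httpPort + 1000;   // service-port default '${server.port} + 1000' -> 8091
        int raftPort = nettyPort + 1000;   // raft peer port in server-addr (assumed convention) -> 9091
        System.out.printf("http=%d, netty=%d, raft=%d%n", httpPort, nettyPort, raftPort);
    }
}
```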

@zacharias1989
Author

The client configuration file is:

```yaml
server:
  port: 8080

seata:
  enabled: true
  registry:
    type: raft
    raft:
      server-addr: ${SEATA_SERVER_RAFT_SERVER_ADDR} # actual value: seata-raft-0.seata-headless.seata.svc.cluster.local:7091,seata-raft-1.seata-headless.seata.svc.cluster.local:7091,seata-raft-2.seata-headless.seata.svc.cluster.local:7091
  tx-service-group: default_tx_group
  service:
    vgroup-mapping:
      default_tx_group: default
  application-id: ${spring.application.name}
```

@funky-eyes
Contributor

funky-eyes commented May 13, 2024

@funky-eyes
Contributor

String host = inetSocketAddress.getAddress().getHostAddress();
This usage is problematic: when inetSocketAddress is created via new InetSocketAddress(hostname, port), getHostAddress returns the resolved IP address, which causes the problem described in this issue.
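
A small demonstration of the distinction (the hostname below is the one from the reporter's configuration; whether it resolves depends on the environment):

```java
import java.net.InetSocketAddress;

public class HostResolutionDemo {
    public static void main(String[] args) {
        InetSocketAddress addr =
                new InetSocketAddress("seata-raft-0.seata-headless.seata.svc.cluster.local", 9091);
        // getHostString() returns the literal the address was created with; no lookup.
        System.out.println(addr.getHostString());
        // getAddress() is non-null only if the hostname resolved; getHostAddress()
        // then returns the resolved IP, which no longer matches the hostname
        // stored in the raft node metadata.
        if (addr.getAddress() != null) {
            System.out.println(addr.getAddress().getHostAddress());
        }
    }
}
```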

@funky-eyes funky-eyes added type: bug Category issues or prs related to bug. good first issue Good for newcomers labels May 13, 2024
@funky-eyes
Contributor

String host = inetSocketAddress.getAddress().getHostAddress(); This usage is problematic: when inetSocketAddress is created via new InetSocketAddress(hostname, port), getHostAddress returns the resolved IP address, causing the problem described in this issue.

Simply changing it to getHostString is not enough, because of how the address flows through: the raft side hands out the node converted to InetSocketAddress(hostname, port) -> netty converts it to a string, performing DNS resolution and producing InetSocketAddress(ip, port) -> the health check runs -> the resolved InetSocketAddress is handed back.
So getHostString may yield an IP that cannot be matched against the host in the node metadata, and when the match fails the client falls back to port 8091. Quite a few places are involved here.
Option 1: in the raft implementation's queryHttpAddress comparison logic, construct InetSocketAddress objects on both sides and compare them via getHostAddress.
Option 2: change toStringAddress in NetUtil to use getHostString so no DNS resolution happens; the raft implementation can then compare getHostString directly against the host in the node metadata. A hostname stays a hostname, and if the raft cluster is assembled from IPs it stays an IP. No DNS resolution is involved, which is both more efficient and more consistent. (See the sketch below.)
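
A rough sketch of what option 2's literal comparison could look like (toStringAddress and matchesNode below are illustrative stand-ins, not the actual NetUtil or RaftRegistryServiceImpl code):

```java
import java.net.InetSocketAddress;

// Sketch of option 2: compare addresses literally, never triggering DNS resolution.
public final class AddressCompare {

    // Stand-in for a NetUtil.toStringAddress-style helper.
    static String toStringAddress(InetSocketAddress address) {
        // getHostString() keeps whatever literal the address was created with:
        // a hostname stays a hostname, an IP literal stays an IP.
        return address.getHostString() + ":" + address.getPort();
    }

    // Stand-in for the node-metadata comparison in the raft registry.
    static boolean matchesNode(InetSocketAddress channelAddress, String nodeHost, int nodePort) {
        return channelAddress.getHostString().equals(nodeHost)
                && channelAddress.getPort() == nodePort;
    }

    public static void main(String[] args) {
        InetSocketAddress a =
                new InetSocketAddress("seata-raft-0.seata-headless.seata.svc.cluster.local", 9091);
        System.out.println(toStringAddress(a));
        System.out.println(matchesNode(a, "seata-raft-0.seata-headless.seata.svc.cluster.local", 9091));
    }
}
```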
