
[Bug] [Module Name] YARN cluster crashes #576

Open
2 of 3 tasks
yuhuang123456 opened this issue Jul 12, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@yuhuang123456

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

The YARN cluster crashes a short while after starting.
(screenshot: yarnerr)

What you expected to happen

I'm not sure whether something was missed when adapting the 3.3.6 release package.

How to reproduce

On the 1.2.1 branch, with the Hadoop 3.3.6 package downloaded from the official site, I did the following:

  1. cp /hadoop-3.3.3/etc/hadoop/fair-scheduler.xml /datasophon//hadoop-3.3.6/etc/hadoop/
  2. cd /datasophon/hadoop-3.3.3/etc/hadoop/ and created two empty files: blacklist and whitelist.
     HDFS installed and runs normally.
     (screenshot: hdfs)
  3. The YARN cluster crashes a short while after starting.
     (screenshot: yarnerr)
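The file carry-over in steps 1 and 2 can be sketched as a small shell script. Paths mirror the report but point at a scratch tree here so the sketch runs standalone; the copied file is a stand-in, not a real allocation file, and the blacklist/whitelist files are created under the 3.3.6 tree (the report says the 3.3.3 tree, which is presumably a typo):

```shell
#!/bin/sh
set -eu

# Scratch tree standing in for /datasophon (per the report's paths).
BASE=$(mktemp -d)
OLD="$BASE/hadoop-3.3.3/etc/hadoop"   # /datasophon/hadoop-3.3.3/etc/hadoop
NEW="$BASE/hadoop-3.3.6/etc/hadoop"   # /datasophon/hadoop-3.3.6/etc/hadoop
mkdir -p "$OLD" "$NEW"
echo '<allocations/>' > "$OLD/fair-scheduler.xml"   # stand-in allocation file

# Step 1: carry the FairScheduler allocation file over to the 3.3.6 tree.
cp "$OLD/fair-scheduler.xml" "$NEW/"

# Step 2: create the two empty node-list files referenced by the configs.
touch "$NEW/blacklist" "$NEW/whitelist"

ls "$NEW"
```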

The log shows no errors.
(screenshot: log)

After each restart, the log shows the previous shutdown was a kill -15, e.g.:
如:2024-07-12 15:58:57,029 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node ddp3:45454 cluster capacity: <memory:12144, vCores:6>
2024-07-12 16:02:34,726 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: RECEIVED SIGNAL 15: SIGTERM
2024-07-12 16:02:34,733 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2024-07-12 16:02:34,737 INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.w.WebAppContext@516592b1{cluster,/,null,STOPPED}{jar:file:/datasophon/hadoop-3.3.6/share/hadoop/yarn/hadoop-yarn-common-3.3.6.jar!/webapps/cluster}
2024-07-12 16:02:34,742 INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@464a4442{HTTP/1.1, (http/1.1)}{ddp4:8088}
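The "RECEIVED SIGNAL 15: SIGTERM" line means some external process sent the ResourceManager a plain kill -15; it is not a JVM-internal crash (plausible senders include the management agent restarting the service, or an operator script). A minimal stand-in demonstration of that signal's semantics:

```shell
#!/bin/sh
# SIGTERM (signal 15) is always delivered from outside via kill(2); a
# process that dies this way exits with status 128 + 15 = 143.
sleep 300 &
pid=$!
kill -15 "$pid"        # the same signal the ResourceManager received
wait "$pid"
status=$?
echo "exit status: $status"   # 128 + 15 = 143
```

For the report's case, the interesting question is which process issued that kill; on Linux, an auditd rule on the kill syscall can identify the sender.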

ps -ef shows the NN and NM processes are still running, and YARN service state is still visible via commands:
[hdfs@ddp4 datasophon]$ yarn node -list -all
2024-07-12 16:52:47,106 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
ddp4:45454 RUNNING ddp4:8042 0
ddp1:45454 RUNNING ddp1:8042 0
ddp3:45454 RUNNING ddp3:8042 0
[hdfs@ddp4 datasophon]$ yarn rmadmin -getAllServiceState
ddp1:8033 standby
ddp4:8033 active
[hdfs@ddp4 datasophon]$ ping ddp1
PING ddp1 (xxxx) 56(84) bytes of data.
64 bytes from ddp1 (xxxx): icmp_seq=1 ttl=64 time=16.6 ms
64 bytes from ddp1 (xxxx): icmp_seq=2 ttl=64 time=8.33 ms
^C
--- ddp1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 8.337/12.510/16.684/4.174 ms
[hdfs@ddp4 datasophon]$ ping ddp3
PING ddp3 (1xxxx) 56(84) bytes of data.
64 bytes from ddp3 (xxxx): icmp_seq=1 ttl=64 time=1.72 ms
64 bytes from ddp3 (xxxx): icmp_seq=2 ttl=64 time=0.540 ms

Every tab of the RM 8088 admin page shows an error.
(screenshot: 8088-1)

Anything else

No response

Version

main

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@yuhuang123456 yuhuang123456 added the bug Something isn't working label Jul 12, 2024
@datasophon
Member

You did not use DDP within the scope

@yuhuang123456
Author

> You did not use DDP within the scope

Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.

@chenss-1
Contributor

chenss-1 commented Aug 1, 2024

> You did not use DDP within the scope
>
> Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.

Without JMX monitoring added, the status may not be collected, and errors are displayed.

@yuhuang123456
Author

> You did not use DDP within the scope
>
> Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.
>
> Without JMX monitoring added, the status may not be collected, and errors are displayed.

The jmx directory was copied over from 3.3.3, but prometheus_config.yml is empty. HDFS monitoring works fine; YARN monitoring does not.
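For context, the prometheus_config.yml mentioned here is presumably the rules file consumed by the jmx_prometheus_javaagent. As a generic illustration only (not DataSophon's actual file, which may define specific metric rules), a minimal configuration that lowercases names and exports all MBeans looks like:

```yaml
# Generic jmx_prometheus_javaagent rules file (illustrative only; the file
# shipped with DataSophon's hadoop-3.3.3 package may differ).
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: ".*"
```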

@yuhuang123456
Author

> You did not use DDP within the scope
>
> Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.
>
> Without JMX monitoring added, the status may not be collected, and errors are displayed.

The ranger-hdfs-plugin directory, which sits alongside jmx, was also copied over. On the surface HDFS can upload files normally and the MapReduce example job runs.

@chenss-1
Contributor

chenss-1 commented Aug 2, 2024

> You did not use DDP within the scope
>
> Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.
>
> Without JMX monitoring added, the status may not be collected, and errors are displayed.
>
> The ranger-hdfs-plugin directory, which sits alongside jmx, was also copied over. On the surface HDFS can upload files normally and the MapReduce example job runs.

Check whether your yarn-env.sh configures JMX, then check whether there is a nodemanager entry under configs in your Prometheus. If both are present, check whether your YARN process is the one you newly started or a leftover process from a previous installation.
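The first of these checks can be scripted. The sample below writes a yarn-env.sh containing the kind of javaagent line a JMX-enabled install typically carries (the jar path and port numbers are assumptions, not DataSophon's exact values), then greps for it; to check a real install, point CONF at /datasophon/hadoop-3.3.6/etc/hadoop instead:

```shell
#!/bin/sh
set -eu

# Sample config dir; replace with /datasophon/hadoop-3.3.6/etc/hadoop to
# check a real installation.
CONF=$(mktemp -d)
cat > "$CONF/yarn-env.sh" <<'EOF'
# Illustrative only: how a JMX-exporter-enabled yarn-env.sh typically
# attaches the javaagent (jar path and ports are assumptions).
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -javaagent:/opt/jmx_prometheus_javaagent.jar=9323:/opt/prometheus_config.yml"
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -javaagent:/opt/jmx_prometheus_javaagent.jar=9324:/opt/prometheus_config.yml"
EOF

if grep -q javaagent "$CONF/yarn-env.sh"; then
  result="JMX javaagent configured"
else
  result="JMX javaagent MISSING from yarn-env.sh"
fi
echo "$result"
```

The second and third checks amount to a grep for nodemanager under Prometheus's configs directory and a look at the start time of the running ResourceManager (e.g. ps -o pid,lstart,cmd) to rule out a leftover process.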

@yuhuang123456
Author

> You did not use DDP within the scope
>
> Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.
>
> Without JMX monitoring added, the status may not be collected, and errors are displayed.
>
> The ranger-hdfs-plugin directory, which sits alongside jmx, was also copied over. On the surface HDFS can upload files normally and the MapReduce example job runs.
>
> Check whether your yarn-env.sh configures JMX, then check whether there is a nodemanager entry under configs in your Prometheus. If both are present, check whether your YARN process is the one you newly started or a leftover process from a previous installation.

Thanks.

Projects
None yet
Development

No branches or pull requests

3 participants