
[Bug] [Module Name] YARN cluster crashes #576

Open
2 of 3 tasks
yuhuang123456 opened this issue Jul 12, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@yuhuang123456

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

The YARN cluster crashes a short while after starting.
(screenshot: yarnerr)

What you expected to happen

I'm not sure whether something was missed when adapting the 3.3.6 release package.

How to reproduce

On the 1.2.1 branch, with the Hadoop 3.3.6 package downloaded from the official site, I did the following:

  1. cp /hadoop-3.3.3/etc/hadoop/fair-scheduler.xml /datasophon//hadoop-3.3.6/etc/hadoop/
  2. cd /datasophon/hadoop-3.3.3/etc/hadoop/ and created two empty files: blacklist and whitelist.
     HDFS installed and runs normally.
     (screenshot: hdfs)
  3. The YARN cluster crashes a short while after starting.
     (screenshot: yarnerr)
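The file carry-over in steps 1 and 2 can be sketched as a small shell script. Paths mirror the report but point at a scratch tree here so the sketch runs standalone; the copied file is a stand-in, not a real allocation file, and the blacklist/whitelist files are created under the 3.3.6 tree (the report says the 3.3.3 tree, which is presumably a typo):

```shell
#!/bin/sh
set -eu

# Scratch tree standing in for /datasophon (per the report's paths).
BASE=$(mktemp -d)
OLD="$BASE/hadoop-3.3.3/etc/hadoop"   # /datasophon/hadoop-3.3.3/etc/hadoop
NEW="$BASE/hadoop-3.3.6/etc/hadoop"   # /datasophon/hadoop-3.3.6/etc/hadoop
mkdir -p "$OLD" "$NEW"
echo '<allocations/>' > "$OLD/fair-scheduler.xml"   # stand-in allocation file

# Step 1: carry the FairScheduler allocation file over to the 3.3.6 tree.
cp "$OLD/fair-scheduler.xml" "$NEW/"

# Step 2: create the two empty node-list files referenced by the configs.
touch "$NEW/blacklist" "$NEW/whitelist"

ls "$NEW"
```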

The log shows no errors.
(screenshot: log)

After each restart, the log shows the previous shutdown was a kill -15, e.g.:
如:2024-07-12 15:58:57,029 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node ddp3:45454 cluster capacity: <memory:12144, vCores:6>
2024-07-12 16:02:34,726 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: RECEIVED SIGNAL 15: SIGTERM
2024-07-12 16:02:34,733 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2024-07-12 16:02:34,737 INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.w.WebAppContext@516592b1{cluster,/,null,STOPPED}{jar:file:/datasophon/hadoop-3.3.6/share/hadoop/yarn/hadoop-yarn-common-3.3.6.jar!/webapps/cluster}
2024-07-12 16:02:34,742 INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@464a4442{HTTP/1.1, (http/1.1)}{ddp4:8088}
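The "RECEIVED SIGNAL 15: SIGTERM" line means some external process sent the ResourceManager a plain kill -15; it is not a JVM-internal crash (plausible senders include the management agent restarting the service, or an operator script). A minimal stand-in demonstration of that signal's semantics:

```shell
#!/bin/sh
# SIGTERM (signal 15) is always delivered from outside via kill(2); a
# process that dies this way exits with status 128 + 15 = 143.
sleep 300 &
pid=$!
kill -15 "$pid"        # the same signal the ResourceManager received
wait "$pid"
status=$?
echo "exit status: $status"   # 128 + 15 = 143
```

For the report's case, the interesting question is which process issued that kill; on Linux, an auditd rule on the kill syscall can identify the sender.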

ps -ef shows the NN and NM processes are still running, and YARN service state is still visible via commands:
[hdfs@ddp4 datasophon]$ yarn node -list -all
2024-07-12 16:52:47,106 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
ddp4:45454 RUNNING ddp4:8042 0
ddp1:45454 RUNNING ddp1:8042 0
ddp3:45454 RUNNING ddp3:8042 0
[hdfs@ddp4 datasophon]$ yarn rmadmin -getAllServiceState
ddp1:8033 standby
ddp4:8033 active
[hdfs@ddp4 datasophon]$ ping ddp1
PING ddp1 (xxxx) 56(84) bytes of data.
64 bytes from ddp1 (xxxx): icmp_seq=1 ttl=64 time=16.6 ms
64 bytes from ddp1 (xxxx): icmp_seq=2 ttl=64 time=8.33 ms
^C
--- ddp1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 8.337/12.510/16.684/4.174 ms
[hdfs@ddp4 datasophon]$ ping ddp3
PING ddp3 (1xxxx) 56(84) bytes of data.
64 bytes from ddp3 (xxxx): icmp_seq=1 ttl=64 time=1.72 ms
64 bytes from ddp3 (xxxx): icmp_seq=2 ttl=64 time=0.540 ms

Every tab of the RM 8088 admin page shows an error.
(screenshot: 8088-1)

Anything else

No response

Version

main

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@yuhuang123456 yuhuang123456 added the bug Something isn't working label Jul 12, 2024
@datasophon
Member

You did not use DDP within the scope

@yuhuang123456
Author

> You did not use DDP within the scope

Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.

@chenss-1
Contributor

chenss-1 commented Aug 1, 2024

> You did not use DDP within the scope
>
> Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.

Without JMX monitoring added, the status may not be collected, and errors are displayed.

@yuhuang123456
Author

> You did not use DDP within the scope
>
> Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.
>
> Without JMX monitoring added, the status may not be collected, and errors are displayed.

The jmx directory was copied over from 3.3.3, but prometheus_config.yml is empty. HDFS monitoring works fine; YARN monitoring does not.
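For context, the prometheus_config.yml mentioned here is presumably the rules file consumed by the jmx_prometheus_javaagent. As a generic illustration only (not DataSophon's actual file, which may define specific metric rules), a minimal configuration that lowercases names and exports all MBeans looks like:

```yaml
# Generic jmx_prometheus_javaagent rules file (illustrative only; the file
# shipped with DataSophon's hadoop-3.3.3 package may differ).
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: ".*"
```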

@yuhuang123456
Author

> You did not use DDP within the scope
>
> Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.
>
> Without JMX monitoring added, the status may not be collected, and errors are displayed.

The ranger-hdfs-plugin directory, which sits alongside jmx, was also copied over. On the surface HDFS can upload files normally and the MapReduce example job runs.

@chenss-1
Contributor

chenss-1 commented Aug 2, 2024

> You did not use DDP within the scope
>
> Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.
>
> Without JMX monitoring added, the status may not be collected, and errors are displayed.
>
> The ranger-hdfs-plugin directory, which sits alongside jmx, was also copied over. On the surface HDFS can upload files normally and the MapReduce example job runs.

Check whether your yarn-env.sh configures JMX, then check whether there is a nodemanager entry under configs in your Prometheus. If both are present, check whether your YARN process is the one you newly started or a leftover process from a previous installation.
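The first of these checks can be scripted. The sample below writes a yarn-env.sh containing the kind of javaagent line a JMX-enabled install typically carries (the jar path and port numbers are assumptions, not DataSophon's exact values), then greps for it; to check a real install, point CONF at /datasophon/hadoop-3.3.6/etc/hadoop instead:

```shell
#!/bin/sh
set -eu

# Sample config dir; replace with /datasophon/hadoop-3.3.6/etc/hadoop to
# check a real installation.
CONF=$(mktemp -d)
cat > "$CONF/yarn-env.sh" <<'EOF'
# Illustrative only: how a JMX-exporter-enabled yarn-env.sh typically
# attaches the javaagent (jar path and ports are assumptions).
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -javaagent:/opt/jmx_prometheus_javaagent.jar=9323:/opt/prometheus_config.yml"
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -javaagent:/opt/jmx_prometheus_javaagent.jar=9324:/opt/prometheus_config.yml"
EOF

if grep -q javaagent "$CONF/yarn-env.sh"; then
  result="JMX javaagent configured"
else
  result="JMX javaagent MISSING from yarn-env.sh"
fi
echo "$result"
```

The second and third checks amount to a grep for nodemanager under Prometheus's configs directory and a look at the start time of the running ResourceManager (e.g. ps -o pid,lstart,cmd) to rule out a leftover process.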

@yuhuang123456
Author

> You did not use DDP within the scope
>
> Is version 3.3.3 mandatory? I saw that some people seem to have used 3.3.6.
>
> Without JMX monitoring added, the status may not be collected, and errors are displayed.
>
> The ranger-hdfs-plugin directory, which sits alongside jmx, was also copied over. On the surface HDFS can upload files normally and the MapReduce example job runs.
>
> Check whether your yarn-env.sh configures JMX, then check whether there is a nodemanager entry under configs in your Prometheus. If both are present, check whether your YARN process is the one you newly started or a leftover process from a previous installation.

Thanks.

Projects
None yet
Development

No branches or pull requests

3 participants