2021Tencent Rhino-bird Open-source Training Program—Angel--刘倩 #100

xiaoSUM opened this issue Aug 5, 2021 · 0 comments

xiaoSUM commented Aug 5, 2021

I. Angel algorithm example

1.1 LR-spark-on-angel output

[4 screenshots of the LR-spark-on-angel output]

1.2 Debug

1. Version problem with netty-all-4.1.1.Final.jar and json4s-jackson_2.11-3.4.2.jar

Edit the pom files of angel-ps and spark-on-angel so that both use the versions above.
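
After rebuilding, it can help to confirm that only one version of each jar ended up in the packaged lib directory. The following is a minimal sketch, not part of the original fix: the function names are hypothetical, and the directory you pass in is whatever lib folder your build produces.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for spotting duplicate jar versions in one directory.

# List every jar in <lib_dir> whose name is "<artifact_prefix>-<version>.jar".
# Usage: list_jar_versions <lib_dir> <artifact_prefix>
list_jar_versions() {
  local lib_dir="$1" prefix="$2"
  find "$lib_dir" -maxdepth 1 -type f -name "${prefix}-[0-9]*.jar" \
    -exec basename {} \; | sort
}

# Succeed only when exactly one version of the artifact is present.
# Usage: has_single_version <lib_dir> <artifact_prefix>
has_single_version() {
  local count
  count=$(list_jar_versions "$1" "$2" | wc -l | tr -d ' ')
  [ "$count" -eq 1 ]
}
```

For example, `has_single_version "$LIB_DIR" netty-all || list_jar_versions "$LIB_DIR" netty-all` prints the competing jars when more than one version is found (`LIB_DIR` is a placeholder for your own path).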

2. Software versions with which the project ran successfully

apache-maven-3.8.1
hadoop-2.7.2
jdk1.8.0_161
protobuf-2.5.0
scala-2.11.8
spark-2.3.0-bin-hadoop2.7
angel-2.4.0-bin
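
To compare a machine against the list above, the installed tools can be queried one by one. This is only a sketch with hypothetical function names; it assumes the tools are on `PATH` and prints the first line of whatever each tool reports. The Spark and Angel versions are easiest to read from their install directory names.

```shell
#!/usr/bin/env bash
# Print the first line of a tool's version output,
# or "<tool>: not found" when the tool is not on PATH.
# Usage: tool_version <tool> [args that make the tool print its version]
tool_version() {
  local tool="$1"; shift
  if command -v "$tool" >/dev/null 2>&1; then
    "$tool" "$@" 2>&1 | head -n 1
  else
    echo "$tool: not found"
  fi
}

# Query the build and runtime tools named in the version list.
report_toolchain() {
  tool_version mvn -v
  tool_version hadoop version
  tool_version java -version
  tool_version protoc --version
  tool_version scala -version
}
```

Calling `report_toolchain` prints one line per tool.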

II. PyTorch on Angel algorithm example

1.1 deepfm for torch on angel output

http://hadoop001:8088/cluster/apps
[3 screenshots of the deepfm run]

1.2 Debug

1. cmake error

[screenshot of the cmake error]
Add the following line to the Dockerfile:
ENV Torch_DIR=/opt/libtorch/share/cmake/Torch
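
CMake's `find_package(Torch)` looks in `Torch_DIR` for a file named `TorchConfig.cmake`, so a quick check inside the container can tell whether the variable points at the right place. The sketch below is an assumption about how one might verify this; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed when the given directory (or $Torch_DIR when no argument is passed)
# contains TorchConfig.cmake, the file find_package(Torch) searches for.
torch_dir_ok() {
  local dir="${1:-${Torch_DIR:-}}"
  [ -n "$dir" ] && [ -f "$dir/TorchConfig.cmake" ]
}
```

For example, `torch_dir_ok /opt/libtorch/share/cmake/Torch || echo "TorchConfig.cmake not found"` can be run before invoking cmake.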

2. PyTorch and torchvision versions do not match

Add torchvision=0.4.2 to the Dockerfile.

3. spark-submit submission script

source /home/liuqian/angel/angel/dist/target/angel-2.4.0-bin/bin/spark-on-angel-env.sh

4. Memory problem

[screenshot of the error]
Raise yarn.scheduler.capacity.maximum-am-resource-percent to 0.6.
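
This property lives in Hadoop's capacity-scheduler.xml. To confirm which value a given file actually holds, a small reader like the one below can be used. It is a sketch under the assumption that `<name>` and `<value>` each sit on their own line, as in the stock config files; the function name is hypothetical and it is not a general XML parser.

```shell
#!/usr/bin/env bash
# Print the <value> of a named property from a Hadoop-style XML config file.
# Assumes each <name> and <value> element is on its own line.
# Usage: hadoop_property <file> <property_name>
hadoop_property() {
  local file="$1" name="$2"
  awk -v name="$name" '
    index($0, "<name>" name "</name>") { found = 1; next }
    found && /<value>/ {
      gsub(/.*<value>|<\/value>.*/, "")   # strip the tags and indentation
      print
      exit
    }
  ' "$file"
}
```

For example, `hadoop_property "$HADOOP_CONF_DIR/capacity-scheduler.xml" yarn.scheduler.capacity.maximum-am-resource-percent` prints the configured value. After editing the file, `yarn rmadmin -refreshQueues` asks the ResourceManager to reload the queue configuration.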

5. Unreasonable memory allocation in the submission script

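[screenshot of the error]
PS log:
[screenshot of the PS log]
Switched to a machine with more physical memory and set up the same environment as before. The YARN settings and the submission script are as follows:
[2 screenshots: YARN settings and submission script]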

xiaoSUM changed the title from "犀牛鸟angel实战-刘倩" to "2021Tencent Rhino-bird Open-source Training Program—Angel" on Aug 12, 2021
xiaoSUM changed the title from "2021Tencent Rhino-bird Open-source Training Program—Angel" to "2021Tencent Rhino-bird Open-source Training Program—Angel--刘倩" on Aug 16, 2021