
Training on an H800 makes the loss increase abnormally until it becomes NaN? #2406

Open
jphtd opened this issue Dec 12, 2024 · 1 comment
Comments


jphtd commented Dec 12, 2024

I wonder whether anyone has trained with an H800.

On my personal computer (Windows, RTX 3060, Python 3.10, CUDA 11.7, torch 2.3) training works fine: after twenty-odd epochs, loss_mel drops from its initial value of 31 to around 15, and the resulting pth and index files give normal inference results.

The compute server runs Linux with H800 GPUs. I have tried Python 3.9 and 3.10, CUDA 12.2, and torch 2.1 through 2.5, with anywhere from 1 to 4 H800s, using the same data and configuration as above (essentially unmodified after git clone). During training, however, loss_mel gradually climbs from 30 to 50–60 and then becomes NaN. Inference with the resulting pth and index files produces almost nothing but buzzing.

Following answers to other issues, I changed fp16_run to false, which did not help. Reducing the initial learning_rate avoids NaN within 200 epochs, but the audio inferred from the resulting pth is still full of static and buzzing.
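For reference, this is the kind of minimal check I have been running to pin down where the numbers blow up. It is my own diagnostic sketch around a generic PyTorch training step, not code from this repo; model, batch, and optimizer are placeholders for the real objects in the training script:

```python
import torch

# Hopper GPUs run fp32 matmuls in TF32 by default; turning that off is only a
# way to rule precision in or out, not a proposed fix.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

def check_finite(name, tensor):
    # Raise as soon as any NaN/inf shows up, so the failing step is visible.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"non-finite values in {name}")

def train_step(model, batch, optimizer):
    optimizer.zero_grad()
    loss = model(batch)                 # placeholder for the real loss computation
    check_finite("loss", loss)          # stop at the first step the loss goes NaN/inf
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None:
            check_finite(f"grad of {name}", p.grad)  # catch exploding gradients early
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```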

Is this a CUDA version problem, a PyTorch version problem, or a GPU problem? Has anyone hit the same issue with an H800?

Author

jphtd commented Dec 14, 2024

On the compute server, training on the CPU alone (without the GPU) also works normally; the problem above only appears once the GPU is involved.
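A quick sanity check I plan to try next (my own sketch, not from the repo): run the same forward pass on the CPU and on the H800 and compare the outputs, to see whether the GPU path alone already diverges numerically.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(512, 512)
x = torch.randn(64, 512)

out_cpu = layer(x)                                # reference result on CPU
out_gpu = layer.to("cuda")(x.to("cuda")).cpu()    # same weights and input on the H800

# Differences around 1e-6 to 1e-5 are normal; large gaps or NaN here would
# point at the GPU / driver / CUDA stack rather than the training code.
print("max abs diff:", (out_cpu - out_gpu).abs().max().item())
```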
