Training not converging with default settings #15

Open
AmarHek opened this issue Jul 5, 2023 · 3 comments

AmarHek commented Jul 5, 2023

Hi,
we're from the University of Wuerzburg and are trying to replicate your project for German report data.
For now, we simply tried to get your code running and training on MIMIC, both with the default settings provided and with the settings reported in your paper. Of course, we made sure to use the same package versions as in the project.

However, the loss quickly becomes NaN after a few iterations. As a first step, we trained on subsamples of the dataset: for a very small subset (~300 images) training does converge, but even with 1000 images the loss does not decrease. We also tried several different learning rates and other hyperparameters, but nothing has helped so far.
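For reference, a minimal guard like the sketch below (assuming a standard PyTorch training loop; `model`, `loader`, and `optimizer` stand in for the repo's own objects) would at least surface the step at which the loss becomes non-finite:

```python
# Minimal sketch: abort as soon as the loss stops being finite so the
# offending batch/step can be inspected. All names here are placeholders
# for the repo's own model, data loader and optimizer.
import math
import torch

def train_one_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    for step, (images, reports) in enumerate(loader):
        optimizer.zero_grad()
        loss = model(images.to(device), reports)  # however the repo computes its loss
        if not math.isfinite(loss.item()):
            raise RuntimeError(f"Non-finite loss {loss.item():.4f} at step {step}")
        loss.backward()
        # Optional: gradient clipping is a common knob when chasing exploding losses.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```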

I was hoping you might be familiar with this problem and could give us some advice.

Thanks in advance!

@felipezeiser

Hi AmarHek,

Did you solve this issue? I'm trying to run the code and am running into the same problem.

Thanks.

AmarHek (Author) commented Feb 2, 2024

Hi felipezeiser!

We actually did solve it; it turned out to be a big mistake on our side.
Our GitHub repo was set up to use LFS and we had the reports inside the repo, which meant the files on disk were just Git LFS pointer files rather than the actual report text.
Once we trained on the proper reports, we had no problems running the code.
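In case it helps anyone hitting the same thing: an un-fetched LFS file is a small text pointer whose first line references the LFS spec, so a quick check along these lines (the reports directory and `*.txt` pattern are placeholders for the real layout) shows whether the "reports" on disk are real text or just pointers:

```python
# Rough sanity check for the Git LFS failure mode described above: an
# un-fetched LFS file is a small pointer whose first line starts with the
# LFS spec URL instead of the actual report text.
# "data/reports" and the *.txt pattern are placeholders for the real layout.
from pathlib import Path

LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1"

def find_lfs_pointers(reports_dir: str):
    pointers = []
    for path in Path(reports_dir).rglob("*.txt"):
        with open(path, "r", errors="ignore") as f:
            first_line = f.readline().strip()
        if first_line.startswith(LFS_POINTER_PREFIX):
            pointers.append(path)
    return pointers

if __name__ == "__main__":
    bad = find_lfs_pointers("data/reports")
    print(f"{len(bad)} report files are still LFS pointers (run `git lfs pull` if > 0)")
```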

Maybe you have a similar issue on your end, fingers crossed!

Kind regards
Amar

@felipezeiser

Thank you very much for the quick response.

Unfortunately it is not the same problem: we keep all cases on a secondary hard drive in the cluster, and the reports appear to be passed correctly to the TextEncoder.

If you don't mind a few more questions: what batch size did you use? And did you evaluate any parameters other than the defaults?
