Training not converging with default settings #15

Open
AmarHek opened this issue Jul 5, 2023 · 3 comments

AmarHek commented Jul 5, 2023

Hi,
we're from the University of Wuerzburg and are trying to replicate your project for German report data.
For now, we simply tried to get your code running and training on MIMIC, both with the default settings provided and with the settings reported in your paper. Of course, we made sure to use the same package versions as in the project.

However, the loss quickly becomes NaN after a few iterations. As a first step, we trained on subsamples of the dataset: for a very small subset (~300 images) training does converge, but even with 1000 images the loss does not decrease. We also tried several different learning rates and other hyperparameters, but nothing has helped so far.
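For reference, a minimal guard like the sketch below (assuming a standard PyTorch training loop; `model`, `loader`, and `optimizer` stand in for the repo's own objects) would at least surface the step at which the loss becomes non-finite:

```python
# Minimal sketch: abort as soon as the loss stops being finite so the
# offending batch/step can be inspected. All names here are placeholders
# for the repo's own model, data loader and optimizer.
import math
import torch

def train_one_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    for step, (images, reports) in enumerate(loader):
        optimizer.zero_grad()
        loss = model(images.to(device), reports)  # however the repo computes its loss
        if not math.isfinite(loss.item()):
            raise RuntimeError(f"Non-finite loss {loss.item():.4f} at step {step}")
        loss.backward()
        # Optional: gradient clipping is a common knob when chasing exploding losses.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```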

I was hoping you might be familiar with this problem and could give us some advice.

Thanks in advance!

@felipezeiser

Hi AmarHek,

Did you solve this issue? I'm trying to run the code and am running into the same problem.

Thanks.

AmarHek (Author) commented Feb 2, 2024

Hi felipezeiser!

We actually did solve it; it turned out to be a big mistake on our side.
Our GitHub repo was set up to use LFS and we had the reports inside the repo, which meant the files on disk were just Git LFS pointer files rather than the actual report text.
Once we trained on the proper reports, we had no problems running the code.
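In case it helps anyone hitting the same thing: an un-fetched LFS file is a small text pointer whose first line references the LFS spec, so a quick check along these lines (the reports directory and `*.txt` pattern are placeholders for the real layout) shows whether the "reports" on disk are real text or just pointers:

```python
# Rough sanity check for the Git LFS failure mode described above: an
# un-fetched LFS file is a small pointer whose first line starts with the
# LFS spec URL instead of the actual report text.
# "data/reports" and the *.txt pattern are placeholders for the real layout.
from pathlib import Path

LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1"

def find_lfs_pointers(reports_dir: str):
    pointers = []
    for path in Path(reports_dir).rglob("*.txt"):
        with open(path, "r", errors="ignore") as f:
            first_line = f.readline().strip()
        if first_line.startswith(LFS_POINTER_PREFIX):
            pointers.append(path)
    return pointers

if __name__ == "__main__":
    bad = find_lfs_pointers("data/reports")
    print(f"{len(bad)} report files are still LFS pointers (run `git lfs pull` if > 0)")
```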

Maybe you have a similar issue on your end, fingers crossed!

Kind regards
Amar

@felipezeiser

Thank you very much for the quick response.

Unfortunately it is not the same problem: we keep all cases on a secondary hard drive in the cluster, and the reports appear to be passed correctly to the TextEncoder.

If you don't mind a few more questions: what batch size did you use? And did you evaluate any parameters other than the defaults?
