Memory management in training process #7
Comments
Hello,

Thank you for your interest in our work. It is true that the version I implemented may not be memory efficient (including weight handling and dataset loading). Have you run your implementation? Did the final results differ greatly? Or is there a 'merging coefficients not being updated' problem, as you mentioned? I briefly modified the code along the lines of the construction you described; it works, and the merge coefficients do change. However, I did not complete all steps/epochs due to time constraints. The way I modified it is as follows:
The results are as follows:
In short, you can run your code, and if the results are not significantly different, your way seems to be the more memory-efficient implementation. The implementation logic looks consistent to me.

Sincerely,
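For readers following the discussion, here is a minimal sketch of what the trainable merge coefficients refer to; the function and variable names (`merge_layerwise`, `pretrained_params`, `task_vectors`, `lambdas`) are illustrative placeholders, not the repository's actual identifiers:

```python
import torch

def merge_layerwise(pretrained_params, task_vectors, lambdas):
    """pretrained_params: list of L tensors (one per layer/parameter group)
    task_vectors: list of K lists, each holding L tensors (finetuned minus pretrained)
    lambdas: tensor of shape (L, K), the trainable layer-wise merge coefficients
    """
    merged = []
    for l, p in enumerate(pretrained_params):
        # Each layer gets its own weighted combination of the K task vectors.
        delta = sum(lambdas[l, k] * task_vectors[k][l] for k in range(len(task_vectors)))
        merged.append(p + delta)
    return merged

# The coefficients are the only trainable parameters, e.g.:
# lambdas = torch.nn.Parameter(torch.full((num_layers, num_tasks), 0.3))
```

Only `lambdas` receives gradients; the backbone weights and the task vectors stay frozen.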
Thank you for the prompt response. It seems to be working, but due to resource constraints, I used a batch size of 8 and a learning rate of 4e-5. (I'm trying to reproduce ViT-L-14 & AdaMerging layerwise++ using only 1 step.) The result is as below:
By the way, it seems that the performance results for ViT-L-14, layerwise++, at 0.1% or 1% were not included in the paper. If possible, could you please share any performance data that the authors obtained for these configurations? I would greatly appreciate it.
Hello,

Cool, your results are an improvement over Task Arithmetic (84.5%) and Ties-Merging (86.0%) on ViT-L-14. By the way, if you just want to evaluate our method, you can call the trained merge coefficients (merging_cofficient.py) directly. In the paper, we only used the 0.1% or 1% evaluation for ViT-B-32, not for ViT-L-14. The configuration of ViT-B-32 during evaluation is as follows:
Original version:
Modified version:
Original version:
Modified version:
Original version:
Modified version:
I wish you all the best.

Sincerely,
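If only evaluation with the released coefficients is needed, as suggested above, the unsupervised training loop can be skipped entirely: build the merged model once from the stored coefficients and run a plain accuracy pass. A generic sketch under that assumption; the batch dictionary keys mirror the training snippet quoted later in this thread, and nothing here is the repository's actual evaluation code:

```python
import torch

@torch.no_grad()
def evaluate(model, dataloader, device="cuda"):
    # Plain top-1 accuracy. With fixed merge coefficients there is no entropy
    # objective and no backward pass, so memory pressure is far lower than in
    # the coefficient-training loop discussed in this issue.
    model.eval()
    correct, total = 0, 0
    for batch in dataloader:
        x = batch["images"].to(device)
        y = batch["labels"].to(device)
        logits = model(x)
        correct += (logits.argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```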
Thank you for the detailed explanation. I'll refer to the code you provided and run more experiments. Thank you!
Hello,

```python
for epoch in range(epochs):
    losses = 0.
    adamerging_mtl_model.loading_weights()  # merge and load the weights once per epoch
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess,
                              location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)
        for i, data in enumerate(tqdm.tqdm(dataloader)):
            data = maybe_dictionarize(data)
            x = data['images'].to(args.device)
            y = data['labels'].to(args.device)  # labels are not used by the entropy objective
            with autocast():
                outputs = adamerging_mtl_model(x, dataset_name)
                loss = softmax_entropy(outputs).mean(0)  # per-batch mean entropy
            losses += loss  # accumulate across batches and datasets
            if i > 0:  # Execute only one step (breaks after the second batch)
                break
    # (the backward pass and coefficient update follow after the dataset loop in the full code)
```
I believe this code is actually performing two steps instead of one. The break condition ends the loop when i = 1, which means two steps have already been completed by that point. Is that correct? Thank you.
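For context, `softmax_entropy` in the snippet above is the unsupervised objective minimized on unlabeled test batches. A typical definition (an assumption in the style of test-time entropy minimization, not copied from the repository):

```python
import torch

def softmax_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Per-sample Shannon entropy of the predicted class distribution:
    # H(p) = -sum_c p_c * log p_c, with p = softmax(logits).
    # Lower entropy means more confident predictions, which is what the
    # merge coefficients are tuned to produce.
    return -(logits.softmax(dim=1) * logits.log_softmax(dim=1)).sum(dim=1)
```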
Hello,

This could be interpreted as either two steps or one step (with double the batch size), but the latter is probably more appropriate, because the loss is accumulated over two batches and no parameter update is performed in between. Note that

Thanks.
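To make the "one step with double the batch size" reading concrete: accumulating the two per-batch mean losses and then doing a single backward/step is ordinary gradient accumulation. A minimal sketch with placeholder names for the model, optimizer, and loss function; the sum of two batch means equals twice the mean over the combined 32 samples, so it matches a true batch-of-32 step up to a constant factor that the learning rate absorbs:

```python
import torch

def one_accumulated_step(model, dataloader, optimizer, loss_fn, num_batches=2):
    """One parameter update built from `num_batches` accumulated batches.
    With batch_size=16 and num_batches=2 this sees 32 samples per update,
    i.e. 'one step with double the batch size'."""
    losses = 0.
    for i, batch in enumerate(dataloader):
        losses = losses + loss_fn(model(batch)).mean()  # per-batch mean loss
        if i + 1 >= num_batches:
            break
    optimizer.zero_grad()
    losses.backward()   # gradients reflect all 32 samples
    optimizer.step()    # exactly one parameter update
    return float(losses.detach())
```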
In your paper, you mentioned using a batch size of 16. Should I understand that, in the code, the

Thank you
I think my description in the paper is not accurate enough; I am very sorry. Because I remember that I wrote

So, it can be understood that I ran 500 steps with batchsize=32, or ran 1000 steps with batchsize=16. Sorry again for your trouble.

Sincerely
For additional information for @EnnengYang: I am using an A6000 GPU with 48 GB of memory, and I run into a memory issue when processing the 4th of the 8 datasets with the provided code (batch size 16, 2 steps). It is therefore estimated that approximately 90 GB or more of GPU memory would be required, and I speculate that the author did not use this method. So I am using 16-bit floating-point precision and a single step, as below, which requires about 40 GB (batch size 16, 1 step).
Question: Could you let me know which GPU was used?
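For anyone hitting the same limit, the usual PyTorch pattern behind a "16-bit floating point" setup is autocast for the forward pass plus a GradScaler around the backward pass and optimizer step. A generic sketch under that assumption, not necessarily the exact modification used above:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # keeps loss-scaling state across steps

def amp_step(model, batches, optimizer, loss_fn, device="cuda"):
    """One update where forward passes run in reduced precision via autocast
    (roughly halving activation memory) and the loss is scaled so small
    float16 gradients do not underflow during backward."""
    losses = 0.
    for x in batches:
        with autocast():                      # half-precision forward pass
            losses = losses + loss_fn(model(x.to(device))).mean()
    optimizer.zero_grad()
    scaler.scale(losses).backward()           # scaled backward pass
    scaler.step(optimizer)                    # unscales gradients, then steps
    scaler.update()
    return float(losses.detach())
```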
For additional information for @EnnengYang and @kasurashan: I successfully reproduced the results, achieving an average score of 91.0 for layerwise++ / ViT-L-14 with a batch size of 16 and 1 step, using the code above. This was done over 800 epochs with a learning rate of 5e-3. Please take note.
Hello,
Thank you for providing the codes.
The training process in the provided code calculates the loss for each dataset and aggregates it to update the coefficients.
Therefore, the coefficients (lambdas) remain constant throughout the entire data iteration within a single epoch.
However, the original code (below) performs a computation where the parameters are loaded onto the CPU during each forward pass:
In my environment, loading these onto the CPU repeatedly caused memory issues.
Therefore, I modified the code as follows, loading the coefficient parameters into the model at the beginning of each epoch and processing the data accordingly.
In the training process:
Is there any aspect of this approach that differs from the author's intent, or could there be any other issues arising from this modification?
Thank you.
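To make the trade-off in this question explicit, here is a schematic sketch of the two strategies; `model.refresh()` is a hypothetical hook standing in for the repository's weight-merging/loading step (the merge inside forward in the original code, loading_weights in the modification):

```python
import torch

def train(model, dataloaders, optimizer, loss_fn, epochs, refresh_per_batch):
    """refresh_per_batch=True  -> original behaviour: the merged weights are
    rebuilt and loaded before every forward pass, so each batch pays the
    merge/loading cost.
    refresh_per_batch=False -> the modification: merge and load once per epoch,
    so all batches of that epoch reuse the same merged weights."""
    for _ in range(epochs):
        if not refresh_per_batch:
            model.refresh()                        # merge + load once per epoch
        losses = 0.
        for dataloader in dataloaders:
            for batch in dataloader:
                if refresh_per_batch:
                    model.refresh()                # merge + load on every batch
                losses = losses + loss_fn(model(batch)).mean()
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()                           # the coefficients change only here
```

Since the coefficients are only updated once per epoch in either variant, every batch within a given epoch sees the same coefficient values in both cases; the difference lies in where the merge/loading cost (and the associated memory traffic) is paid.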