add util for loss spike save and decode. #1044

haikuotiankong1212 · 2024-03-21T09:35:22Z

What changes were proposed in this pull request?

Provides a utility for recording and parsing loss spikes.

Why are the changes needed?

So that more people who train models can use it.

codecov · 2024-03-21T10:13:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.25%. Comparing base (3157af7) to head (dd27cbf).
Report is 244 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1044      +/-   ##
==========================================
- Coverage   78.53%   78.25%   -0.29%     
==========================================
  Files         187      191       +4     
  Lines       17336    17784     +448     
==========================================
+ Hits        13615    13916     +301     
- Misses       3721     3868     +147

atorch/atorch/utils/loss_spike_utils.py

workingloong · 2024-03-22T06:45:01Z

Please format your commits so that they pass the atorch-pre-commit check.

skydoorkai · 2024-04-12T08:35:48Z

atorch/docs/README-LOSS-SPIKE-UTIL.md

Please write the document in English and in a readable format.

skydoorkai · 2024-04-12T08:44:35Z

atorch/atorch/utils/loss_spike_utils.py

+
+
+class TokenLossSpike(LossSpikeBase):
+    def save_loss(self, file_name, cur_loss, cur_iter, losses_str, sample_infos_str):


What is the relationship between cur_loss and losses_str, or between cur_iter and sample_infos_str?
What do losses_str and sample_infos_str mean in model training?

skydoorkai · 2024-04-12T08:48:19Z

atorch/atorch/utils/loss_spike_utils.py

+            data = tokenizer.decode(data)
+        return ds, data, max_loss
+
+    def fetch(self, each_sample_info):


Since fetch is a user-defined method, either:
define it in the base class as an abstract method,
or
have the class initializer accept a parameter (fetch_func) that is provided by the user.
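To make the two options in this comment concrete, here is a minimal sketch. It does not reproduce the PR's actual code: `SketchLossSpikeBase`, `UpperCaseSpike`, and `CallbackLossSpike` are hypothetical stand-ins, and the real constructor arguments of `LossSpikeBase` are not shown in this thread, so they are omitted. Only the names `fetch`, `each_sample_info`, and `fetch_func` come from the review.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable


class SketchLossSpikeBase(ABC):
    """Option 1 (hypothetical stand-in): the base class declares fetch as abstract."""

    @abstractmethod
    def fetch(self, each_sample_info: str) -> Any:
        """Return the sample data described by each_sample_info."""


class UpperCaseSpike(SketchLossSpikeBase):
    """A user subclass must override fetch before it can be instantiated."""

    def fetch(self, each_sample_info: str) -> Any:
        # Placeholder lookup for illustration; a real implementation
        # would read the sample from the user's dataset.
        return each_sample_info.upper()


class CallbackLossSpike:
    """Option 2 (hypothetical stand-in): the user passes fetch_func at init."""

    def __init__(self, fetch_func: Callable[[str], Any]):
        if not callable(fetch_func):
            raise TypeError("fetch_func must be callable")
        self.fetch_func = fetch_func

    def fetch(self, each_sample_info: str) -> Any:
        # Delegate to the user-supplied function.
        return self.fetch_func(each_sample_info)
```

With option 1, forgetting to implement `fetch` fails at instantiation time with a `TypeError`. With option 2, no subclass is needed, and the check happens when the instance is constructed.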

BalaBalaYi · 2024-10-12T02:07:56Z

Does this PR still need updates and merging? If so, please reply. This PR will be closed by the end of the month if there is no response. Thanks a lot.

add util for loss spike save and decode.

a64a0b2

workingloong reviewed Mar 22, 2024

workingloong requested a review from skydoorkai March 22, 2024 03:00

hktk added 5 commits March 27, 2024 14:21

Add comments.

a077103

Format fix.

8b19ab5

Format fix 2.

5419ccf

Improve comments.

5e395ba

Add readme.md.

e66915f

skydoorkai reviewed Apr 12, 2024

Optimize code structure.

dd27cbf
