[🐛BUG] Inconsistent negative samples when sequential recommendation is evaluated with sampled negatives as candidates #2077

Open
2020lwh567 opened this issue Aug 22, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@2020lwh567

Describe the bug
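In sequential models (e.g. SASRec, GRU4Rec), suppose the training and test sets are split leave-one-out and, at test time, 100 negative samples are drawn at random and ranked together with the single positive sample (i.e. eval_args = {'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'group_by': 'user', 'mode': {'valid': 'uni100', 'test': 'uni100'}}). Following the code execution order collate_fn -> dataloader -> sample, the same positive sample is paired with different negative samples in different epochs.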

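As a result, when different models are compared on the same dataset, each model may see a test set of different difficulty, which biases the experimental results. A model that is actually weak may happen to draw easy negatives and end up with an inflated score. In addition, while a single model is trained on a single dataset, the validation set differs in difficulty at every evaluation, so the best final result may come from having drawn easy negatives rather than from the model itself being better.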

One possible improvement: could the 100 negatives sampled for each positive example be saved during data preprocessing, so that the test set stays consistent?
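The idea above can be sketched as follows. This is a minimal illustration in plain Python, not RecBole code: the function name `presample_negatives`, the toy data, the seed value, and the assumption that item ids run from 1 to `n_items` are all hypothetical.

```python
import random


def presample_negatives(test_positives, user_history, n_items, n_neg=100, seed=2020):
    """Draw a fixed set of negatives for every (user, positive item) pair.

    test_positives: dict user_id -> held-out positive item id
    user_history:   dict user_id -> set of items the user has interacted with
    n_items:        item ids are assumed to be 1..n_items (an assumption of this sketch)
    """
    rng = random.Random(seed)  # private generator, independent of training randomness
    fixed = {}
    for user in sorted(test_positives):  # fixed iteration order keeps draws deterministic
        pos = test_positives[user]
        seen = user_history[user] | {pos}
        candidates = [i for i in range(1, n_items + 1) if i not in seen]
        fixed[(user, pos)] = rng.sample(candidates, n_neg)  # distinct negatives
    return fixed


# Toy data for illustration only: 3 users, 500 items.
test_positives = {0: 17, 1: 256, 2: 403}
user_history = {0: {3, 17, 42}, 1: {5, 256}, 2: {403, 404, 405}}

negatives = presample_negatives(test_positives, user_history, n_items=500)
```

The resulting dictionary could be written to disk once during preprocessing and reloaded for every evaluation, so each positive example is always ranked against the same 100 candidates regardless of model or epoch.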

To Reproduce
Run any sequential model (e.g. BERT4Rec, SASRec, GRU4Rec) with eval_args = {'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'group_by': 'user', 'mode': {'valid': 'uni100', 'test': 'uni100'}}.

Desktop (please complete the following information):

  • OS: [Linux]
  • RecBole Version [1.2.0]
  • Python Version [3.8.19]
  • PyTorch Version [2.1.1]

Many thanks to the authors for maintaining this project! I look forward to your reply and would greatly appreciate it.

@2020lwh567 2020lwh567 added the bug Something isn't working label Aug 22, 2024
@TayTroye
Collaborator

@2020lwh567
Hello, as long as the seed value is kept the same, the negative samples drawn are the same across different models.
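The general behavior of a seeded generator can be illustrated with a small standalone sketch. It uses Python's standard `random` module rather than RecBole's sampler, and the helper `sample_negatives`, the seed, and the item counts are hypothetical. Under these assumptions, two generators created with the same seed return the same negatives, while successive draws from one generator (for example, one draw per evaluation round) differ from each other.

```python
import random


def sample_negatives(rng, positive, n_items, n_neg=100):
    """Uniformly sample n_neg distinct item ids in 1..n_items, excluding the positive."""
    candidates = [i for i in range(1, n_items + 1) if i != positive]
    return rng.sample(candidates, n_neg)


# Two separate runs that start from the same seed draw identical negatives.
run_a = sample_negatives(random.Random(2020), positive=17, n_items=500)
run_b = sample_negatives(random.Random(2020), positive=17, n_items=500)
same_seed_match = run_a == run_b

# Successive draws from a single generator advance its state, so they differ.
rng = random.Random(2020)
epoch_1 = sample_negatives(rng, positive=17, n_items=500)
epoch_2 = sample_negatives(rng, positive=17, n_items=500)
epochs_match = epoch_1 == epoch_2
```

Whether two runs stay aligned in practice depends on both consuming the same number of random draws before each evaluation, which this sketch does not verify for RecBole.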


2 participants