Duplicate/inconsistent records with same user id and timestamp #6

Open
xiaoqtcd opened this issue Nov 22, 2021 · 6 comments

Comments

@xiaoqtcd

Hi. I found many duplicate records with the same user ID and timestamp in KT1 and KT3. For example, for user u1 in KT1, two records share the timestamp 1567140388553:
1567140388553,219,q10649,a,57500
1567140388553,219,q10648,b,57500

Another example in KT3 for user u1 with timestamp 1567115277665:
1567115277665,respond,q4790,sprint,b,mobile
1567115277665,respond,q4790,sprint,b,mobile

The first example is very confusing because the two records carry different questions and different user responses at the same timestamp. Moreover, it does not seem possible to reconstruct the KT1 records from the KT3 data, because the recorded timestamps are inconsistent. I am wondering whether there are known issues in the dataset. Is there any way to get a cleaner version? Many thanks!
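One way to check how widespread this is: a minimal sketch that groups a user's KT1 rows by timestamp and separates byte-identical repeats from rows that share a timestamp but differ in content. The column order (`timestamp, solving_id, question_id, user_answer, elapsed_time`) and the header line are assumptions inferred from the rows quoted above, and `find_timestamp_collisions` is a hypothetical helper, not part of EdNet. Note that the two KT1 rows above share a `solving_id`, so they may simply be two questions answered in one session rather than corrupted data; that is a guess, not something confirmed here.

```python
import csv
import io
from collections import defaultdict


def find_timestamp_collisions(lines):
    """Group KT1-style rows by timestamp and classify the collisions."""
    by_timestamp = defaultdict(list)
    for row in csv.reader(lines):
        # Skip blank lines and the (assumed) header line.
        if not row or row[0] == "timestamp":
            continue
        by_timestamp[row[0]].append(tuple(row))

    exact_duplicates = {}  # timestamp -> rows that appear more than once verbatim
    differing_rows = {}    # timestamp -> distinct rows sharing that timestamp
    for ts, rows in by_timestamp.items():
        if len(rows) < 2:
            continue
        distinct = sorted(set(rows))
        if len(distinct) < len(rows):
            exact_duplicates[ts] = distinct
        if len(distinct) > 1:
            differing_rows[ts] = distinct
    return exact_duplicates, differing_rows


# The two KT1 rows quoted in this issue, with an assumed header.
sample = io.StringIO(
    "timestamp,solving_id,question_id,user_answer,elapsed_time\n"
    "1567140388553,219,q10649,a,57500\n"
    "1567140388553,219,q10648,b,57500\n"
)
exact, differing = find_timestamp_collisions(sample)
```

For a real user file, `open(path, newline="")` can be passed in place of the `StringIO` object. The KT3 example would land in `exact_duplicates` instead, since its two rows are identical.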

@kwonmha

kwonmha commented Jan 20, 2022

Hello, @xiaoqtcd
I'm trying to reproduce the SAINT model on the KT1 dataset and got a worse AUC than the values reported in papers such as LPKT, SAINT+, and SAINT.
As my code worked fine with the Kaggle Riiid dataset, I guess my results are caused by the unclean state of the dataset.
What AUC or ACC do you get on the KT1 dataset?
Are they good enough?

@xiaoqtcd
Author

Hi @kwonmha, I am mostly doing unsupervised learning at the moment, so I don't have AUC or ACC results. But I did some data analysis and found that there are many problems in the dataset. What is LPKT? Are you using the code for SAINT and SAINT+ provided by Riiid? I found that it's actually not that straightforward even to reconstruct the Kaggle Riiid dataset's format from the raw EdNet dataset. Could you explain a bit how you did that?

@kwonmha

kwonmha commented Jan 20, 2022

Hi, @xiaoqtcd
LPKT is the model proposed in "Learning Process-consistent Knowledge Tracing" (KDD '21).

I used SAINT implementations written by participants of the Kaggle Riiid competition,
and modified the code to handle the KT1 dataset instead of the competition dataset.
I didn't reconstruct KT1 into the Riiid format.
I think the two are similar enough that feeding KT1 data into a SAINT implementation written for the Kaggle dataset requires only a few code modifications (selecting columns, or comparing answers to determine whether they are correct).
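The "compare answers" step could look roughly like the sketch below. It assumes the correct answer for each question is available as a mapping from `question_id` to answer letter (for instance, loaded from a question metadata file); `label_correctness` is a hypothetical helper and the answer key shown is made up for illustration, not taken from EdNet.

```python
def label_correctness(interactions, correct_answers):
    """Return (question_id, answered_correctly) pairs: 1 for a match, 0 otherwise."""
    labeled = []
    for row in interactions:
        key = correct_answers.get(row["question_id"])
        if key is None:
            # Question missing from the metadata: skip it rather than guess a label.
            continue
        labeled.append((row["question_id"], int(row["user_answer"] == key)))
    return labeled


# Hypothetical answer key; real values would come from the dataset's question metadata.
answer_key = {"q10649": "a", "q10648": "c"}

# Only the two columns needed for the label are kept from each KT1 row.
interactions = [
    {"question_id": "q10649", "user_answer": "a"},
    {"question_id": "q10648", "user_answer": "b"},
]
labels = label_correctness(interactions, answer_key)
```

The resulting 0/1 label plays the role that `answered_correctly` plays in the Kaggle data.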

@xiaoqtcd
Author

Hi @kwonmha, thanks a lot for pointing out the LPKT paper. It's an interesting one. But note that in the Kaggle challenge dataset, prior_question_had_explanation and prior_question_elapsed_time are given, while in EdNet, I think, they need to be reconstructed from KT1 and KT3 together. task_container_id needs to be reconstructed as well. I am not sure whether cleaning EdNet plays an important part in accuracy, but I believe reconstructing the data format correctly is very important.
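To make the reconstruction concrete, here is a rough sketch of one possible approach for two of those fields, using KT1 alone. It rests on several assumptions that are not confirmed in this thread: that consecutive rows sharing a `solving_id` form one container, that `elapsed_time` is a per-question value, and that prior_question_elapsed_time is the mean elapsed time of the previous container (empty for the first one). `add_container_features` and the sample rows are hypothetical. prior_question_had_explanation is not covered, since it would need the explanation events from KT3.

```python
def add_container_features(rows):
    """Annotate chronologically ordered rows with container-level features.

    Each row is a dict with 'solving_id' and 'elapsed_time' (milliseconds).
    """
    out = []
    container_id = -1
    last_solving_id = None
    prior_mean = None   # mean elapsed time of the previous container
    current_times = []  # elapsed times collected for the current container
    for row in rows:
        if row["solving_id"] != last_solving_id:
            # A new container starts: close the previous one first.
            if current_times:
                prior_mean = sum(current_times) / len(current_times)
            current_times = []
            container_id += 1
            last_solving_id = row["solving_id"]
        current_times.append(row["elapsed_time"])
        out.append({
            **row,
            "task_container_id": container_id,
            "prior_question_elapsed_time": prior_mean,
        })
    return out


# Hypothetical rows for one user, already sorted by timestamp.
rows = [
    {"solving_id": 1, "elapsed_time": 20000},
    {"solving_id": 2, "elapsed_time": 30000},
    {"solving_id": 2, "elapsed_time": 30000},
    {"solving_id": 3, "elapsed_time": 10000},
]
features = add_container_features(rows)
```

Whether this matches how the Kaggle fields were actually produced would have to be checked against the competition's data description.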

@kwonmha

kwonmha commented Jan 24, 2022

@xiaoqtcd
As the SAINT model doesn't take prior_question_had_explanation or prior_question_elapsed_time as input, I didn't try to reconstruct KT1 into the Kaggle format.
I didn't use them when testing SAINT on the Kaggle dataset, and I don't want to use them when testing on KT1 data.

@xiaoqtcd
Author

I see. I am not familiar with SAINT, but for SAINT+ it seems to me that the temporal information and some other information need to be reorganized to prepare the embeddings used in the model, so I thought you were doing that. There are many different implementations of SAINT+ on Kaggle, and it seems the authors from Riiid are on Kaggle as well. We can connect on Kaggle and discuss further if you'd like.
