Duplicate/inconsistent records with same user id and timestamp #6
Comments
Hello, @xiaoqtcd
Hi @kwonmha, I am mostly doing unsupervised learning at the moment, so I don't have AUC or ACC results. But I did some data analysis and found that the dataset has many problems. What is LPKT? Are you using the SAINT and SAINT+ code provided by Riiid? I found that it's actually not that straightforward even to reconstruct the Kaggle Riiid dataset's format from the raw EdNet dataset. Could you explain a bit how you did that?
Hi, @xiaoqtcd. I used SAINT models implemented by participants of the Kaggle Riiid competition.
Hi @kwonmha, thanks a lot for pointing out the LPKT paper. It's an interesting one. But note that in the Kaggle challenge dataset, prior_question_had_explanation and prior_question_elapsed_time are given, while in EdNet, I think, they need to be reconstructed from KT1 and KT3 together. task_container_id needs to be reconstructed as well. I am not sure whether cleaning EdNet matters much for accuracy, but I believe reconstructing the data format correctly is very important.
@xiaoqtcd
I see. I am not familiar with SAINT, but for SAINT+ it seems to me that the temporal information and some other fields need to be reorganized to prepare the embeddings used in the model, so I thought you were doing that. There are many different implementations of SAINT+ on Kaggle, and it seems the authors from Riiid are on Kaggle as well. We can connect on Kaggle and discuss further if you'd like.
Hi. I found that there are many duplicate records with the same user ID and timestamp in KT1 and KT3. For example, for user u1 in KT1, there are two records with the same timestamp 1567140388553:
1567140388553,219,q10649,a,57500
1567140388553,219,q10648,b,57500
Another example in KT3 for user u1 with timestamp 1567115277665:
1567115277665,respond,q4790,sprint,b,mobile
1567115277665,respond,q4790,sprint,b,mobile
The first example is very confusing because it shows different user responses to different questions at the same timestamp. Moreover, it does not seem possible to reconstruct the KT1 records from the KT3 data, because the recorded timestamps are inconsistent. I am wondering whether there are known issues in the dataset. Is there any way to get a cleaner version? Many thanks!
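For anyone who wants to check their own copy, here is a minimal sketch of how the two cases above can be told apart. It uses only the rows quoted in this issue and assumes the timestamp is the first comma-separated field. The helper names `group_by_timestamp` and `classify` are made up for illustration and are not part of any EdNet tooling. Rows that are identical in every field are reported as exact duplicates. Rows that share a timestamp but differ elsewhere are reported as conflicting.

```python
import csv
import io
from collections import defaultdict


def group_by_timestamp(text):
    """Map each timestamp (assumed to be the first CSV field) to its rows."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if row:  # skip blank lines
            groups[row[0]].append(tuple(row))
    return groups


def classify(groups):
    """Split repeated timestamps into exact duplicates and conflicting rows."""
    exact, conflicting = {}, {}
    for ts, rows in groups.items():
        if len(rows) < 2:
            continue  # timestamp occurs once, nothing to report
        if len(set(rows)) == 1:
            exact[ts] = rows  # every field identical
        else:
            conflicting[ts] = rows  # same timestamp, at least one field differs
    return exact, conflicting


# Rows copied from the examples in this issue (user u1).
KT1_SAMPLE = """1567140388553,219,q10649,a,57500
1567140388553,219,q10648,b,57500
"""
KT3_SAMPLE = """1567115277665,respond,q4790,sprint,b,mobile
1567115277665,respond,q4790,sprint,b,mobile
"""

kt1_exact, kt1_conflicting = classify(group_by_timestamp(KT1_SAMPLE))
kt3_exact, kt3_conflicting = classify(group_by_timestamp(KT3_SAMPLE))

print("KT1:", len(kt1_exact), "exact,", len(kt1_conflicting), "conflicting")
print("KT3:", len(kt3_exact), "exact,", len(kt3_conflicting), "conflicting")
```

On these samples the KT1 pair is classified as conflicting and the KT3 pair as an exact duplicate. To scan a real per-user file, the file contents can be passed to `group_by_timestamp` in place of the sample strings.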