Duplicate/inconsistent records with same user id and timestamp #6

xiaoqtcd · 2021-11-22T22:35:58Z

Hi. I found many duplicate records with the same user id and timestamp in KT1 and KT3. For example, for user u1 in KT1, there are two records with the same timestamp 1567140388553:
1567140388553,219,q10649,a,57500
1567140388553,219,q10648,b,57500

Another example in KT3 for user u1 with timestamp 1567115277665:
1567115277665,respond,q4790,sprint,b,mobile
1567115277665,respond,q4790,sprint,b,mobile

The first example is confusing because the two rows record different responses to different questions at the same timestamp. Moreover, it does not seem possible to reconstruct the KT1 records from the KT3 data, because the recorded timestamps are inconsistent. Are these known issues in the dataset? Is there any way to get a cleaner version? Many thanks!
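A minimal sketch of how the two kinds of collision described above could be told apart: rows that share a timestamp and are identical in every column (as in the KT3 example) versus rows that share a timestamp but differ (as in the KT1 example). The column names in the header line are an assumption about the KT1 layout, and the last two sample rows are made up for contrast; only the first two come from this report.

```python
import csv
import io
from collections import defaultdict


def group_by_timestamp(rows):
    """Map each timestamp to the list of rows that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["timestamp"]].append(row)
    return groups


def classify_collisions(rows):
    """Return (exact, conflicting) lists of timestamps that occur more than once.

    exact: every row with that timestamp is identical in all columns.
    conflicting: at least two rows with that timestamp differ somewhere.
    """
    exact, conflicting = [], []
    for ts, group in group_by_timestamp(rows).items():
        if len(group) < 2:
            continue
        distinct = {tuple(sorted(r.items())) for r in group}
        (exact if len(distinct) == 1 else conflicting).append(ts)
    return exact, conflicting


# Header names are assumed; rows 1-2 are quoted from the report,
# rows 3-4 are a made-up exact duplicate.
sample = io.StringIO(
    "timestamp,solving_id,question_id,user_answer,elapsed_time\n"
    "1567140388553,219,q10649,a,57500\n"
    "1567140388553,219,q10648,b,57500\n"
    "1567140400000,220,q1,c,1000\n"
    "1567140400000,220,q1,c,1000\n"
)
rows = list(csv.DictReader(sample))
exact, conflicting = classify_collisions(rows)
```

Exact duplicates can probably be dropped safely. Whether conflicting rows are logging errors or several questions submitted together is something this check cannot decide on its own.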

kwonmha · 2022-01-20T08:47:32Z

Hello, @xiaoqtcd
I'm trying to reproduce the SAINT model on the KT1 dataset and got a worse AUC than reported in papers such as LPKT, SAINT+, and SAINT.
As my code worked fine with the Kaggle Riiid dataset, I guess my results are caused by the unclean state of the dataset.
How are your AUC or ACC on the KT1 dataset?
Are they good enough?

xiaoqtcd · 2022-01-20T10:15:29Z

Hi @kwonmha, I am mostly doing unsupervised learning at the moment, so I don't have AUC or ACC results. But I did some data analysis and found that there are many problems in the dataset. What is LPKT? Are you using the code for SAINT and SAINT+ provided by Riiid? I found that it is not straightforward even to reconstruct the Kaggle Riiid dataset's format from the raw EdNet dataset. Could you explain a bit how you did that?

kwonmha · 2022-01-20T10:48:09Z

Hi, @xiaoqtcd
LPKT is the model proposed in "Learning Process-consistent Knowledge Tracing" (KDD '21).

I used SAINT models implemented by participants of the Kaggle Riiid competition,
and modified the code to handle the KT1 dataset instead of the competition dataset.
I didn't reconstruct KT1 into the Riiid format.
I think the two are similar enough that only a few code modifications are needed to feed KT1 data into a SAINT implementation written for the Kaggle dataset (selecting columns, or comparing answers to determine whether each response is correct).
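The "compare answers" step mentioned above can be sketched as follows. This is only an illustration: `label_correctness` and `answer_key` are hypothetical names, the answer key is assumed to come from a question metadata file that maps each question id to its correct option, and the key values below are made up.

```python
def label_correctness(interactions, answer_key):
    """Attach a 0/1 `correct` label to each interaction.

    interactions: list of dicts with `question_id` and `user_answer`.
    answer_key: dict mapping question_id -> correct option (assumed to be
                built from the dataset's question metadata).
    Rows whose question is missing from the key are skipped, not guessed.
    """
    labeled = []
    for row in interactions:
        expected = answer_key.get(row["question_id"])
        if expected is None:
            continue
        labeled.append({**row, "correct": int(row["user_answer"] == expected)})
    return labeled


# Made-up answer key, for illustration only.
answer_key = {"q10649": "a", "q10648": "c"}
interactions = [
    {"question_id": "q10649", "user_answer": "a"},
    {"question_id": "q10648", "user_answer": "b"},
    {"question_id": "q99999", "user_answer": "d"},  # not in the key
]
labeled = label_correctness(interactions, answer_key)
```

Skipping unknown questions rather than labeling them incorrect keeps gaps in the metadata from silently lowering the measured accuracy.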

xiaoqtcd · 2022-01-20T13:47:57Z

Hi @kwonmha, thanks a lot for pointing out the LPKT paper. It's an interesting one. But note that in the Kaggle challenge dataset, prior_question_had_explanation and prior_question_elapsed_time are given, while in EdNet I think they need to be reconstructed from KT1 and KT3 together. task_container_id needs to be reconstructed as well. I am not sure whether cleaning EdNet matters much for accuracy, but I believe reconstructing the data format correctly is very important.
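A rough sketch of what such a reconstruction might look like for two of the fields, using KT1 alone. It rests on several assumptions that are not confirmed in this thread: that consecutive rows with the same `solving_id` form one container, that `prior_question_elapsed_time` is the mean elapsed time per question in the previous container, and that rows are already sorted by timestamp. `add_prior_fields` is a hypothetical helper, and prior_question_had_explanation is left out because it would need KT3.

```python
def add_prior_fields(rows):
    """Derive task_container_id and prior_question_elapsed_time for one user.

    rows: list of dicts with `solving_id` and `elapsed_time` (milliseconds),
          assumed to be sorted by timestamp.
    A new container starts whenever `solving_id` changes (assumption).
    The prior elapsed time is None for the first container.
    """
    out = []
    container = -1      # index of the current container
    current_id = None   # solving_id of the current container
    current_times = []  # elapsed times seen so far in the current container
    prev_mean = None    # mean elapsed time of the previous container
    for row in rows:
        if row["solving_id"] != current_id:
            if current_times:
                prev_mean = sum(current_times) / len(current_times)
            current_id = row["solving_id"]
            current_times = []
            container += 1
        current_times.append(row["elapsed_time"])
        out.append({
            **row,
            "task_container_id": container,
            "prior_question_elapsed_time": prev_mean,
        })
    return out


# Made-up rows, for illustration only.
rows = [
    {"solving_id": 1, "elapsed_time": 1000},
    {"solving_id": 2, "elapsed_time": 2000},
    {"solving_id": 2, "elapsed_time": 4000},
    {"solving_id": 3, "elapsed_time": 500},
]
result = add_prior_fields(rows)
```

If duplicate or conflicting timestamps are present, they would need to be resolved first, since the grouping depends on row order.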

kwonmha · 2022-01-24T11:32:42Z

@xiaoqtcd
As the SAINT model doesn't take prior_question_had_explanation or prior_question_elapsed_time as input, I didn't try to reconstruct KT1 into the Kaggle format.
I didn't use them when testing SAINT on the Kaggle dataset, and I don't want to use them when testing on KT1 data.

xiaoqtcd · 2022-01-24T12:10:31Z

I see. I am not familiar with SAINT, but for SAINT+ it seems to me that the temporal information and some other fields need to be reorganized to prepare the embeddings used in the model, so I thought you were doing that. There are many different implementations of SAINT+ on Kaggle, and it seems the authors from Riiid are on Kaggle as well. We can connect on Kaggle and discuss further if you'd like.
