
Comparing changes

base repository: HFUT-LEC/EduStudio
base: v1.0.0-beta2.1
head repository: HFUT-LEC/EduStudio
compare: main

Commits on Aug 30, 2023

  1. update version
     kervias committed Aug 30, 2023 (d4b2163)

Commits on Sep 20, 2023

  1. fcea303

Commits on Oct 18, 2023

  1. 5da8420
  2. update version
     kervias committed Oct 18, 2023 (2821297)

Commits on Nov 14, 2023

  1. 3009093

Commits on Nov 15, 2023

  1. Merge pull request #12 from badranX/dev
     Fix argparse related error in Jupyter
     kervias authored Nov 15, 2023 (e92b7d1)

Commits on Nov 24, 2023

  1. c5460c3

Commits on Nov 25, 2023

  1. Merge pull request #13 from badranX/qikt
     fix QIKT response data leakage
     kervias authored Nov 25, 2023 (ea1f688)

Commits on Dec 2, 2023

  1. 8395d57

Commits on Dec 3, 2023

  1. Merge pull request #14 from badranX/qdkt
     optemize QDKT laplacian matrix generation
     kervias authored Dec 3, 2023 (04f2622)
  2. revise DKT+ regularization
     badranX committed Dec 3, 2023 (5a4919f)
  3. Merge pull request #15 from badranX/dkt_plus
     revise DKT+ regularization
     kervias authored Dec 3, 2023 (b01e624)

Commits on Dec 4, 2023

  1. 4368f53
  2. ba3cbb6

Commits on Dec 6, 2023

  1. d0f771c
  2. 7c11254

Commits on Dec 7, 2023

  1. Merge pull request #16 from badranX/dkvmn_based
     Fix DKVMN based loss, labels-predictions mismatch
     kervias authored Dec 7, 2023 (18d7312)

Commits on Dec 18, 2023

  1. ba6a481
  2. 68dc147
  3. update version to v1.0.0
     kervias committed Dec 18, 2023 (eea15ba)

Commits on Dec 26, 2023

  1. update README
     kervias committed Dec 26, 2023 (8be11a7)

Commits on Dec 29, 2023

  1. Update models.md
     tzt-star authored Dec 29, 2023 (27a7d49)
  2. Update models.md
     tzt-star authored Dec 29, 2023 (908cccc)

Commits on Jan 31, 2024

  1. c7b1280

Commits on Feb 3, 2024

  1. 85536e0
  2. rename evaluate templates
     kervias committed Feb 3, 2024 (6eec637)
  3. 79e6efb
  4. add [NeurIPS 2023] DCD model
     kervias committed Feb 3, 2024 (2158f36)

Commits on Feb 4, 2024

  1. update docs
     kervias committed Feb 4, 2024 (34c13e0)
  2. add AdversarialTrainTPL
     kervias committed Feb 4, 2024 (b55f2c6)

Commits on Feb 6, 2024

  1. Update models.md
     add FairCD
     Update reference_table.md
     add faircd
     Update adversarial_traintpl.py
     Update __init__.py
     add faircd
     Create FAIRCDDataTPL.py
     Update __init__.py
     add faircd
     Create faircd_irt.py
     Create faircd_mirt.py
     Create faircd_ncdm.py
     Create run_faircd_irt_demo.py
     Create run_faircd_mirt_demo.py
     Create run_faircd_ncdm_demo.py
     tzt-star authored and kervias committed Feb 6, 2024 (5e8e1e3)
  2. fix FairCD Bugs
     kervias committed Feb 6, 2024 (de7ae62)

Commits on Feb 9, 2024

  1. update README and docs
     kervias committed Feb 9, 2024 (63ec33b)

Commits on Feb 11, 2024

  1. update README and version
     kervias committed Feb 11, 2024 (09c0693)

Commits on Feb 25, 2024

  1. reset default eval_batch_size
     kervias committed Feb 25, 2024 (3f659ef)

Commits on Feb 28, 2024

  1. e935469

Commits on Mar 5, 2024

  1. d4814e5
  2. 43d9bd8
  3. update version to v1.1.1
     kervias committed Mar 5, 2024 (6aa7649)

Commits on Mar 12, 2024

  1. 68611db

Commits on Apr 23, 2024

  1. [add] IdentifiabilityEvalTPL
     kervias committed Apr 23, 2024 (eb33395)

Commits on Jul 29, 2024

  1. add IDS metric
     kervias committed Jul 29, 2024 (071d9e9)
  2. add mf model
     kervias committed Jul 29, 2024 (6541b7c)
  3. update version
     kervias committed Jul 29, 2024 (72d6a32)
  4. 7d723a5
  5. update setup.py
     kervias committed Jul 29, 2024 (789d23f)

Commits on Sep 15, 2024

  1. update README
     kervias committed Sep 15, 2024 (c4c7413)

Commits on Dec 5, 2024

  1. fix mgcd bug
     tzt-star committed Dec 5, 2024 (8f6ae5d)
  2. fix mgcd bug
     tzt-star committed Dec 5, 2024 (f4ebcaa)
  3. 115da3d
Showing with 3,381 additions and 2,021 deletions.
  1. +1 −1 .github/workflows/python-publish.yml
  2. +36 −12 README.md
  3. BIN assets/framework.png
  4. +0 −1,202 assets/framework.svg
  5. BIN docs/source/assets/dataflow.jpg
  6. BIN docs/source/assets/framework.png
  7. +1 −1 docs/source/conf.py
  8. +4 −4 docs/source/developer_guide/customize_evaltpl.md
  9. +7 −7 docs/source/developer_guide/customize_traintpl.md
  10. +2 −6 docs/source/features/atomic_files.md
  11. +18 −19 docs/source/features/atomic_operations.md
  12. +19 −15 docs/source/features/dataset_folder_protocol.md
  13. +5 −7 docs/source/features/global_cfg_obj.md
  14. +2 −2 docs/source/features/inheritable_config.md
  15. +15 −0 docs/source/features/standard_datamodule.md
  16. +2 −2 docs/source/get_started/quick_start.md
  17. +14 −9 docs/source/index.rst
  18. +21 −0 docs/source/user_guide/atom_op.md
  19. +22 −23 docs/source/user_guide/datasets.md
  20. +54 −78 docs/source/user_guide/models.md
  21. +47 −44 docs/source/user_guide/reference_table.md
  22. +25 −11 docs/source/user_guide/usage/aht.md
  23. +4 −4 docs/source/user_guide/usage/run_edustudio.md
  24. +7 −7 docs/source/user_guide/usage/use_case_of_config.md
  25. +1 −1 edustudio/__init__.py
  26. +9 −6 edustudio/assets/datasets.yaml
  27. +0 −2 edustudio/atom_op/mid2cache/CD/data_split4cd.py
  28. +4 −3 edustudio/atom_op/mid2cache/KT/__init__.py
  29. +14 −102 edustudio/atom_op/mid2cache/KT/build_seq_inter_feats.py
  30. +3 −1 edustudio/atom_op/mid2cache/KT/cpt_as_exer.py
  31. +112 −0 edustudio/atom_op/mid2cache/KT/data_split4kt.py
  32. +3 −1 edustudio/atom_op/mid2cache/KT/gen_cpt_seq.py
  33. +1 −1 edustudio/atom_op/mid2cache/KT/gen_unfold_cpt_seq.py
  34. +4 −1 edustudio/atom_op/mid2cache/common/__init__.py
  35. +1 −1 edustudio/atom_op/mid2cache/common/build_cpt_relation.py
  36. +0 −10 edustudio/atom_op/mid2cache/common/build_dtinfo.py
  37. +68 −0 edustudio/atom_op/mid2cache/common/build_missing_Q.py
  38. +112 −0 edustudio/atom_op/mid2cache/common/fill_missing_Q.py
  39. +30 −0 edustudio/atom_op/mid2cache/common/filtering_records_by_attr.py
  40. +2 −2 edustudio/atom_op/mid2cache/single/M2C_CL4KT_OP.py
  41. +12 −2 edustudio/atom_op/mid2cache/single/M2C_QDKT_OP.py
  42. +4 −1 edustudio/atom_op/raw2mid/__init__.py
  43. +1 −1 edustudio/atom_op/raw2mid/nips12.py
  44. +91 −0 edustudio/atom_op/raw2mid/slp_english.py
  45. +86 −0 edustudio/atom_op/raw2mid/slp_math.py
  46. +70 −0 edustudio/datatpl/CD/DCDDataTPL.py
  47. +7 −0 edustudio/datatpl/CD/FAIRDataTPL.py
  48. +2 −2 edustudio/datatpl/CD/RCDDataTPL.py
  49. +3 −1 edustudio/datatpl/CD/__init__.py
  50. +1 −1 edustudio/datatpl/KT/CL4KTDataTPL.py
  51. +1 −1 edustudio/datatpl/KT/DIMKTDataTPL.py
  52. +1 −1 edustudio/datatpl/KT/DKTDSCDataTPL.py
  53. +1 −1 edustudio/datatpl/KT/DKTForgetDataTPL.py
  54. +1 −1 edustudio/datatpl/KT/EERNNDataTPL.py
  55. +1 −1 edustudio/datatpl/KT/EKTDataTPL.py
  56. +1 −1 edustudio/datatpl/KT/GKTDataTPL.py
  57. +1 −1 edustudio/datatpl/KT/KTInterCptAsExerDataTPL.py
  58. +1 −1 edustudio/datatpl/KT/KTInterCptUnfoldDataTPL.py
  59. +1 −1 edustudio/datatpl/KT/KTInterDataTPL.py
  60. +1 −1 edustudio/datatpl/KT/KTInterExtendsQDataTPL.py
  61. +1 −1 edustudio/datatpl/KT/LPKTDataTPL.py
  62. +1 −1 edustudio/datatpl/KT/QDKTDataTPL.py
  63. +1 −1 edustudio/datatpl/KT/RKTDataTPL.py
  64. +16 −4 edustudio/datatpl/common/base_datatpl.py
  65. +31 −13 edustudio/datatpl/common/general_datatpl.py
  66. +1 −1 edustudio/datatpl/utils/common.py
  67. +9 −4 edustudio/datatpl/utils/pad_seq_util.py
  68. +1 −1 edustudio/datatpl/utils/spliter_util.py
  69. +4 −2 edustudio/evaltpl/__init__.py
  70. +2 −0 edustudio/evaltpl/base_evaltpl.py
  71. +0 −198 edustudio/evaltpl/cd_evaltpl.py
  72. +84 −0 edustudio/evaltpl/fairness_evaltpl.py
  73. +104 −0 edustudio/evaltpl/identifiability_evaltpl.py
  74. +390 −0 edustudio/evaltpl/interpretability_evaltpl.py
  75. +3 −1 edustudio/evaltpl/{bc_evaltpl.py → prediction_evaltpl.py}
  76. +4 −1 edustudio/model/CD/__init__.py
  77. +491 −0 edustudio/model/CD/dcd.py
  78. +318 −0 edustudio/model/CD/faircd.py
  79. +2 −2 edustudio/model/CD/irt.py
  80. +52 −0 edustudio/model/CD/mf.py
  81. +1 −1 edustudio/model/CD/mirt.py
  82. +149 −33 edustudio/model/KT/ct_ncm.py
  83. +2 −2 edustudio/model/KT/deep_irt.py
  84. +6 −2 edustudio/model/KT/dkt_plus.py
  85. +2 −2 edustudio/model/KT/dkvmn.py
  86. +3 −6 edustudio/model/KT/qikt.py
  87. +4 −3 edustudio/model/KT/sakt.py
  88. +8 −2 edustudio/quickstart/parse_cfg.py
  89. +3 −1 edustudio/settings.py
  90. +4 −1 edustudio/traintpl/__init__.py
  91. +98 −0 edustudio/traintpl/adversarial_traintpl.py
  92. +22 −3 edustudio/traintpl/atkt_traintpl.py
  93. +154 −0 edustudio/traintpl/dcd_traintpl.py
  94. +4 −6 edustudio/traintpl/gd_traintpl.py
  95. +33 −5 edustudio/traintpl/{edu_traintpl.py → general_traintpl.py}
  96. +89 −0 edustudio/traintpl/group_cd_traintpl.py
  97. +1 −6 edustudio/utils/callback/callbacks/history.py
  98. +1 −1 edustudio/utils/common/__init__.py
  99. +39 −4 edustudio/utils/common/configUtil.py
  100. +13 −2 examples/1.run_cd_demo.py
  101. +2 −2 examples/2.run_kt_demo.py
  102. +2 −2 examples/3.run_with_customized_tpl.py
  103. +8 −5 examples/5.run_with_hyperopt.py
  104. +12 −3 examples/6.run_with_ray.tune.py
  105. +2 −2 examples/single_model/run_akt_demo.py
  106. +1 −1 examples/single_model/run_atkt_demo.py
  107. +4 −3 examples/single_model/run_cdgk_demo.py
  108. +2 −2 examples/single_model/run_cdmfkc_demo.py
  109. +2 −2 examples/single_model/run_ckt_demo.py
  110. +2 −2 examples/single_model/run_cl4kt_demo.py
  111. +2 −2 examples/single_model/run_cncd_f_demo.py
  112. +2 −2 examples/single_model/run_cncdq_demo.py
  113. +2 −2 examples/single_model/run_ctncm_demo.py
  114. +54 −0 examples/single_model/run_dcd_demo.py
  115. +2 −2 examples/single_model/run_deepirt_demo.py
  116. +2 −3 examples/single_model/run_dimkt_demo.py
  117. +2 −2 examples/single_model/run_dina_demo.py
  118. +2 −2 examples/single_model/run_dkt_demo.py
  119. +2 −2 examples/single_model/run_dkt_dsc_demo.py
  120. +2 −2 examples/single_model/run_dkt_plus_demo.py
  121. +2 −2 examples/single_model/run_dktforget_demo.py
  122. +2 −2 examples/single_model/run_dkvmn_demo.py
  123. +2 −2 examples/single_model/run_dtransformer_demo.py
  124. +2 −2 examples/single_model/run_ecd_demo.py
  125. +2 −2 examples/single_model/run_eernn_demo.py
  126. +2 −2 examples/single_model/run_ekt_demo.py
  127. +25 −0 examples/single_model/run_faircd_irt_demo.py
  128. +25 −0 examples/single_model/run_faircd_mirt_demo.py
  129. +25 −0 examples/single_model/run_faircd_ncdm_demo.py
  130. +2 −2 examples/single_model/run_gkt_demo.py
  131. +2 −2 examples/single_model/run_hawkeskt_demo.py
  132. +2 −2 examples/single_model/run_hiercdf_demo.py
  133. +2 −2 examples/single_model/run_iekt_demo.py
  134. +2 −2 examples/single_model/run_irr_demo.py
  135. +2 −2 examples/single_model/run_irt_demo.py
  136. +2 −2 examples/single_model/run_kancd_demo.py
  137. +2 −2 examples/single_model/run_kqn_demo.py
  138. +2 −2 examples/single_model/run_kscd_demo.py
  139. +2 −2 examples/single_model/run_lpkt_demo.py
  140. +2 −2 examples/single_model/run_lpkt_s_demo.py
  141. +24 −0 examples/single_model/run_mf_demo.py
  142. +5 −5 examples/single_model/run_mgcd_demo.py
  143. +2 −2 examples/single_model/run_mirt_demo.py
  144. +3 −3 examples/single_model/run_ncdm_demo.py
  145. +2 −2 examples/single_model/run_qdkt_demo.py
  146. +2 −2 examples/single_model/run_qikt_demo.py
  147. +2 −2 examples/single_model/run_rcd_demo.py
  148. +2 −2 examples/single_model/run_rkt_demo.py
  149. +2 −2 examples/single_model/run_saint_demo.py
  150. +2 −2 examples/single_model/run_saint_plus_demo.py
  151. +2 −2 examples/single_model/run_sakt_demo.py
  152. +2 −2 examples/single_model/run_simplekt_demo.py
  153. +2 −2 examples/single_model/run_skvmn_demo.py
  154. +1 −1 setup.py
  155. +2 −2 tests/test_run.py
2 changes: 1 addition & 1 deletion .github/workflows/python-publish.yml
@@ -23,7 +23,7 @@ jobs:
python -m pip install --upgrade pip
pip install build
pip install pytest
pip install torch==1.12.1 --index-url https://download.pytorch.org/whl/cpu
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -e . --verbose
pip install -r requirements.txt
- name: Test
48 changes: 36 additions & 12 deletions README.md
@@ -6,24 +6,29 @@
<img src="https://img.shields.io/badge/pytorch-v1.10+-blue">
<img src="https://img.shields.io/badge/License-MIT-blue">
<img src="https://img.shields.io/github/issues/HFUT-LEC/EduStudio.svg">
<a href="https://journal.hep.com.cn/fcs/EN/10.1007/s11704-024-40372-3">
<img src="https://img.shields.io/badge/Paper-EduStudio-blue" alt="Paper EduStudio Badge">
</a>
</p>

EduStudio is a unified and templatized framework for student assessment models, including Cognitive Diagnosis (CD) and Knowledge Tracing (KT), based on PyTorch.
EduStudio is a unified library for student cognitive modeling, including Cognitive Diagnosis (CD) and Knowledge Tracing (KT), based on PyTorch.

## Announcement
## Navigation

- We are working hard to reproduce, for all models, the results presented in their original papers. These results will be published later at https://edustudio.ai/.
- We are organizing more comprehensive resources related to student assessment models to build a complete ecosystem for EduStudio.

## Description
EduStudio first decomposes the general algorithmic workflow into five steps: `configuration reading`, `data processing`, `model implementation`, `training control`, and `result evaluation`. Subsequently, to enhance the `reusability` of each step, we extract the commonalities of each algorithm at each step into individual templates for templatization.
| Resource Name | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [Eco-Repository](https://github.com/HFUT-LEC/awesome-student-cognitive-modeling) | A repository containing resources about student cognitive modeling: [papers](https://github.com/HFUT-LEC/awesome-student-cognitive-modeling/tree/main/papers), [datasets](https://github.com/HFUT-LEC/awesome-student-cognitive-modeling/tree/main/datasets), [conferences&journals](https://github.com/HFUT-LEC/awesome-student-cognitive-modeling/tree/main/conferences%26journals) |
| [Eco-Leaderboard](https://leaderboard.edustudio.ai) | A leaderboard demonstrating performance of implemented models |
| [EduStudio Documentation](https://edustudio.readthedocs.io/) | The document for EduStudio usage |
| [Reference Table](https://edustudio.readthedocs.io/en/latest/user_guide/reference_table.html) | The reference table demonstrating the corresponding templates of each model |

As illustrated in the Figure below, to better implement a templatized framework, we implement an `inheritance-style` EduStudio that contains a basic architecture and an inherited architecture with different responsibilities. The **basic architecture emphasizes domain-irrelevant content and strives to build templatized protocols**. The **inherited architecture obeys the protocol in the basic architecture and focuses on domain-relevant content**. The inheritance style separates domain-relevant and domain-irrelevant content, greatly simplifying the framework structure and enhancing `readability`.
## Description

The documentation is available [here](https://edustudio.readthedocs.io).
EduStudio first decomposes the general algorithmic workflow into six steps: `configuration reading`, `data preparation`, `model implementation`, `training control`, `model evaluation`, and `log storage`. Subsequently, to enhance the `reusability` and `scalability` of each step, we extract the commonalities of each algorithm at each step into individual templates for templatization.

<p align="center">
<img src="assets/framework.svg" alt="EduStudio Architecture" width="600">
<img src="assets/framework.png" alt="EduStudio Architecture" width="600">
<br>
<b>Figure</b>: Overall Architecture of EduStudio
</p>
@@ -46,7 +51,7 @@ run_edustudio(
dataset='FrcSub',
cfg_file_name=None,
traintpl_cfg_dict={
'cls': 'EduTrainTPL',
'cls': 'GeneralTrainTPL',
},
datatpl_cfg_dict={
'cls': 'CDInterExtendsQDataTPL'
@@ -55,15 +60,34 @@ run_edustudio(
'cls': 'NCDM',
},
evaltpl_cfg_dict={
'clses': ['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL'],
'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
}
)

```

To find out which templates are used for a model, see the [Reference Table](https://edustudio.readthedocs.io/en/latest/user_guide/reference_table.html).

## Citation
```
@article{wu2025edustudio,
  author = {Le Wu and Xiangzhi Chen and Fei Liu and Junsong Xie and Chenao Xia and Zhengtao Tan and Mi Tian and Jinglong Li and Kun Zhang and Defu Lian and Richang Hong and Meng Wang},
  title = {EduStudio: towards a unified library for student cognitive modeling},
  journal = {Frontiers of Computer Science},
  year = {2025},
  volume = {19},
  number = {8},
  pages = {198342},
  keywords = {open-source library; student cognitive modeling; intelligence education},
  url = {https://journal.hep.com.cn/fcs/EN/abstract/article_47994.shtml},
  doi = {10.1007/s11704-024-40372-3}
}
```


## License

EduStudio uses [MIT License](https://github.com/HFUT-LEC/EduStudio/blob/main/LICENSE).

Binary file added assets/framework.png
1,202 changes: 0 additions & 1,202 deletions assets/framework.svg

This file was deleted.

Binary file added docs/source/assets/dataflow.jpg
Binary file modified docs/source/assets/framework.png
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -9,7 +9,7 @@
project = 'EduStudio'
copyright = '2023, HFUT-LEC'
author = 'HFUT-LEC'
release = 'v1.0.0-beta2.1'
release = 'v1.1.4'

import sphinx_rtd_theme
import os
8 changes: 4 additions & 4 deletions docs/source/developer_guide/customize_evaltpl.md
@@ -23,14 +23,14 @@ The protocols in ``BaseEvalTPL`` are listed as follows.
EvalTPLs
----------------------

EduStudio provides ``BinaryClassificationEvalTPL`` and ``CognitiveDiagnosisEvalTPL``, which inherit ``BaseEvalTPL``.
EduStudio provides ``PredictionEvalTPL`` and ``InterpretabilityEvalTPL``, which inherit ``BaseEvalTPL``.

### BinaryClassificationEvalTPL
### PredictionEvalTPL
This EvalTPL is for model evaluation using binary classification metrics.
The protocols in ``BinaryClassificationEvalTPL`` are listed as follows.
The protocols in ``PredictionEvalTPL`` are listed as follows.


### CognitiveDiagnosisEvalTPL
### InterpretabilityEvalTPL
This EvalTPL evaluates model interpretability. It uses the states of students and the Q matrix for ``eval``, which are domain-specific in student assessment.

## Develop a New EvalTPL in EduStudio
14 changes: 7 additions & 7 deletions docs/source/developer_guide/customize_traintpl.md
@@ -8,19 +8,19 @@ The TrainTPL Protocol is detailed in ``BaseTrainTPL``. The function to start the

## TrainTPLs

By inheriting the TrainTPL Protocol, EduStudio provides the classes ``EduStudio.edustudio.traintpl.traintpl.gd_traintpl.GDTrainTPL`` (``GDTrainTPL``) and ``EduStudio.edustudio.traintpl.edu_traintpl.EduTrainTPL`` (``EduTrainTPL``), which are suitable for most gradient descent optimization-based models and most student evaluation models. ``GDTrainTPL`` inherits ``BaseTrainTPL`` and rewrites ``start()``. The function that gets the optimizer according to the parameter ``default_cfg.optim`` is ``GDTrainTPL._get_optim()``. The function that obtains the loaders of the train, val, and test datasets is ``GDTrainTPL.build_loaders()``. ``EduTrainTPL`` inherits ``GDTrainTPL`` and rewrites ``start()``. In ``EduTrainTPL.start()``, the function applied to each dataloader is ``EduTrainTPL.fit()``.
By inheriting the TrainTPL Protocol, EduStudio provides the classes ``EduStudio.edustudio.traintpl.traintpl.gd_traintpl.GDTrainTPL`` (``GDTrainTPL``) and ``EduStudio.edustudio.traintpl.edu_traintpl.GeneralTrainTPL`` (``GeneralTrainTPL``), which are suitable for most gradient descent optimization-based models and most student evaluation models. ``GDTrainTPL`` inherits ``BaseTrainTPL`` and rewrites ``start()``. The function that gets the optimizer according to the parameter ``default_cfg.optim`` is ``GDTrainTPL._get_optim()``. The function that obtains the loaders of the train, val, and test datasets is ``GDTrainTPL.build_loaders()``. ``GeneralTrainTPL`` inherits ``GDTrainTPL`` and rewrites ``start()``. In ``GeneralTrainTPL.start()``, the function applied to each dataloader is ``GeneralTrainTPL.fit()``.

## Develop a New TrainTPL in EduStudio

If the developed model needs a more complex training method, one can inherit ``BaseTrainTPL`` and revise the function ``start()``. One can also define the configuration of the new training template in the dictionary ``default_cfg``. Similarly, one can inherit ``GDTrainTPL`` and ``EduTrainTPL`` and revise the ``start`` function and the ``default_cfg`` dictionary.
If the developed model needs a more complex training method, one can inherit ``BaseTrainTPL`` and revise the function ``start()``. One can also define the configuration of the new training template in the dictionary ``default_cfg``. Similarly, one can inherit ``GDTrainTPL`` and ``GeneralTrainTPL`` and revise the ``start`` function and the ``default_cfg`` dictionary.
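The inheritance pattern itself can be sketched in a few lines. This is a self-contained sketch: the stub ``GDTrainTPL`` below only stands in for the real class in ``edustudio.traintpl``, and the ``'finished'`` return value is invented for illustration.

```python
# Stub standing in for edustudio's GDTrainTPL, to keep the sketch self-contained.
class GDTrainTPL:
    default_cfg = {'optim': 'adam'}

    def start(self):
        # the real start() runs the training loop; here it just reports completion
        return 'finished'


class NewTrainTPL(GDTrainTPL):
    # extra options for the new template go into the inheritable default_cfg
    default_cfg = {'epoch_to_change': 10}

    def start(self):
        # customize the training entry point, then delegate to the parent
        self.epoch_to_change = self.default_cfg['epoch_to_change']
        return super().start()


tpl = NewTrainTPL()
print(tpl.start())  # → finished
```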

Example
-------------------------
If you need to modify a TrainTPL in a student assessment model so that only ``main_loss`` is used after a certain epoch, you just need to inherit ``EduTrainTPL`` and set the ``epoch_to_change`` parameter in ``default_cfg``.
If you need to modify a TrainTPL in a student assessment model so that only ``main_loss`` is used after a certain epoch, you just need to inherit ``GeneralTrainTPL`` and set the ``epoch_to_change`` parameter in ``default_cfg``.

```python
from .edu_traintpl import EduTrainTPL
class NewTrainTPL(EduTrainTPL):
from .edu_traintpl import GeneralTrainTPL
class NewTrainTPL(GeneralTrainTPL):
default_cfg = {
'epoch_to_change': 10,
}
@@ -59,8 +59,8 @@ def fit(self, train_loader, val_loader):
The complete code of example is detailed as follows.

```python
from .edu_traintpl import EduTrainTPL
class NewTrainTPL(EduTrainTPL):
from .edu_traintpl import GeneralTrainTPL
class NewTrainTPL(GeneralTrainTPL):
default_cfg = {
'epoch_to_change': 10,
}
8 changes: 2 additions & 6 deletions docs/source/features/atomic_files.md
@@ -1,12 +1,8 @@
# Atomic File Protocol
# Middle Data Format Protocol

In `EduStudio`, we adopt a flexible CSV (Comma-Separated Values) file format following [Recbole](https://recbole.io/atomic_files.html). The flexible CSV format is defined in the `middata` stage of a dataset (see the dataset stage protocol for details).

The atomic file protocol includes two parts: `Columns name Format` and `Filename Format`.

**Note**: The atomic files protocol is defined in `Inherited Architecture`. In fact, users can abandon the atomic files protocol by inheriting the data template protocol class in `Basic Architecture`(i.e. `BaseDataTPL`).


The Middle Data Format Protocol includes two parts: `Columns name Format` and `Filename Format`.

## Columns Name Format
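The body of this section is truncated in this diff view; as an illustrative sketch, the convention joins a feature name and a feature type with a colon in each CSV header cell (RecBole-style). The column names below are assumptions, not the protocol's mandated fields.

```python
import csv
import io

# Hypothetical middata CSV: each header cell is "<feature_name>:<feature_type>"
middata_csv = """stu_id:token,exer_id:token,label:float
0,3,1.0
1,7,0.0
"""

rows = list(csv.DictReader(io.StringIO(middata_csv)))
# split each header cell into a (name, type) pair
schema = dict(col.split(':') for col in rows[0])
print(schema)  # → {'stu_id': 'token', 'exer_id': 'token', 'label': 'float'}
```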

37 changes: 18 additions & 19 deletions docs/source/features/atomic_operations.md
@@ -1,31 +1,30 @@
# Atomic Operations
# Atomic Data Operation Protocol

In `Edustudio`, we view the dataset from three stages: `rawdata`, `middata`, `cachedata`.

we treat the whole data processing as multiple atomic operations called atomic operation sequence.
We treat the whole data processing as multiple atomic operations called atomic operation sequence.
The first atomic operation, inheriting the protocol class `BaseRaw2Mid`, is the process from raw data to middle data.
The following atomic operations, inheriting the protocol class `BaseMid2Cache`, construct the process from middle data to cache data.

The atomic operation protocol can be seen at `Atomic Operation Protocol`.

## Partial Atomic Operation Table


## Atomic Operation Table

In the following, we give a table to display existing atomic operations.
In the following, we give a table to display some existing atomic operations. For a more detailed table, please see `user_guide/Atomic Data Operation List`.

### Raw2Mid

| name | description |
For the conversion from rawdata to middata, we implement a specific atomic data operation prefixed with `R2M` for each dataset.

| name | Corresponding dataset |
| --------------- | ------------------------------------------------------------ |
| R2M_ASSIST_0910 | The atomic operation that process the Assistment_0910 dataset from rawdata into midata |
| R2M_FrcSub | The atomic operation that process the FrcSub dataset from rawdata into midata |
| R2M_ASSIST_1213 | The atomic operation that process the Assistment_1213 dataset from rawdata into midata |
| R2M_Math1 | The atomic operation that process the Math1dataset from rawdata into midata |
| R2M_Math2 | The atomic operation that process the Math2 dataset from rawdata into midata |
| R2M_AAAI_2023 | The atomic operation that process the AAAI 2023 challenge dataset from rawdata into midata |
| R2M_Algebra_0506 | The atomic operation that process the Algebra 2005-2006 dataset from rawdata into midata |
| R2M_ASSIST_1516 | The atomic operation that process the Assistment 2015-2016 dataset from rawdata into midata |
| R2M_ASSIST_0910 | ASSISTment 2009-2010 |
| R2M_FrcSub | Frcsub |
| R2M_ASSIST_1213 | ASSISTment 2012-2013 |
| R2M_Math1 | Math1 |
| R2M_Math2 | Math2 |
| R2M_AAAI_2023 | AAAI 2023 Global Knowledge Tracing Challenge |
| R2M_Algebra_0506 | Algebra 2005-2006 |
| R2M_ASSIST_1516 | ASSISTment 2015-2016 |

### Mid2Cache

@@ -50,7 +49,7 @@ In the following, we give a table to display existing atomic operations.
| name | description |
| ---------------------- | ------------------------------------------- |
| M2C_BuildSeqInterFeats | Build Sequential Features and Split dataset |
| M2C_CptAsExer | Treat knowledge concept as exercise |
| M2C_GenCptSeq | Generate knowledge concept seq |
| M2C_GenUnFoldCptSeq | Unfold knowledge concepts |
| M2C_KCAsExer | Treat knowledge concept as exercise |
| M2C_GenKCSeq | Generate knowledge concept seq |
| M2C_GenUnFoldKCSeq | Unfold knowledge concepts |
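To make the shape of these operations concrete, here is a minimal self-contained sketch of a Mid2Cache-style transform chained in an atomic operation sequence. The class name, the `process` signature, and the record layout are hypothetical; real operations inherit `BaseMid2Cache`.

```python
class M2C_Label2IntSketch:
    """Hypothetical Mid2Cache-style op: binarize a fractional label field."""

    def process(self, records):
        # records: list of dicts in the middata column-name format
        return [{**r, 'label:float': int(r['label:float'] >= 0.5)} for r in records]


middata = [
    {'stu_id:token': 0, 'exer_id:token': 3, 'label:float': 0.25},
    {'stu_id:token': 1, 'exer_id:token': 7, 'label:float': 1.0},
]

# an atomic operation sequence: each op consumes the previous op's output
ops = [M2C_Label2IntSketch()]
for op in ops:
    middata = op.process(middata)

print([r['label:float'] for r in middata])  # → [0, 1]
```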

34 changes: 19 additions & 15 deletions docs/source/features/dataset_folder_protocol.md
@@ -1,6 +1,9 @@
# Dataset Stage Protocol
# Dataset Status Protocol

In `Edustudio`, we view the dataset as three stages: `rawdata`, `middata`, `cachedata`.
In `Edustudio`, we view the dataset as three statuses: `rawdata`, `middata`, `cachedata`.
- inconsistent rawdata: the original data format provided by the dataset publisher.
- standardized middata: the standardized middle data format (see the Middle Data Format Protocol) defined by EduStudio.
- model-friendly cachedata: the data format that is convenient for model usage.


## Dataset Folder Format Example
@@ -51,18 +54,18 @@ run_edustudio(
dataset='FrcSub',
cfg_file_name=None,
traintpl_cfg_dict={
'cls': 'EduTrainTPL',
'cls': 'GeneralTrainTPL',
},
datatpl_cfg_dict={
'cls': 'CDInterExtendsQDataTPL',
'load_data_from": "rawdata", # specify the loading stage of the dataset
'load_data_from': "rawdata", # specify the loading stage of the dataset
'raw2mid_op': 'R2M_FrcSub' # specify the R2M atomic operation
},
modeltpl_cfg_dict={
'cls': 'KaNCD',
},
evaltpl_cfg_dict={
'clses': ['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL'],
'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
}
)
```
@@ -78,19 +81,19 @@ run_edustudio(
dataset='FrcSub',
cfg_file_name=None,
traintpl_cfg_dict={
'cls': 'EduTrainTPL',
'cls': 'GeneralTrainTPL',
},
datatpl_cfg_dict={
'cls': 'CDInterExtendsQDataTPL',
'load_data_from": "middata", # specify the loading stage of the dataset
'is_save_cache': True # whether to save cache data
'load_data_from': "middata", # specify the loading stage of the dataset
'is_save_cache': True, # whether to save cache data
'cache_id': 'cache_default', # cache id, valid when is_save_cache=True
},
modeltpl_cfg_dict={
'cls': 'KaNCD',
},
evaltpl_cfg_dict={
'clses': ['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL'],
'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
}
)
```
@@ -107,18 +110,19 @@ run_edustudio(
dataset='FrcSub',
cfg_file_name=None,
traintpl_cfg_dict={
'cls': 'EduTrainTPL',
'cls': 'GeneralTrainTPL',
},
datatpl_cfg_dict={
'cls': 'CDInterExtendsQDataTPL',
'load_data_from": "cachedata", # specify the loading stage of the dataset
'load_data_from': "cachedata", # specify the loading stage of the dataset
'is_save_cache': False,
'cache_id': 'cache_default', # cache id, valid when is_save_cache=True
},
modeltpl_cfg_dict={
'cls': 'KaNCD',
},
evaltpl_cfg_dict={
'clses': ['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL'],
'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
}
)
```
@@ -141,11 +145,11 @@ run_edustudio(
dataset='FrcSub',
cfg_file_name=None,
traintpl_cfg_dict={
'cls': 'EduTrainTPL',
'cls': 'GeneralTrainTPL',
},
datatpl_cfg_dict={
'cls': 'CDInterExtendsQDataTPL',
'load_data_from": "rawdata", # specify the loading stage of the dataset
'load_data_from': "rawdata", # specify the loading stage of the dataset
'raw2mid_op': 'R2M_FrcSub',
# the 'mid2cache_op_seq' option specify the atomic operation sequence
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_FilterRecords4CD', 'M2C_ReMapId', 'M2C_RandomDataSplit4CD', 'M2C_GenQMat'],
@@ -154,7 +158,7 @@ run_edustudio(
'cls': 'KaNCD',
},
evaltpl_cfg_dict={
'clses': ['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL'],
'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
}
)
```
12 changes: 5 additions & 7 deletions docs/source/features/global_cfg_obj.md
Original file line number Diff line number Diff line change
@@ -16,14 +16,12 @@ The description of five config objects is illustrated in Table below.



## Four Entry Points of Configuration
## Four Configuration Portals

There are four entry points of configuration:
There are four configuration portals:

- default_cfg: inheritable class varible
- config file
- parameter dict
- default_cfg: inheritable Python class variable
- configuration file
- parameter dictionary
- command line
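The merge of the four portals can be sketched as follows. This is a minimal illustration only: the priority order shown (command line highest, default_cfg lowest) is an assumption, and the keys are invented, not actual EduStudio parameters.

```python
# Hypothetical sketch of merging the four configuration portals.
# Assumption: later portals override earlier ones; the real EduStudio
# merge logic may differ.
default_cfg = {'batch_size': 256, 'device': 'cpu', 'epoch_num': 100}
file_cfg = {'batch_size': 512}       # from a YAML configuration file
dict_cfg = {'device': 'cuda:0'}      # parameter dictionary passed in code
cli_cfg = {'epoch_num': 20}          # parsed from command-line arguments

# dict unpacking keeps the last occurrence of each key
final_cfg = {**default_cfg, **file_cfg, **dict_cfg, **cli_cfg}
print(final_cfg)  # {'batch_size': 512, 'device': 'cuda:0', 'epoch_num': 20}
```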



4 changes: 2 additions & 2 deletions docs/source/features/inheritable_config.md
@@ -1,12 +1,12 @@
# Inheritable Configuration
# Inheritable Default Configuration

The management of default configuration in Edustudio is implemented by a class variable, i.e., a dictionary object called default_config.

Templates usually introduce new features through inheritance, and these new features may require corresponding configurations, so the default configuration we provide is inheritable.

## Example

The inheritance example of data template is illustrated as follows:
The inheritance example of the data template is illustrated as follows. We present an example from the data preparation procedure. There are three data template classes (DataTPLs) that inherit from each other: BaseDataTPL, GeneralDataTPL, and EduDataTPL. If users specify EduDataTPL as the current DataTPL, the eventual default\_config of the data preparation procedure is a merger of the default\_cfg of the three templates. When a configuration conflict is encountered, the default\_config of the subclass template takes precedence over that of its parent class templates. As a result, the other configuration portals (i.e., configuration file, parameter dictionary, and command line) can only specify parameters that are confined within the default configuration. The advantage of this inheritable design is that it helps users locate the numerous hyperparameters.
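The merge described above can be sketched as follows. The class names come from the documentation, but the helper function and the concrete configuration keys are hypothetical, chosen only to show subclass precedence.

```python
# Hypothetical sketch of an MRO-based default_cfg merge; the keys shown
# are illustrative, only the class names appear in the documentation.
class BaseDataTPL:
    default_cfg = {'seed': 2023}

class GeneralDataTPL(BaseDataTPL):
    default_cfg = {'seed': 2024, 'load_data_from': 'middata'}  # overrides seed

class EduDataTPL(GeneralDataTPL):
    default_cfg = {'n_folds': 1}

def merged_default_cfg(cls):
    cfg = {}
    # walk the MRO from base class to subclass, so subclass values win
    for klass in reversed(cls.__mro__):
        cfg.update(getattr(klass, 'default_cfg', {}))
    return cfg

print(merged_default_cfg(EduDataTPL))
# {'seed': 2024, 'load_data_from': 'middata', 'n_folds': 1}
```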

```python
class BaseDataTPL(Dataset):
15 changes: 15 additions & 0 deletions docs/source/features/standard_datamodule.md
@@ -0,0 +1,15 @@
# Standardized Data Module

For the data module, we provide a standardized design with three protocols (see the following sections for details):
- Data Status Protocol
- Middle Data Format Protocol
- Atomic Operation Protocol

![](../assets/dataflow.jpg)

The first step of Data Templates is to load the raw data from the hard disk. Then, a series of processing steps are performed to obtain model-friendly data objects. Finally, these data objects are passed on to other modules.
We simplify the data preparation into three stages:

- Data loading: Loading necessary data from the hard disk.
- Data processing: Converting the raw data into model-friendly data objects through a range of data processing operations.
- Data delivery: Delivering model-friendly data objects to the training, model, and evaluation templates.
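The three stages above can be sketched as a tiny pipeline. All function names and the data layout here are invented for illustration and are not actual EduStudio APIs.

```python
# Hypothetical sketch of the three-stage data preparation flow.
import csv
import io

def load_raw(text):
    # Data loading: read raw interaction records (here, from a CSV string)
    return list(csv.DictReader(io.StringIO(text)))

def process(rows):
    # Data processing: convert raw rows into model-friendly tuples
    return [(r['stu_id'], r['exer_id'], int(r['label'])) for r in rows]

def deliver(samples):
    # Data delivery: hand the data objects over to other templates
    return {'train': samples}

raw = "stu_id,exer_id,label\ns0,e1,1\ns0,e2,0\n"
data = deliver(process(load_raw(raw)))
print(data)  # {'train': [('s0', 'e1', 1), ('s0', 'e2', 0)]}
```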
4 changes: 2 additions & 2 deletions docs/source/get_started/quick_start.md
@@ -13,7 +13,7 @@ run_edustudio(
dataset='FrcSub',
cfg_file_name=None,
traintpl_cfg_dict={
'cls': 'EduTrainTPL',
'cls': 'GeneralTrainTPL',
},
datatpl_cfg_dict={
'cls': 'CDInterExtendsQDataTPL'
@@ -22,7 +22,7 @@ run_edustudio(
'cls': 'KaNCD',
},
evaltpl_cfg_dict={
'clses': ['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL'],
'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
}
)
```
23 changes: 14 additions & 9 deletions docs/source/index.rst
@@ -1,22 +1,25 @@
.. EduStudio documentation master file.
.. title:: EduStudio v1.0.0-beta2.1
.. title:: EduStudio v1.1.4
.. image:: assets/logo.png

=========================================================

`HomePage <https://edustudio.ai/>`_ | `Docs <https://edustudio.ai/docs/>`_ | `GitHub <https://github.com/HFUT-LEC/EduStudio>`_
`HomePage <https://edustudio.ai/>`_ | `Docs <https://edustudio.ai/docs/>`_ | `GitHub <https://github.com/HFUT-LEC/EduStudio>`_ | `Paper <https://journal.hep.com.cn/fcs/EN/10.1007/s11704-024-40372-3>`_

Introduction
-------------------------
EduStudio is a Unified and Templatized Framework for Student Assessment Models including Cognitive Diagnosis(CD) and Knowledge Tracing(KT) based on Pytorch.
EduStudio is a unified library for student assessment models, including Cognitive Diagnosis (CD) and Knowledge Tracing (KT), based on PyTorch.

EduStudio first decomposes the general algorithmic workflow into five steps: ``configuration reading``, ``data processing``, ``model implementation``, ``training control``, and ``result evaluation``. Subsequently, to enhance the ``reusability`` of each step, we extract the commonalities of each algorithm at each step into individual templates for templatization.
EduStudio first decomposes the general algorithmic workflow into six steps: `configuration reading`, `data preparation`, `model implementation`, `training control`, `model evaluation`, and `log storage`. Subsequently, to enhance the `reusability` and `scalability` of each step, we extract the commonalities of each algorithm at each step into individual templates for templatization.

As illustrated in the Figure below, to better implement a templatized framework, we implement an ``inheritance-style`` EduStudio that contains basic architecture and inherited architecture with different responsibilities.
- Configuration Reading (Step 1) aims to collect, categorize and deliver configurations from different configuration portals.
- Data Preparation (Step 2) aims to convert raw data from the hard disk into model-friendly data objects.
- Model Implementation (Step 3) refers to the process of implementing the structure of each model and facilitating the reuse of model components.
- Training Control (Step 4) focuses primarily on the training methods of various models.
- Model Evaluation (Step 5) primarily focuses on the implementation of various evaluation metrics.
- Log Storage (Step 6) aims to implement the storage specification when storing generated data.

The **basic architecture emphasizes domain-irrelevant content and strives to build templatized protocols**.
The **inherited architecture obeys the protocol in the basic architecture and focuses on domain-relevant content**.
The inheritance style separates domain-relevant and domain-irrelevant content, greatly simplifying the framework structure and enhancing ``readability``.
The modularization establishes clear boundaries between various programs in the algorithm pipeline, facilitating the introduction of new content to individual modules and enhancing scalability.

The overall structure is illustrated as follows:

@@ -42,14 +45,16 @@ The overall structure is illustrated as follows:

features/global_cfg_obj
features/inheritable_config
features/atomic_files
features/standard_datamodule
features/dataset_folder_protocol
features/atomic_files
features/atomic_operations

.. toctree::
:maxdepth: 1
:caption: User Guide

user_guide/atom_op
user_guide/datasets
user_guide/models
user_guide/reference_table
21 changes: 21 additions & 0 deletions docs/source/user_guide/atom_op.md
@@ -0,0 +1,21 @@
# M2C Atomic Data Operation List


| M2C Atomic operation | M2C Atomic Type | Description |
| :------------------------: | --------------- | ------------------------------------------------------------ |
| M2C_Label2Int | Data Cleaning | Binarization for answering response |
| M2C_FilterRecords4CD | Data Cleaning | Filter students or exercises according to specific conditions |
| M2C_FilteringRecordsByAttr | Data Cleaning | Filter students without attribute values, commonly used by fairness models |
| M2C_ReMapId | Data Conversion | ReMap Column ID |
| M2C_BuildMissingQ | Data Conversion | Build Missing Q-matrix |
| M2C_BuildSeqInterFeats | Data Conversion | Build sample format for Question-based KT |
| M2C_CKCAsExer | Data Conversion | Build sample format for KC-based KT |
| M2C_MergeDividedSplits | Data Conversion | Merge train/valid/test set into one dataframe |
| M2C_RandomDataSplit4CD | Data Partition | Data partitioning for Cognitive Diagnosis |
| M2C_RandomDataSplit4KT | Data Partition | Data partitioning for Knowledge Tracing |
| M2C_GenKCSeq | Data Generation | Generate Knowledge Component Sequence |
| M2C_GenQMat | Data Generation | Generate Q-matrix (i.e., exercise-KC relation) |
| M2C_BuildKCRelation | Data Generation | Build Knowledge Component Relation Graph |
| M2C_GenUnFoldKCSeq | Data Generation | Generate Unfolded Knowledge Component Sequence |
| M2C_FillMissingQ | Data Generation | Fill Missing Q-matrix |
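Each M2C operation above follows a common shape: a `default_cfg` class attribute plus a `process` method over keyword data objects (visible in the `BaseMid2Cache` fragments later in this diff). A hypothetical operation in that shape might look like the following; the class name, config key, and data layout are all invented for illustration.

```python
# Hypothetical M2C atomic operation sketch; M2C_DropShortSeqs and its
# 'min_len' option are invented and do not exist in EduStudio.
class M2C_DropShortSeqs:
    default_cfg = {'min_len': 3}

    def __init__(self, m2c_cfg):
        self.m2c_cfg = m2c_cfg

    def process(self, **kwargs):
        # keep only student response sequences long enough to be informative
        seqs = kwargs['seqs']
        kept = {s: v for s, v in seqs.items()
                if len(v) >= self.m2c_cfg['min_len']}
        kwargs['seqs'] = kept
        return kwargs

op = M2C_DropShortSeqs({'min_len': 3})
out = op.process(seqs={'s0': [1, 0, 1], 's1': [1]})
print(sorted(out['seqs']))  # ['s0']
```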

45 changes: 22 additions & 23 deletions docs/source/user_guide/datasets.md
@@ -1,30 +1,29 @@
# Dataset List

We collect the commonly used datasets and list them here. The meaning of the fields in the table below is as follows:
- Exercise Text: contains textual information of exercises or not
- Concept Relation: contains relations among knowledge concepts or not (tree or prerequisite)
- Time: contains the time at which students start answering questions or not
We showcase here the datasets preprocessed by EduStudio (i.e., those for which a raw2mid atomic data operation is provided). The meaning of the fields in the table below is as follows:

- Auto Download: whether EduStudio supports automatically downloading the `middata` of the dataset
- R2M Script: name of the script that processes the rawdata into middata in EduStudio



| Dataset Name | Exercise Text | Concept Relation | Time | Auto Download | R2M Script Name | Note |
| :----------------------------------------------------------- | :-----------: | :--------------: | :--: | :-----------: | :----------------------- | :----------------------------------------------------------- |
| [FrcSub](http://staff.ustc.edu.cn/~qiliuql/data/math2015.rar) | ✖️ | ✖️ | ✖️ | ✔️ | R2M_FrcSub | |
| [Math1](http://staff.ustc.edu.cn/~qiliuql/data/math2015.rar) | ✖️ | ✖️ | ✖️ | ✔️ | R2M_Math1 | |
| [Math2](http://staff.ustc.edu.cn/~qiliuql/data/math2015.rar) | ✖️ | ✖️ | ✖️ | ✔️ | R2M_Math2 | |
| [AAAI_2023](https://docs.google.com/forms/d/e/1FAIpQLScWjxiXdSMAKBtlPJZm9MsudUG9CQS16lT0GVfajpVj-mWReA/viewform?pli=1) | ✔️ | ✔️(tree) | ✔️ | ✔️ | R2M_AAAI_2023 | [AAAI2023 Global Knowledge Tracing Challenge](https://ai4ed.cc/competitions/aaai2023competition) |
| [ASSISTment_2009-2010](https://drive.google.com/file/d/0B2X0QD6q79ZJUFU1cjYtdGhVNjg/view?resourcekey=0-OyI8ZWxtGSAzhodUIcMf_g) | ✖️ | ✖️ | ✔️ | ✔️ | R2M_ASSIST_0910 | |
| [ASSISTment_2012-2013](https://sites.google.com/site/assistmentsdata/datasets/2012-13-school-data-with-affect) | ✖️ | ✖️ | ✔️ | ✖️ | R2M_ASSIST_1213 | |
| [ASSISTment_2015-2016](https://sites.google.com/site/assistmentsdata/datasets/2015-assistments-skill-builder-data) | ✖️ | ✖️ | ✔️ | ✖️ | R2M_ASSIST_1516 | |
| [ASSISTment_2017](https://sites.google.com/view/assistmentsdatamining/dataset) | ✖️ | ✖️ | ✔️ | ✖️ | R2M_ASSIST_17 | |
| [Algebera_2005-2006](https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp) | ✖️ | ✖️ | ✔️ | ✖️ | R2M_Algebera_0506 | [KDD Cup 2010](https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp) |
| [Algebera_2006-2007](https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp) | ✖️ | ✖️ | ✔️ | ✖️ | R2M_Algebera_0607 | [KDD Cup 2010](https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp) |
| [Bridge2Algebra_2006-2007](https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp) | ✖️ | ✖️ | ✔️ | ✖️ | R2M_Bridge2Algebra_0607 | [KDD Cup 2010](https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp) |
| [Junyi_AreaTopicAsCpt](https://pslcdatashop.web.cmu.edu/Project?id=244) | ✖️ | ✔️(tree) | ✔️ | ✖️ | R2M_Junyi_AreaTopicAsCpt | Area&Topic field as concept |
| [Junyi_ExerAsCpt](https://pslcdatashop.web.cmu.edu/Project?id=244) | ✖️ | ✔️(prerequisite) | ✔️ | ✖️ | R2M_Junyi_ExerAsCpt | Exercise as concept |
| EdNet_KT1 | ✖️ | ✖️ | ✔️ | ✖️ | R2M_EdNet_KT1 | [download1](http://bit.ly/ednet-content), [download2](http://bit.ly/ednet-content) |
| [Eedi_2020_Task1&2](https://dqanonymousdata.blob.core.windows.net/neurips-public/data.zip) | ✖️ | ✔️(tree) | ✔️ | ✖️ | R2M_Eedi_20_T12 | [NeurIPS 2020 Education Challenge: Task1&2](https://eedi.com/projects/neurips-education-challenge) |
| [Eedi_2020_Task3&4](https://dqanonymousdata.blob.core.windows.net/neurips-public/data.zip) | ✔️(images) | ✔️(tree) | ✔️ | ✖️ | R2M_Eedi_20_T34 | [NeurIPS 2020 Education Challenge: Task3&4](https://eedi.com/projects/neurips-education-challenge) |

| Dataset Name | R2M Script Name | Auto Download | Note |
| :----------------------------------------------------------- | :----------------------- | ------------- | :----------------------------------------------------------: |
| [FrcSub](http://staff.ustc.edu.cn/~qiliuql/data/math2015.rar) | R2M_FrcSub | ✔️ | |
| [Math1](http://staff.ustc.edu.cn/~qiliuql/data/math2015.rar) | R2M_Math1 | ✔️ | |
| [Math2](http://staff.ustc.edu.cn/~qiliuql/data/math2015.rar) | R2M_Math2 | ✔️ | |
| [AAAI_2023](https://docs.google.com/forms/d/e/1FAIpQLScWjxiXdSMAKBtlPJZm9MsudUG9CQS16lT0GVfajpVj-mWReA/viewform?pli=1) | R2M_AAAI_2023 | ✔️ | [AAAI2023 Global Knowledge Tracing Challenge](https://ai4ed.cc/competitions/aaai2023competition) |
| [ASSISTment_2009-2010](https://drive.google.com/file/d/0B2X0QD6q79ZJUFU1cjYtdGhVNjg/view?resourcekey=0-OyI8ZWxtGSAzhodUIcMf_g) | R2M_ASSIST_0910 | ✔️ | |
| [ASSISTment_2012-2013](https://sites.google.com/site/assistmentsdata/datasets/2012-13-school-data-with-affect) | R2M_ASSIST_1213 | ✖️ | |
| [ASSISTment_2015-2016](https://sites.google.com/site/assistmentsdata/datasets/2015-assistments-skill-builder-data) | R2M_ASSIST_1516 | ✖️ | |
| [ASSISTment_2017](https://sites.google.com/view/assistmentsdatamining/dataset) | R2M_ASSIST_17 | ✖️ | |
| [Algebera_2005-2006](https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp) | R2M_Algebera_0506 | ✖️ | [KDD Cup 2010](https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp) |
| [Algebera_2006-2007](https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp) | R2M_Algebera_0607 | ✖️ | [KDD Cup 2010](https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp) |
| [Bridge2Algebra_2006-2007](https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp) | R2M_Bridge2Algebra_0607 | ✖️ | [KDD Cup 2010](https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp) |
| [Junyi_AreaTopicAsCpt](https://pslcdatashop.web.cmu.edu/Project?id=244) | R2M_Junyi_AreaTopicAsCpt | ✖️ | Area&Topic field as concept |
| [Junyi_ExerAsCpt](https://pslcdatashop.web.cmu.edu/Project?id=244) | R2M_Junyi_ExerAsCpt | ✖️ | Exercise as concept |
| EdNet_KT1 | R2M_EdNet_KT1 | ✖️ | [download1](http://bit.ly/ednet-content), [download2](http://bit.ly/ednet-content) |
| [Eedi_2020_Task1&2](https://dqanonymousdata.blob.core.windows.net/neurips-public/data.zip) | R2M_Eedi_20_T12 | ✖️ | [NeurIPS 2020 Education Challenge: Task1&2](https://eedi.com/projects/neurips-education-challenge) |
| [Eedi_2020_Task3&4](https://dqanonymousdata.blob.core.windows.net/neurips-public/data.zip) | R2M_Eedi_20_T34 | ✖️ | [NeurIPS 2020 Education Challenge: Task3&4](https://eedi.com/projects/neurips-education-challenge) |
| [SLP-English](https://aic-fe.bnu.edu.cn/en/data/index.html) | R2M_SLP_English | ✔️ | [[paper](https://aic-fe.bnu.edu.cn/fj/2021-ICCE-SLP.pdf)\], Smart Learning Partner |
| [SLP-Math](https://aic-fe.bnu.edu.cn/en/data/index.html) | R2M_SLP_Math | ✔️ | [[paper](https://aic-fe.bnu.edu.cn/fj/2021-ICCE-SLP.pdf)\], Smart Learning Partner |
132 changes: 54 additions & 78 deletions docs/source/user_guide/models.md

Large diffs are not rendered by default.

91 changes: 47 additions & 44 deletions docs/source/user_guide/reference_table.md
@@ -4,52 +4,55 @@

| Model | DataTPL | TrainTPL | EvalTPL |
| :------ | ---------------------: | :-------------: | ------------------------------------------------------ |
| IRT | CDInterDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| MIRT | CDInterDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| NCDM | CDInterExtendsQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL、CognitiveDiagnosisEvalTPL |
| CNCD_Q | CNCDQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| CNCD_F | CNCDFDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| DINA | CDInterExtendsQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL、CognitiveDiagnosisEvalTPL |
| HierCDF | HierCDFDataTPL | EduTrainTPL | BinaryClassificationEvalTPL、CognitiveDiagnosisEvalTPL |
| CDGK | CDGKDataTPL | EduTrainTPL | BinaryClassificationEvalTPL、CognitiveDiagnosisEvalTPL |
| CDMFKC | CDInterExtendsQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| ECD | ECDDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| IRR | IRRDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| KaNCD | CDInterExtendsQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL、CognitiveDiagnosisEvalTPL |
| KSCD | CDInterExtendsQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| MGCD | MGCDDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| RCD | RCDDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| IRT | CDInterDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| MIRT | CDInterDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| MF | CDInterDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| NCDM | CDInterExtendsQDataTPL | GeneralTrainTPL | PredictionEvalTPL, InterpretabilityEvalTPL, IdentifiabilityEvalTPL |
| CNCD_Q | CNCDQDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| CNCD_F | CNCDFDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| DINA | CDInterExtendsQDataTPL | GeneralTrainTPL | PredictionEvalTPL, InterpretabilityEvalTPL, IdentifiabilityEvalTPL |
| HierCDF | HierCDFDataTPL | GeneralTrainTPL | PredictionEvalTPL, InterpretabilityEvalTPL, IdentifiabilityEvalTPL |
| CDGK | CDGKDataTPL | GeneralTrainTPL | PredictionEvalTPL, InterpretabilityEvalTPL, IdentifiabilityEvalTPL |
| CDMFKC | CDInterExtendsQDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| ECD | ECDDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| IRR | IRRDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| KaNCD | CDInterExtendsQDataTPL | GeneralTrainTPL | PredictionEvalTPL, InterpretabilityEvalTPL, IdentifiabilityEvalTPL |
| KSCD | CDInterExtendsQDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| MGCD | MGCDDataTPL | GroupCDTrainTPL | PredictionEvalTPL |
| RCD | RCDDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| DCD | DCDDataTPL | DCDTrainTPL | PredictionEvalTPL, InterpretabilityEvalTPL, IdentifiabilityEvalTPL |
| FairCD | FAIRDataTPL | AdversarialTrainTPL | PredictionEvalTPL, FairnessEvalTPL |

## KT models

| Model | DataTPL | TrainTPL | EvalTPL |
| :----------- | ----------------------: | :-------------: | --------------------------- |
| AKT | KTInterDataTPLCptUnfold | EduTrainTPL | BinaryClassificationEvalTPL |
| ATKT | KTInterDataTPLCptUnfold | AtktTrainTPL | BinaryClassificationEvalTPL |
| CKT | KTInterExtendsQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| CL4KT | CL4KTDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| CT_NCM | KTInterCptUnfoldDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| DeepIRT | KTInterExtendsQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| DIMKT | DIMKTDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| DKT | KTInterDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| DKTDSC | DKTDSCDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| DKTForget | DKTForgetDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| DKT_plus | KTInterDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| DKVMN | KTInterExtendsQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| DTransformer | KTInterCptUnfoldDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| EERNN | EERNNDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| EKT | EKTDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| GKT | GKTDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| HawkesKT | KTInterCptUnfoldDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| IEKT | KTInterExtendsQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| KQN | KTInterDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| LPKT | LPKTDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| LPKT_S | LPKTDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| QDKT | QDKTDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| QIKT | KTInterExtendsQDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| RKT | RKTDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| SAINT | KTInterCptUnfoldDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| SAINT_plus | KTInterCptUnfoldDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| SAKT | KTInterDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| SimpleKT | KTInterCptUnfoldDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| SKVMN | KTInterDataTPL | EduTrainTPL | BinaryClassificationEvalTPL |
| AKT | KTInterCptUnfoldDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| ATKT | KTInterCptUnfoldDataTPL | AtktTrainTPL | PredictionEvalTPL |
| CKT | KTInterExtendsQDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| CL4KT | CL4KTDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| CT_NCM | KTInterCptUnfoldDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| DeepIRT | KTInterExtendsQDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| DIMKT | DIMKTDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| DKT | KTInterDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| DKTDSC | DKTDSCDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| DKTForget | DKTForgetDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| DKT_plus | KTInterDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| DKVMN | KTInterExtendsQDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| DTransformer | KTInterCptUnfoldDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| EERNN | EERNNDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| EKT | EKTDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| GKT | GKTDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| HawkesKT | KTInterCptUnfoldDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| IEKT | KTInterExtendsQDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| KQN | KTInterDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| LPKT | LPKTDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| LPKT_S | LPKTDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| QDKT | QDKTDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| QIKT | KTInterExtendsQDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| RKT | RKTDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| SAINT | KTInterCptUnfoldDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| SAINT_plus | KTInterCptUnfoldDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| SAKT | KTInterDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| SimpleKT | KTInterCptUnfoldDataTPL | GeneralTrainTPL | PredictionEvalTPL |
| SKVMN | KTInterDataTPL | GeneralTrainTPL | PredictionEvalTPL |
36 changes: 25 additions & 11 deletions docs/source/user_guide/usage/aht.md
@@ -15,11 +15,14 @@ Here we list two demos for `Ray.Tune` and `HyperOpt`.
## Ray.Tune

```python
# run following after installed edustudio

from edustudio.quickstart import run_edustudio
from ray import tune
import ray
ray.init(num_cpus=4, num_gpus=1)

from edustudio.utils.common import IDUtil as idUtil
import uuid

def deliver_cfg(args):
g_args = {
@@ -33,6 +36,9 @@ def deliver_cfg(args):
g, k = k.split(".")
assert g in g_args
g_args[g][k] = v
g_args['frame_cfg'] = {
'ID': idUtil.get_random_id_bytime() + str(uuid.uuid4()).split("-")[-1]
}
return g_args


@@ -54,20 +60,21 @@ def objective_function(args):


search_space= {
'traintpl_cfg.cls': tune.grid_search(['EduTrainTPL']),
'traintpl_cfg.cls': tune.grid_search(['GeneralTrainTPL']),
'datatpl_cfg.cls': tune.grid_search(['CDInterExtendsQDataTPL']),
'modeltpl_cfg.cls': tune.grid_search(['KaNCD']),
'evaltpl_cfg.clses': tune.grid_search([['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL']]),
'evaltpl_cfg.clses': tune.grid_search([['PredictionEvalTPL', 'InterpretabilityEvalTPL']]),


'traintpl_cfg.batch_size': tune.grid_search([256,]),
'traintpl_cfg.epoch_num': tune.grid_search([2]),
'traintpl_cfg.device': tune.grid_search(["cpu"]),
'modeltpl_cfg.emb_dim': tune.grid_search([20,40])
'traintpl_cfg.device': tune.grid_search(["cuda:0"]),
'modeltpl_cfg.emb_dim': tune.grid_search([20,40]),
'frame_cfg.DISABLE_LOG_STDOUT': tune.grid_search([False]),
}

tuner = tune.Tuner(
objective_function, param_space=search_space, tune_config=tune.TuneConfig(max_concurrent_trials=1)
tune.with_resources(objective_function, {"gpu": 1}), param_space=search_space, tune_config=tune.TuneConfig(max_concurrent_trials=1),
)
results = tuner.fit()

@@ -78,23 +85,29 @@ print(results.get_best_result(metric="auc", mode="max").config)
## HyperOpt

```python
import sys
import os

from edustudio.quickstart import run_edustudio
from hyperopt import hp
from hyperopt import fmin, tpe, space_eval

from edustudio.utils.common import IDUtil as idUtil
import uuid

def deliver_cfg(args):
g_args = {
'traintpl_cfg': {},
'datatpl_cfg': {},
'modeltpl_cfg': {},
'evaltpl_cfg': {},
'frame_cfg': {},
}
for k,v in args.items():
g, k = k.split(".")
assert g in g_args
g_args[g][k] = v
g_args['frame_cfg'] = {
'ID': idUtil.get_random_id_bytime() + str(uuid.uuid4()).split("-")[-1]
}
return g_args


@@ -109,16 +122,16 @@ def objective_function(args):
modeltpl_cfg_dict=g_args['modeltpl_cfg'],
evaltpl_cfg_dict=g_args['evaltpl_cfg'],
frame_cfg_dict=g_args['frame_cfg'],
return_cfg_and_result=True
return_cfg_and_result=True,
)
return res['auc']


space = {
'traintpl_cfg.cls': hp.choice('traintpl_cfg.cls', ['EduTrainTPL']),
'traintpl_cfg.cls': hp.choice('traintpl_cfg.cls', ['GeneralTrainTPL']),
'datatpl_cfg.cls': hp.choice('datapl_cfg.cls', ['CDInterExtendsQDataTPL']),
'modeltpl_cfg.cls': hp.choice('modeltpl_cfg.cls', ['KaNCD']),
'evaltpl_cfg.clses': hp.choice('evaltpl_cfg.clses', [['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL']]),
'evaltpl_cfg.clses': hp.choice('evaltpl_cfg.clses', [['PredictionEvalTPL', 'InterpretabilityEvalTPL']]),


'traintpl_cfg.batch_size': hp.choice('traintpl_cfg.batch_size', [256,]),
@@ -131,4 +144,5 @@ best = fmin(objective_function, space, algo=tpe.suggest, max_evals=10, verbose=F
print("=="*10)
print(best)
print(space_eval(space, best))

```
8 changes: 4 additions & 4 deletions docs/source/user_guide/usage/run_edustudio.md
@@ -11,7 +11,7 @@ run_edustudio(
dataset='FrcSub',
cfg_file_name=None,
traintpl_cfg_dict={
'cls': 'EduTrainTPL',
'cls': 'GeneralTrainTPL',
},
datatpl_cfg_dict={
'cls': 'CDInterExtendsQDataTPL'
@@ -20,7 +20,7 @@ run_edustudio(
'cls': 'KaNCD',
},
evaltpl_cfg_dict={
'clses': ['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL'],
'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
}
)
```
@@ -48,14 +48,14 @@ datatpl_cfg:
cls: CDInterDataTPL

traintpl_cfg:
cls: EduTrainTPL
cls: GeneralTrainTPL
batch_size: 512

modeltpl_cfg:
cls: NCDM

evaltpl_cfg:
clses: [BinaryClassificationEvalTPL, CognitiveDiagnosisEvalTPL]
clses: [PredictionEvalTPL, InterpretabilityEvalTPL]
```
then, run command:
14 changes: 7 additions & 7 deletions docs/source/user_guide/usage/use_case_of_config.md
@@ -22,7 +22,7 @@ run_edustudio(
dataset='FrcSub',
cfg_file_name=None,
traintpl_cfg_dict={
'cls': 'EduTrainTPL',
'cls': 'GeneralTrainTPL',
},
datatpl_cfg_dict={
'cls': 'CDInterExtendsQDataTPL',
@@ -34,15 +34,15 @@ run_edustudio(
'cls': 'KaNCD',
},
evaltpl_cfg_dict={
'clses': ['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL'],
'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
}
)
```

## Q2: How to specify the config of the evaluation template
The default_cfg of `BinaryClassificationEvalTPL` is as follows:
The default_cfg of `PredictionEvalTPL` is as follows:
```python
class BinaryClassificationEvalTPL(BaseEvalTPL):
class PredictionEvalTPL(BaseEvalTPL):
default_cfg = {
'use_metrics': ['auc', 'acc', 'rmse']
}
@@ -58,7 +58,7 @@ run_edustudio(
dataset='FrcSub',
cfg_file_name=None,
traintpl_cfg_dict={
'cls': 'EduTrainTPL',
'cls': 'GeneralTrainTPL',
},
datatpl_cfg_dict={
'cls': 'CDInterExtendsQDataTPL',
@@ -70,8 +70,8 @@ run_edustudio(
'cls': 'KaNCD',
},
evaltpl_cfg_dict={
'clses': ['BinaryClassificationEvalTPL', 'CognitiveDiagnosisEvalTPL'],
'CognitiveDiagnosisEvalTPL': {
'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
'InterpretabilityEvalTPL': {
'use_metrics': {"auc"} # look here
}
}
2 changes: 1 addition & 1 deletion edustudio/__init__.py
@@ -2,4 +2,4 @@
from __future__ import print_function
from __future__ import division

__version__ = 'v1.0.0-beta2.1'
__version__ = 'v1.1.4'
15 changes: 9 additions & 6 deletions edustudio/assets/datasets.yaml
@@ -1,12 +1,15 @@
# 1. all datasets are stored in https://huggingface.co/datasets/lmcRS/edustudio-datasets
# 2. some datasets may not be listed here but can still be downloaded, as edustudio will look them up from an external yaml file: https://huggingface.co/datasets/lmcRS/edustudio-datasets/raw/main/datasets.yaml

ASSIST_0910:
middata_url: https://gitlab.com/hfut-lec/edudatafiles/-/raw/main/ASSIST_0910/ASSIST_0910-middata.zip
middata_url: https://huggingface.co/datasets/lmcRS/edustudio-datasets/resolve/main/ASSIST_0910/ASSIST_0910-middata.zip
FrcSub:
middata_url: https://gitlab.com/hfut-lec/edudatafiles/-/raw/main/FrcSub/FrcSub-middata.zip
middata_url: https://huggingface.co/datasets/lmcRS/edustudio-datasets/resolve/main/FrcSub/FrcSub-middata.zip
Math1:
middata_url: https://gitlab.com/hfut-lec/edudatafiles/-/raw/main/Math1/Math1-middata.zip
middata_url: https://huggingface.co/datasets/lmcRS/edustudio-datasets/resolve/main/Math1/Math1-middata.zip
Math2:
middata_url: https://gitlab.com/hfut-lec/edudatafiles/-/raw/main/Math2/Math2-middata.zip
middata_url: https://huggingface.co/datasets/lmcRS/edustudio-datasets/resolve/main/Math2/Math2-middata.zip
AAAI_2023:
middata_url: https://gitlab.com/hfut-lec/edudatafiles/-/raw/main/AAAI_2023/AAAI_2023-middata.zip
middata_url: https://huggingface.co/datasets/lmcRS/edustudio-datasets/resolve/main/AAAI_2023/AAAI_2023-middata.zip
PISA_2015_ECD:
middata_url: https://gitlab.com/hfut-lec/edudatafiles/-/raw/main/PISA_2015_ECD/PISA_2015_ECD-middata.zip
middata_url: https://huggingface.co/datasets/lmcRS/edustudio-datasets/resolve/main/PISA_2015_ECD/PISA_2015_ECD-middata.zip
2 changes: 0 additions & 2 deletions edustudio/atom_op/mid2cache/CD/data_split4cd.py
@@ -104,5 +104,3 @@ def set_dt_info(self, dt_info, **kwargs):
dt_info['cpt_count'] = max(dt_info.get('cpt_count', -1), df[col].max() + 1)
else:
dt_info['cpt_count'] = max(dt_info.get('cpt_count', -1), np.max(list(chain(*df[col].to_list()))) + 1)

a = 1
7 changes: 4 additions & 3 deletions edustudio/atom_op/mid2cache/KT/__init__.py
@@ -1,4 +1,5 @@
from .build_seq_inter_feats import M2C_BuildSeqInterFeats
from .cpt_as_exer import M2C_CptAsExer
from .gen_cpt_seq import M2C_GenCptSeq
from .gen_unfold_cpt_seq import M2C_GenUnFoldCptSeq
from .cpt_as_exer import M2C_KCAsExer
from .gen_cpt_seq import M2C_GenKCSeq
from .gen_unfold_cpt_seq import M2C_GenUnFoldKCSeq
from .data_split4kt import M2C_RandomDataSplit4KT
116 changes: 14 additions & 102 deletions edustudio/atom_op/mid2cache/KT/build_seq_inter_feats.py
@@ -7,13 +7,10 @@

class M2C_BuildSeqInterFeats(BaseMid2Cache):
default_cfg = {
'seed': 2023,
'divide_by': 'stu',
'window_size': 100,
"divide_scale_list": [7,1,2],
"extra_inter_feats": []
}

def __init__(self, m2c_cfg, n_folds, is_dataset_divided) -> None:
super().__init__(m2c_cfg)
self.n_folds = n_folds
@@ -25,11 +22,7 @@ def from_cfg(cls, cfg):
n_folds = cfg.datatpl_cfg.n_folds
is_dataset_divided = cfg.datatpl_cfg.is_dataset_divided
return cls(m2c_cfg, n_folds, is_dataset_divided)

def _check_params(self):
super()._check_params()
assert self.m2c_cfg['divide_by'] in {'stu', 'time'}


def process(self, **kwargs):
df = kwargs['df']
df_train, df_valid, df_test = kwargs['df_train'], kwargs['df_valid'], kwargs['df_test']
@@ -40,96 +33,36 @@ def process(self, **kwargs):

if not self.is_dataset_divided:
assert df_train is None and df_valid is None and df_test is None
if self.m2c_cfg['divide_by'] == 'stu':
if self.n_folds == 1:
train_dict, valid_dict, test_dict = self._divide_data_df_by_stu_one_fold(df)
kwargs['df_train_folds'] = [train_dict]
kwargs['df_valid_folds'] = [valid_dict]
kwargs['df_test_folds'] = [test_dict]
else:
kwargs['df_train_folds'], kwargs['df_valid_folds'], kwargs['df_test_folds'] = self._divide_data_df_by_stu_multi_fold(df)
elif self.m2c_cfg['divide_by'] == 'time':
raise NotImplementedError
else:
raise ValueError(f"unknown divide_by: {self.m2c_cfg['divide_by']}")
self.window_size = self.m2c_cfg['window_size']
if self.m2c_cfg['window_size'] is None or self.m2c_cfg['window_size'] <= 0:
self.window_size = df[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max()
self.logger.info(f"actual window size: {self.window_size}")
kwargs['df_seq'] = self.construct_df2dict(df)

else: # dataset is divided
assert df_train is not None and df_test is not None
if self.m2c_cfg['window_size'] is None or self.m2c_cfg['window_size'] <= 0:
self.window_size = np.max([
df_train[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max(),
df_valid[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max() if df_valid is not None else 0,
df_valid[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max()
df_test[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max()
])
self.logger.info(f"actual window size: {self.window_size}")
else:
self.window_size = self.m2c_cfg['window_size']
self.logger.info(f"actual window size: {self.window_size}")

train_dict = self.construct_df2dict(df_train)
valid_dict = self.construct_df2dict(df_valid)
test_dict = self.construct_df2dict(df_test)
kwargs['df_train_folds'] = [train_dict]
kwargs['df_valid_folds'] = [valid_dict]
kwargs['df_test_folds'] = [test_dict]
kwargs['df_train_seq'] = train_dict
kwargs['df_valid_seq'] = valid_dict
kwargs['df_test_seq'] = test_dict
return kwargs

@staticmethod
def sort_records(df, col='order_id:token'):
if df is not None:
return df.sort_values(by=col, ascending=True).reset_index(drop=True)

def _divide_data_df_by_stu_one_fold(self, df: pd.DataFrame):
train_stu_id, val_stu_id, test_stu_id = SpliterUtil.divide_data_df_one_fold(
df['stu_id:token'].drop_duplicates(), seed=self.m2c_cfg['seed'], shuffle=True,
divide_scale_list=self.m2c_cfg['divide_scale_list']
)
train_df = df[df['stu_id:token'].isin(train_stu_id)]
val_df = df[df['stu_id:token'].isin(val_stu_id)] if val_stu_id is not None else None
test_df = df[df['stu_id:token'].isin(test_stu_id)]

if self.m2c_cfg['window_size'] <= 0 or self.m2c_cfg['window_size'] is None:
self.window_size = np.max([
train_df[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max(),
val_df[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max() if val_df is not None else 0,
test_df[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max()
])
self.logger.info(f"actual window size: {self.window_size}")
else:
self.window_size = self.m2c_cfg['window_size']

train_dict = self.construct_df2dict(train_df)
val_dict = self.construct_df2dict(val_df)
test_dict = self.construct_df2dict(test_df)
return train_dict, val_dict, test_dict

def _divide_data_df_by_stu_multi_fold(self, df: pd.DataFrame):
res = SpliterUtil.divide_data_df_one_fold(
df['stu_id:token'].drop_duplicates(), seed=self.m2c_cfg['seed'], shuffle=True,
divide_scale_list=self.m2c_cfg['divide_scale_list']
)

train_list, valid_list, test_list = [], [], []
for train_stu_id, val_stu_id, test_stu_id in zip(res):
train_df = df[df['stu_id:token'].isin(train_stu_id)]
val_df = df[df['stu_id:token'].isin(val_stu_id)] if val_stu_id is not None else None
test_df = df[df['stu_id:token'].isin(test_stu_id)]

if self.m2c_cfg['window_size'] <= 0 or self.m2c_cfg['window_size'] is None:
self.window_size = np.max([
train_df[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max(),
val_df[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max() if val_df is not None else 0,
test_df[['stu_id:token', 'exer_id:token']].groupby('stu_id:token').agg('count')['exer_id:token'].max()
])
self.logger.info(f"actual window size: {self.window_size}")
else:
self.window_size = self.m2c_cfg['window_size']

train_dict = self.construct_df2dict(train_df)
valid_dict = self.construct_df2dict(val_df)
test_dict = self.construct_df2dict(test_df)
train_list.append(train_dict)
valid_list.append(valid_dict)
test_list.append(test_dict)

return train_list, valid_list, test_list

def construct_df2dict(self, df: pd.DataFrame):
if df is None: return None
@@ -170,24 +103,3 @@ def construct_df2dict(self, df: pd.DataFrame):
raise NotImplementedError

return ret_dict

def set_dt_info(self, dt_info, **kwargs):
dt_info['real_window_size'] = self.window_size
if not self.is_dataset_divided:
if 'stu_id:token' in kwargs['df'].columns:
dt_info['stu_count'] = int(kwargs['df']['stu_id:token'].max() + 1)
if 'exer_id:token' in kwargs['df'].columns:
dt_info['exer_count'] = int(kwargs['df']['exer_id:token'].max() + 1)
else:
stu_count = max(kwargs['df_train']['stu_id:token'].max() + 1, kwargs['df_test']['stu_id:token'].max() + 1)
stu_count = max(kwargs['df_valid']['stu_id:token'].max() + 1, stu_count) if 'df_valid' in kwargs else stu_count

exer_count = max(kwargs['df_train']['exer_id:token'].max() + 1, kwargs['df_test']['exer_id:token'].max() + 1)
exer_count = max(kwargs['df_valid']['exer_id:token'].max() + 1, exer_count) if 'df_valid' in kwargs else exer_count

dt_info['stu_count'] = stu_count
dt_info['exer_count'] = exer_count

if kwargs.get('df_exer', None) is not None:
if 'cpt_seq:token_seq' in kwargs['df_exer']:
dt_info['cpt_count'] = len(set(list(chain(*kwargs['df_exer']['cpt_seq:token_seq'].to_list()))))
4 changes: 3 additions & 1 deletion edustudio/atom_op/mid2cache/KT/cpt_as_exer.py
@@ -3,7 +3,9 @@
from itertools import chain


class M2C_CptAsExer(BaseMid2Cache):
class M2C_KCAsExer(BaseMid2Cache):
"""Knowledge Concept As Exercise
"""
default_cfg = {}

def process(self, **kwargs):
112 changes: 112 additions & 0 deletions edustudio/atom_op/mid2cache/KT/data_split4kt.py
@@ -0,0 +1,112 @@
from ..common.base_mid2cache import BaseMid2Cache
import pandas as pd
import numpy as np
from edustudio.datatpl.utils import SpliterUtil, PadSeqUtil
from itertools import chain


class M2C_RandomDataSplit4KT(BaseMid2Cache):
default_cfg = {
'seed': 2023,
'divide_by': 'stu',
"divide_scale_list": [7,1,2],
}

def __init__(self, m2c_cfg, n_folds, is_dataset_divided) -> None:
super().__init__(m2c_cfg)
self.n_folds = n_folds
self.is_dataset_divided = is_dataset_divided

@classmethod
def from_cfg(cls, cfg):
m2c_cfg = cfg.datatpl_cfg.get(cls.__name__)
n_folds = cfg.datatpl_cfg.n_folds
is_dataset_divided = cfg.datatpl_cfg.is_dataset_divided
return cls(m2c_cfg, n_folds, is_dataset_divided)

def _check_params(self):
super()._check_params()
assert self.m2c_cfg['divide_by'] in {'stu', 'time'}

def process(self, **kwargs):
df_seq = kwargs['df_seq']
df_train_seq = kwargs.get('df_train_seq', None)
df_valid_seq = kwargs.get('df_valid_seq', None)
df_test_seq = kwargs.get('df_test_seq', None)

if not self.is_dataset_divided:
assert df_train_seq is None and df_valid_seq is None and df_test_seq is None
self.window_size = df_seq['exer_seq:token_seq'].shape[1]
if self.m2c_cfg['divide_by'] == 'stu':
if self.n_folds == 1:
train_dict, valid_dict, test_dict = self._divide_data_df_by_stu_one_fold(df_seq)
kwargs['df_train_folds'] = [train_dict]
kwargs['df_valid_folds'] = [valid_dict]
kwargs['df_test_folds'] = [test_dict]
else:
kwargs['df_train_folds'], kwargs['df_valid_folds'], kwargs['df_test_folds'] = self._divide_data_df_by_stu_multi_fold(df_seq)
elif self.m2c_cfg['divide_by'] == 'time':
raise NotImplementedError
else:
raise ValueError(f"unknown divide_by: {self.m2c_cfg['divide_by']}")
else:
assert df_train_seq is not None and df_test_seq is not None
self.window_size = df_train_seq['exer_seq:token_seq'].shape[1]
kwargs['df_train_folds'] = [df_train_seq]
kwargs['df_valid_folds'] = [df_valid_seq]
kwargs['df_test_folds'] = [df_test_seq]
return kwargs

def _dict_index_flag(self, df_seq:dict, flag: np.array):
return {
k: df_seq[k][flag] for k in df_seq
}

def _divide_data_df_by_stu_one_fold(self, df_seq: dict):
train_stu_id, valid_stu_id, test_stu_id = SpliterUtil.divide_data_df_one_fold(
pd.DataFrame({"stu_id:token": np.unique(df_seq['stu_id:token'])}), seed=self.m2c_cfg['seed'], shuffle=True,
divide_scale_list=self.m2c_cfg['divide_scale_list']
)

df_train_seq = self._dict_index_flag(df_seq, np.isin(df_seq['stu_id:token'], train_stu_id.to_numpy().flatten()))
df_test_seq = self._dict_index_flag(df_seq, np.isin(df_seq['stu_id:token'], test_stu_id.to_numpy().flatten()))
df_valid_seq = None
if valid_stu_id is not None:
df_valid_seq = self._dict_index_flag(df_seq, np.isin(df_seq['stu_id:token'], valid_stu_id.to_numpy().flatten()))

return df_train_seq, df_valid_seq, df_test_seq

def _divide_data_df_by_stu_multi_fold(self, df_seq: pd.DataFrame):
res = SpliterUtil.divide_data_df_multi_folds(
pd.DataFrame({"stu_id:token": np.unique(df_seq['stu_id:token'])}), seed=self.m2c_cfg['seed'], shuffle=True, n_folds=self.n_folds
)

train_list, test_list = [], []
for (train_stu_id, test_stu_id) in zip(*res):
df_train_seq = self._dict_index_flag(df_seq, np.isin(df_seq['stu_id:token'], train_stu_id.to_numpy().flatten()))
df_test_seq = self._dict_index_flag(df_seq, np.isin(df_seq['stu_id:token'], test_stu_id.to_numpy().flatten()))
train_list.append(df_train_seq)
test_list.append(df_test_seq)

return train_list, [], test_list

def set_dt_info(self, dt_info, **kwargs):
dt_info['real_window_size'] = self.window_size
if not self.is_dataset_divided:
if 'stu_id:token' in kwargs['df'].columns:
dt_info['stu_count'] = int(kwargs['df']['stu_id:token'].max() + 1)
if 'exer_id:token' in kwargs['df'].columns:
dt_info['exer_count'] = int(kwargs['df']['exer_id:token'].max() + 1)
else:
stu_count = max(kwargs['df_train']['stu_id:token'].max() + 1, kwargs['df_test']['stu_id:token'].max() + 1)
stu_count = max(kwargs['df_valid']['stu_id:token'].max() + 1, stu_count) if 'df_valid' in kwargs else stu_count

exer_count = max(kwargs['df_train']['exer_id:token'].max() + 1, kwargs['df_test']['exer_id:token'].max() + 1)
exer_count = max(kwargs['df_valid']['exer_id:token'].max() + 1, exer_count) if 'df_valid' in kwargs else exer_count

dt_info['stu_count'] = stu_count
dt_info['exer_count'] = exer_count

if kwargs.get('df_exer', None) is not None:
if 'cpt_seq:token_seq' in kwargs['df_exer']:
dt_info['cpt_count'] = len(set(list(chain(*kwargs['df_exer']['cpt_seq:token_seq'].to_list()))))
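The student-level split that `M2C_RandomDataSplit4KT` delegates to `SpliterUtil` amounts to shuffling the unique student ids and cutting them by the `[7, 1, 2]` scale. The helper below is an illustrative sketch under that assumption, not `SpliterUtil`'s real signature:

```python
import numpy as np

def split_students(stu_ids, scale=(7, 1, 2), seed=2023):
    """Shuffle unique student ids and cut them by the given ratio."""
    rng = np.random.default_rng(seed)
    ids = np.unique(np.asarray(stu_ids))
    rng.shuffle(ids)
    total = sum(scale)
    n_train = len(ids) * scale[0] // total
    n_valid = len(ids) * scale[1] // total
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]

train_ids, valid_ids, test_ids = split_students(np.arange(100))
```

Splitting ids rather than rows keeps every student's full interaction sequence inside a single partition, which is what the `divide_by: 'stu'` option guarantees.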
4 changes: 3 additions & 1 deletion edustudio/atom_op/mid2cache/KT/gen_cpt_seq.py
@@ -3,7 +3,9 @@
from edustudio.datatpl.utils import PadSeqUtil


class M2C_GenCptSeq(BaseMid2Cache):
class M2C_GenKCSeq(BaseMid2Cache):
"""Generate Knowledge Concept Sequence
"""
default_cfg = {
'cpt_seq_window_size': -1,
}
2 changes: 1 addition & 1 deletion edustudio/atom_op/mid2cache/KT/gen_unfold_cpt_seq.py
@@ -4,7 +4,7 @@
import pandas as pd


class M2C_GenUnFoldCptSeq(BaseMid2Cache):
class M2C_GenUnFoldKCSeq(BaseMid2Cache):
default_cfg = {}

def __init__(self, m2c_cfg, n_folds, is_dataset_divided) -> None:
5 changes: 4 additions & 1 deletion edustudio/atom_op/mid2cache/common/__init__.py
@@ -3,4 +3,7 @@
from .label2int import M2C_Label2Int
from .merge_divided_splits import M2C_MergeDividedSplits
from .remapid import M2C_ReMapId
from .build_cpt_relation import M2C_BuildCptRelation
from .build_cpt_relation import M2C_BuildKCRelation
from .build_missing_Q import M2C_BuildMissingQ
from .fill_missing_Q import M2C_FillMissingQ
from .filtering_records_by_attr import M2C_FilteringRecordsByAttr
2 changes: 1 addition & 1 deletion edustudio/atom_op/mid2cache/common/build_cpt_relation.py
@@ -4,7 +4,7 @@
from itertools import chain


class M2C_BuildCptRelation(BaseMid2Cache):
class M2C_BuildKCRelation(BaseMid2Cache):
default_cfg = {
'relation_type': 'rcd_transition',
'threshold': None
10 changes: 0 additions & 10 deletions edustudio/atom_op/mid2cache/common/build_dtinfo.py

This file was deleted.

68 changes: 68 additions & 0 deletions edustudio/atom_op/mid2cache/common/build_missing_Q.py
@@ -0,0 +1,68 @@
from .base_mid2cache import BaseMid2Cache
import numpy as np
import pandas as pd
from itertools import chain
import torch
from edustudio.utils.common import set_same_seeds


class M2C_BuildMissingQ(BaseMid2Cache):
default_cfg = {
'seed': 20230518,
'Q_delete_ratio': 0.0,
}

def process(self, **kwargs):
dt_info = kwargs['dt_info']
self.item_count = dt_info['exer_count']
self.cpt_count = dt_info['cpt_count']
self.df_Q = kwargs['df_exer'][['exer_id:token', 'cpt_seq:token_seq']]

self.missing_df_Q = self.get_missing_df_Q()
self.missing_Q_mat = self.get_Q_mat_from_df_arr(self.missing_df_Q, self.item_count, self.cpt_count)

kwargs['missing_df_Q'] = self.missing_df_Q
kwargs['missing_Q_mat'] = self.missing_Q_mat

return kwargs

def get_missing_df_Q(self):
set_same_seeds(seed=self.m2c_cfg['seed'])
ratio = self.m2c_cfg['Q_delete_ratio']
iid2cptlist = self.df_Q.set_index('exer_id:token')['cpt_seq:token_seq'].to_dict()
iid_lis = np.array(list(chain(*[[i]*len(iid2cptlist[i]) for i in iid2cptlist])))
cpt_lis = np.array(list(chain(*list(iid2cptlist.values()))))
entry_arr = np.vstack([iid_lis, cpt_lis]).T

np.random.shuffle(entry_arr)

# reference: https://stackoverflow.com/questions/64834655/python-how-to-find-first-duplicated-items-in-an-numpy-array
_, idx = np.unique(entry_arr[:, 1], return_index=True) # first pick one exercise per knowledge concept
bool_idx = np.zeros_like(entry_arr[:, 1], dtype=bool)
bool_idx[idx] = True
preserved_exers = np.unique(entry_arr[bool_idx, 0]) # keep the qualifying exercises

delete_num = int(ratio * self.item_count)
preserved_num = self.item_count - delete_num

if len(preserved_exers) >= preserved_num:
self.logger.warning(
f"Cannot satisfy delete requirement: {len(preserved_exers)=}, {preserved_num=}"
)
else:
need_preserved_num = preserved_num - len(preserved_exers)

left_iids = np.arange(self.item_count)
left_iids = left_iids[~np.isin(left_iids, preserved_exers)]
np.random.shuffle(left_iids)
choose_iids = left_iids[0:need_preserved_num]

preserved_exers = np.hstack([preserved_exers, choose_iids])

return self.df_Q.copy()[self.df_Q['exer_id:token'].isin(preserved_exers)].reset_index(drop=True)


def get_Q_mat_from_df_arr(self, df_Q_arr, item_count, cpt_count):
Q_mat = torch.zeros((item_count, cpt_count), dtype=torch.int64)
for _, item in df_Q_arr.iterrows(): Q_mat[item['exer_id:token'], item['cpt_seq:token_seq']] = 1
return Q_mat
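The "preserve one exercise per knowledge concept" step above hinges on `np.unique(..., return_index=True)` returning the index of the first occurrence of each value in the shuffled array. A standalone reproduction of that trick on a small `(exer_id, cpt_id)` table:

```python
import numpy as np

# Shuffled (exer_id, cpt_id) rows, standing in for entry_arr above.
entry_arr = np.array([[3, 0], [1, 1], [4, 0], [2, 1], [5, 2]])

# Index of the first row seen for each concept id.
_, idx = np.unique(entry_arr[:, 1], return_index=True)
bool_idx = np.zeros(len(entry_arr), dtype=bool)
bool_idx[idx] = True

# Exercises covering those first-seen rows are guaranteed to be preserved,
# so every concept keeps at least one exercise in the reduced Q-matrix.
preserved_exers = np.unique(entry_arr[bool_idx, 0])
```

Because `entry_arr` is shuffled beforehand, which exercise represents each concept varies with the seed, while coverage of all concepts is always retained.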
112 changes: 112 additions & 0 deletions edustudio/atom_op/mid2cache/common/fill_missing_Q.py
@@ -0,0 +1,112 @@
from .base_mid2cache import BaseMid2Cache
import numpy as np
import pandas as pd
from itertools import chain
import torch
from edustudio.utils.common import set_same_seeds, tensor2npy
from tqdm import tqdm

class M2C_FillMissingQ(BaseMid2Cache):
default_cfg = {
'Q_fill_type': "None",
'params_topk': 5,
'params_votek': 2,
}

def __init__(self, m2c_cfg, cfg) -> None:
self.logger = cfg.logger
self.m2c_cfg = m2c_cfg
self.cfg = cfg

@classmethod
def from_cfg(cls, cfg):
return cls(cfg.datatpl_cfg.get(cls.__name__), cfg)

def process(self, **kwargs):
dt_info = kwargs['dt_info']
self.user_count = dt_info['stu_count']
self.item_count = dt_info['exer_count']
self.cpt_count = dt_info['cpt_count']
self.df_Q = kwargs['df_exer'][['exer_id:token', 'cpt_seq:token_seq']]

Q_mat = kwargs['Q_mat']
missing_Q_mat = kwargs['missing_Q_mat']

self.filling_Q_mat_list = []
for df_train in kwargs['df_train_folds']:
if (missing_Q_mat.sum(dim=1) == 0).sum() > 0:
if self.m2c_cfg['Q_fill_type'] == "sim_dist_for_by_exer":
fill_df_Q = self.fill_df_Q_by_sim_dist(
df_train, kwargs['missing_df_Q'],
params_topk=self.m2c_cfg['params_topk'],
params_votek=self.m2c_cfg['params_votek']
)
fill_Q_mat = self.get_Q_mat_from_df_arr(fill_df_Q, self.item_count, self.cpt_count)
self.filling_Q_mat_list.append(fill_Q_mat)
elif self.m2c_cfg['Q_fill_type'] == "None":
self.filling_Q_mat_list.append(missing_Q_mat)
else:
raise ValueError(f"unknown Q_fill_type: {self.m2c_cfg['Q_fill_type']}")
else:
self.filling_Q_mat_list.append(Q_mat)

kwargs['filling_Q_mat_list'] = self.filling_Q_mat_list
return kwargs

def get_Q_mat_from_df_arr(self, df_Q_arr, item_count, cpt_count):
Q_mat = np.zeros((item_count, cpt_count), dtype=np.int64)
for _, item in df_Q_arr.iterrows(): Q_mat[item['exer_id:token'], item['cpt_seq:token_seq']] = 1
return Q_mat

def fill_df_Q_by_sim_dist(self, df_interaction, df_Q_left, params_topk=5, params_votek=2):
preserved_exers = df_Q_left['exer_id:token'].to_numpy()
interact_mat = torch.zeros((self.user_count, self.item_count), dtype=torch.int8).to(self.cfg.traintpl_cfg['device'])
idx = df_interaction[df_interaction['label:float'] == 1][['stu_id:token','exer_id:token']].to_numpy()
interact_mat[idx[:,0], idx[:,1]] = 1
idx = df_interaction[df_interaction['label:float'] != 1][['stu_id:token','exer_id:token']].to_numpy()
interact_mat[idx[:,0], idx[:,1]] = -1

interact_mat = interact_mat.T

sim_mat = torch.zeros((self.item_count, self.item_count))
missing_iids = np.array(list(set(np.arange(self.item_count)) - set(preserved_exers)))
for iid in tqdm(missing_iids, desc="[FILL_Q_MAT] compute sim_mat", ncols=self.cfg.frame_cfg['TQDM_NCOLS']):
temp = interact_mat[iid] != 0
same_mat = interact_mat[iid] == interact_mat
bool_mat = (temp) & (interact_mat != 0)
same_mat[~bool_mat] = False
sim_mat[iid] = same_mat.sum(dim=1) / (temp).sum()
sim_mat[iid, bool_mat.sum(dim=1) == 0] = 0.0
sim_mat[iid, iid] = -1.0
sim_mat[iid, missing_iids] = -1.0

assert torch.isnan(sim_mat).sum() == 0

_, topk_mat_idx = torch.topk(sim_mat, dim=1, k=params_topk, largest=True, sorted=True)
topk_mat_idx = tensor2npy(topk_mat_idx)

index_df_Q = df_Q_left.set_index('exer_id:token')
missing_iid_fill_cpts = {}
for iid in tqdm(missing_iids, desc="[FILL_Q_MAT] fill process", ncols=self.cfg.frame_cfg['TQDM_NCOLS']):
count_dict = dict(zip(*np.unique(
list(chain(*[index_df_Q.loc[iid2]['cpt_seq:token_seq'] for iid2 in topk_mat_idx[iid] if iid2 in preserved_exers])),
return_counts=True,
)))
count_dict = sorted(count_dict.items(), key=lambda x: x[1], reverse=True)
missing_iid_fill_cpts[iid] = [i[0] for i in count_dict[0:params_votek]]

missing_fill_df_Q = pd.DataFrame(
{'exer_id:token': list(missing_iid_fill_cpts.keys()),'cpt_seq:token_seq':list(missing_iid_fill_cpts.values())}
)
final_df_Q = pd.concat([df_Q_left, missing_fill_df_Q], axis=0, ignore_index=True)

hit_ratio = 0
t_Q = self.df_Q.set_index('exer_id:token')
for iid in missing_iid_fill_cpts:
if len(set(t_Q.loc[iid]['cpt_seq:token_seq']) & set(missing_iid_fill_cpts[iid])) > 0:
hit_ratio += 1
hit_ratio = hit_ratio / len(missing_iid_fill_cpts)

self.logger.info(f"[FILL_Q] Hit_ratio={hit_ratio}")

return final_df_Q
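The neighbour-voting at the end of `fill_df_Q_by_sim_dist` (collect the concepts of the top-k most similar exercises, keep the `params_votek` most frequent) can be condensed as below. `vote_concepts` is an illustrative helper, not part of the class's API:

```python
import numpy as np
from itertools import chain

def vote_concepts(neighbour_cpt_lists, votek=2):
    """Return the `votek` most frequent concepts among the neighbours' lists."""
    cpts, counts = np.unique(list(chain(*neighbour_cpt_lists)), return_counts=True)
    order = np.argsort(-counts, kind="stable")  # most frequent first
    return [int(c) for c in cpts[order][:votek]]

# Concept lists of 4 hypothetical nearest-neighbour exercises.
filled = vote_concepts([[0, 1], [1, 2], [1, 3], [2]])
```

Ties are broken toward the smaller concept id here (stable sort over ascending `np.unique` output); the class's `sorted(..., reverse=True)` call leaves tie order unspecified.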
30 changes: 30 additions & 0 deletions edustudio/atom_op/mid2cache/common/filtering_records_by_attr.py
@@ -0,0 +1,30 @@
from .base_mid2cache import BaseMid2Cache
import pandas as pd
import numpy as np
from itertools import chain


class M2C_FilteringRecordsByAttr(BaseMid2Cache):
"""Commonly used by fairness-aware models: filters out students that lack the given attribute values
"""
default_cfg = {
'filter_stu_attrs': ['gender:token']
}

def process(self, **kwargs):
df_stu = kwargs['df_stu']
df = kwargs['df']
df_stu = df_stu[df_stu[self.m2c_cfg['filter_stu_attrs']].notna().all(axis=1)].reset_index(drop=True)
df = df[df['stu_id:token'].isin(df_stu['stu_id:token'])].reset_index(drop=True)

kwargs['df'] = df
kwargs['df_stu'] = df_stu

return kwargs
4 changes: 2 additions & 2 deletions edustudio/atom_op/mid2cache/single/M2C_CL4KT_OP.py
@@ -27,8 +27,8 @@ def process(self, **kwargs):
def compute_cpt2difflevel(self, **kwargs):
cpt_correct = defaultdict(int)
cpt_count = defaultdict(int)
for i, (c_list, r_list) in enumerate(zip(kwargs['df_train_folds'][0]['cpt_unfold_seq:token_seq'], kwargs['df_train_folds'][0]['label_seq:float_seq'])):
for c, r in zip(c_list[kwargs['df_train_folds'][0]['mask_seq:token_seq'][i] == 1], r_list[kwargs['df_train_folds'][0]['mask_seq:token_seq'][i] == 1]):
for i, (c_list, r_list) in enumerate(zip(kwargs['df_seq']['cpt_unfold_seq:token_seq'], kwargs['df_seq']['label_seq:float_seq'])):
for c, r in zip(c_list[kwargs['df_seq']['mask_seq:token_seq'][i] == 1], r_list[kwargs['df_seq']['mask_seq:token_seq'][i] == 1]):
cpt_correct[c] += r
cpt_count[c] += 1
cpt_diff = {c: cpt_correct[c] / float(cpt_count[c]) for c in cpt_correct} # cpt difficult
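The per-concept difficulty statistic computed above (the correct-rate of each concept over masked positions) in isolation, with toy arrays standing in for `df_seq`'s padded sequences:

```python
import numpy as np
from collections import defaultdict

# Toy padded sequences: concept ids, response labels, and validity masks.
cpt_seqs = np.array([[0, 1, 1], [1, 2, 0]])
label_seqs = np.array([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
mask_seqs = np.array([[1, 1, 1], [1, 1, 0]])  # last step of seq 2 is padding

cpt_correct, cpt_count = defaultdict(float), defaultdict(int)
for c_list, r_list, m in zip(cpt_seqs, label_seqs, mask_seqs):
    for c, r in zip(c_list[m == 1], r_list[m == 1]):
        cpt_correct[c] += r
        cpt_count[c] += 1

# Per-concept correct rate; padding positions never contribute.
cpt_diff = {int(c): cpt_correct[c] / float(cpt_count[c]) for c in cpt_correct}
```

Note that the masked-out `(cpt 0, label 0.0)` pair in the second sequence is excluded, which is exactly why the mask filter above matters for padded batches.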
14 changes: 12 additions & 2 deletions edustudio/atom_op/mid2cache/single/M2C_QDKT_OP.py
@@ -1,5 +1,7 @@
import networkx as nx
from ..common import BaseMid2Cache
import torch
from torch.nn import functional as F
import numpy as np


@@ -12,11 +14,19 @@ def process(self, **kwargs):
self.num_q = dt_info['exer_count']
self.num_c = dt_info['cpt_count']
self.Q_mat = kwargs['Q_mat']
graph = self.generate_graph()
laplacian_matrix = self.laplacian_matrix(graph)
laplacian_matrix = self.laplacian_matrix_by_vectorization()
kwargs['laplacian_matrix'] = laplacian_matrix
return kwargs

def laplacian_matrix_by_vectorization(self):
normQ = F.normalize(self.Q_mat.float(), p=2, dim=-1)
A = torch.mm(normQ, normQ.T) > (1 - 1/len(normQ))
A = A.int()  # adjacency matrix
D = A.sum(-1, dtype=torch.int32)  # node degrees
diag_idx = torch.arange(len(A))
A[diag_idx, diag_idx] = D - A[diag_idx, diag_idx]
return A

def generate_graph(self):

graph = nx.Graph()
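As a sanity check on the vectorized rule above (two exercises are adjacent when the cosine similarity of their Q-matrix rows exceeds `1 - 1/num_exercises`), the sketch below reproduces it with NumPy and compares against a naive pairwise loop; the diff's own implementation uses torch, so this is only a dependency-light illustration:

```python
import numpy as np

Q = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 1]], dtype=float)  # toy Q-matrix
thresh = 1 - 1 / len(Q)

# Vectorized: row-normalize, then one matrix product gives all cosine sims.
norms = np.linalg.norm(Q, axis=1, keepdims=True)
normQ = Q / np.where(norms == 0, 1.0, norms)
A_vec = (normQ @ normQ.T > thresh).astype(int)

# Naive pairwise loop for comparison.
A_loop = np.zeros_like(A_vec)
for i in range(len(Q)):
    for j in range(len(Q)):
        denom = np.linalg.norm(Q[i]) * np.linalg.norm(Q[j])
        sim = Q[i] @ Q[j] / denom if denom else 0.0
        A_loop[i, j] = int(sim > thresh)
```

The vectorized form replaces the O(n²) Python loop over `nx.Graph` edges with a single matrix product, which is the point of this PR's optimization.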
5 changes: 4 additions & 1 deletion edustudio/atom_op/raw2mid/__init__.py
@@ -15,7 +15,8 @@
from .nips12 import R2M_Eedi_20_T12
from .nips34 import R2M_Eedi_20_T34
from .simulated5 import R2M_Simulated5

from .slp_english import R2M_SLP_English
from .slp_math import R2M_SLP_Math

# look up api dict
_cli_api_dict_ = {}
@@ -35,3 +36,5 @@
_cli_api_dict_['R2M_Eedi_20_T12'] = R2M_Eedi_20_T12.from_cli
_cli_api_dict_['R2M_Eedi_20_T34'] = R2M_Eedi_20_T34.from_cli
_cli_api_dict_['R2M_Simulated5'] = R2M_Simulated5.from_cli
_cli_api_dict_['R2M_SLP_Math'] = R2M_SLP_Math.from_cli
_cli_api_dict_['R2M_SLP_English'] = R2M_SLP_English.from_cli
2 changes: 1 addition & 1 deletion edustudio/atom_op/raw2mid/nips12.py
@@ -8,7 +8,7 @@


class R2M_Eedi_20_T12(BaseRaw2Mid):
"""R2M_NIPS12 is to preprocess NIPS 2020 challenge Task 1&2 dataset"""
"""R2M_Eedi_20_T12 is to preprocess NIPS 2020 challenge Task 1&2 dataset"""
def process(self):
super().process()
# load and inspect the raw data
91 changes: 91 additions & 0 deletions edustudio/atom_op/raw2mid/slp_english.py
@@ -0,0 +1,91 @@
from edustudio.atom_op.raw2mid import BaseRaw2Mid
import pandas as pd
import numpy as np
import time

"""
SLP Dataset: https://aic-fe.bnu.edu.cn/en/data/index.html
"""


class R2M_SLP_English(BaseRaw2Mid):
"""
rawdata: https://aic-fe.bnu.edu.cn/en/data/index.html
"""
def process(self):
super().process()

# for stu
df_stu = pd.read_csv(f"{self.rawpath}/student.csv")
df_stu.dropna(subset=['school_id'], inplace=True, how='any', axis=0)
df_stu = df_stu[df_stu['school_id'] != 'n.a.']

df_stu = df_stu.merge(
pd.read_csv(f"{self.rawpath}/family.csv", index_col=False),
on=['student_id'], how='inner'
)

df_stu = df_stu.merge(
pd.read_csv(f"{self.rawpath}/school.csv"),
on=['school_id'], how='inner'
)

df_stu.drop([
'rate_of_higher_educated_teachers',
"rate_of_teachers_with_master's_degree_and_above"
], inplace=True, axis=1)
df_stu.rename(columns={
'student_id': 'stu_id:token', 'gender': 'gender:token',
'school_id': 'sch_id:token', 'class_id': 'class_id:token',
'age_father': 'age_father:float', 'age_mother': 'age_mother:token',
'edubg_father': 'edubg_father:token', 'edubg_mother':'edubg_mother:token',
'affiliation_father':'affiliation_father:token',
'affiliation_mother': 'affiliation_mother:token',
'family_income': 'family_income:token', 'is_only_child':'is_only_child:token',
'live_on_campus': 'live_on_campus:token',
'gathering_frequency_father':'gathering_frequency_father:token',
'gathering_frequency_mother':'gathering_frequency_mother:token',
'family_traveling_times': "family_traveling_times:token",
'school_type': 'school_type:token',
'dist_to_downtown': 'dist_to_downtown:float',
#'rate_of_higher_educated_teachers': 'rate_of_higher_educated_teachers:float',
#"rate_of_teachers_with_master's_degree_and_above": "rate_of_teachers_with_master's_degree_and_above:float",
}, inplace=True)

# for inter
df_inter = pd.read_csv(f"{self.rawpath}/term-eng.csv", index_col=False, low_memory=False)
df_inter = df_inter[(df_inter == 'n.a.').sum(axis=1) == 0].reset_index(drop=True)
df_inter = df_inter[df_inter['concept'] != 'n.a.']
df_inter['label'] = df_inter['score'].astype(float)/df_inter['full_score'].astype(float)

df_exer = df_inter[['question_id', 'exam_id', 'subject_abbr', 'concept']]
df_inter = df_inter[['student_id', 'question_id', 'score', 'full_score', 'time_access', 'label']]
df_exer.drop_duplicates(subset=['question_id'], inplace=True)
df_exer['concept'] = df_exer['concept'].apply(lambda x: x.split(";"))
df_inter['time_access'] = df_inter['time_access'].apply(lambda x: self.convert2timestamp(x))

df_inter.rename(columns={
'student_id': 'stu_id:token', 'question_id': 'exer_id:token',
'score': 'score:float', 'full_score':'full_score:float',
'time_access': 'start_timestamp:float', 'label':'label:float'
}, inplace=True)

df_exer.rename(columns={
'question_id': 'exer_id:token',
'exam_id': 'exam_id:token',
'subject_abbr': 'subject_abbr:token',
'concept': 'cpt_seq:token_seq'
}, inplace=True)

df_inter['order_id:token'] = df_inter['start_timestamp:float'].astype(int)

# save
df_inter.to_csv(f"{self.midpath}/{self.dt}.inter.csv", index=False, encoding='utf-8')
df_stu.to_csv(f"{self.midpath}/{self.dt}.stu.csv", index=False, encoding='utf-8')
df_exer.to_csv(f"{self.midpath}/{self.dt}.exer.csv", index=False, encoding='utf-8')

@staticmethod
def convert2timestamp(dt):
timeArray = time.strptime(dt, "%Y-%m-%d %H:%M:%S")
timestamp = time.mktime(timeArray)
return timestamp
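`convert2timestamp` boils down to two stdlib calls, shown standalone below. Note that `time.mktime` interprets the parsed `struct_time` in local time, so the resulting `start_timestamp:float` (and hence `order_id:token`) values are timezone-dependent:

```python
import time

def convert2timestamp(dt):
    """Parse 'YYYY-mm-dd HH:MM:SS' into a local-time Unix timestamp."""
    time_array = time.strptime(dt, "%Y-%m-%d %H:%M:%S")
    return time.mktime(time_array)

ts = convert2timestamp("2020-01-01 00:00:00")
```

Within a single dataset processed on one machine this is harmless, since only the relative ordering of `order_id:token` matters downstream.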
86 changes: 86 additions & 0 deletions edustudio/atom_op/raw2mid/slp_math.py
@@ -0,0 +1,86 @@
from edustudio.atom_op.raw2mid import BaseRaw2Mid
import pandas as pd
import numpy as np
import time

"""
SLP Dataset: https://aic-fe.bnu.edu.cn/en/data/index.html
"""

class R2M_SLP_Math(BaseRaw2Mid):
def process(self):
super().process()

# for stu
df_stu = pd.read_csv(f"{self.rawpath}/student.csv")
df_stu.dropna(subset=['school_id'], inplace=True, how='any', axis=0)
df_stu = df_stu[df_stu['school_id'] != 'n.a.']

df_stu = df_stu.merge(
pd.read_csv(f"{self.rawpath}/family.csv", index_col=False),
on=['student_id'], how='inner'
)

df_stu = df_stu.merge(
pd.read_csv(f"{self.rawpath}/school.csv"),
on=['school_id'], how='inner'
)

df_stu.drop([
'rate_of_higher_educated_teachers',
"rate_of_teachers_with_master's_degree_and_above"
], inplace=True, axis=1)
df_stu.rename(columns={
'student_id': 'stu_id:token', 'gender': 'gender:token',
'school_id': 'sch_id:token', 'class_id': 'class_id:token',
'age_father': 'age_father:float', 'age_mother': 'age_mother:token',
'edubg_father': 'edubg_father:token', 'edubg_mother':'edubg_mother:token',
'affiliation_father':'affiliation_father:token',
'affiliation_mother': 'affiliation_mother:token',
'family_income': 'family_income:token', 'is_only_child':'is_only_child:token',
'live_on_campus': 'live_on_campus:token',
'gathering_frequency_father':'gathering_frequency_father:token',
'gathering_frequency_mother':'gathering_frequency_mother:token',
'family_traveling_times': "family_traveling_times:token",
'school_type': 'school_type:token',
'dist_to_downtown': 'dist_to_downtown:float',
#'rate_of_higher_educated_teachers': 'rate_of_higher_educated_teachers:float',
#"rate_of_teachers_with_master's_degree_and_above": "rate_of_teachers_with_master's_degree_and_above:float",
}, inplace=True)

# for inter
df_inter = pd.read_csv(f"{self.rawpath}/term-mat.csv", index_col=False)
df_inter = df_inter[df_inter['concept'] != 'n.a.']
df_inter['label'] = df_inter['score']/df_inter['full_score']

df_exer = df_inter[['question_id', 'exam_id', 'subject_abbr', 'concept']]
df_inter = df_inter[['student_id', 'question_id', 'score', 'full_score', 'time_access', 'label']]
df_exer.drop_duplicates(subset=['question_id'], inplace=True)
df_exer['concept'] = df_exer['concept'].apply(lambda x: x.split(";"))
df_inter['time_access'] = df_inter['time_access'].apply(lambda x: self.convert2timestamp(x))

df_inter.rename(columns={
'student_id': 'stu_id:token', 'question_id': 'exer_id:token',
'score': 'score:float', 'full_score':'full_score:float',
'time_access': 'start_timestamp:float', 'label':'label:float'
}, inplace=True)

df_exer.rename(columns={
'question_id': 'exer_id:token',
'exam_id': 'exam_id:token',
'subject_abbr': 'subject_abbr:token',
'concept': 'cpt_seq:token_seq'
}, inplace=True)

df_inter['order_id:token'] = df_inter['start_timestamp:float'].astype(int)

# save
df_inter.to_csv(f"{self.midpath}/{self.dt}.inter.csv", index=False, encoding='utf-8')
df_stu.to_csv(f"{self.midpath}/{self.dt}.stu.csv", index=False, encoding='utf-8')
df_exer.to_csv(f"{self.midpath}/{self.dt}.exer.csv", index=False, encoding='utf-8')

@staticmethod
def convert2timestamp(dt):
timeArray = time.strptime(dt, "%Y-%m-%d %H:%M:%S")
timestamp = time.mktime(timeArray)
return timestamp
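The helper above parses a local-time string and converts it via `time.mktime`, which interprets the parsed fields in the machine's local timezone. A minimal sketch of its behavior (the absolute timestamp is timezone-dependent, but intervals between two converted values are stable):

```python
import time

def convert2timestamp(dt):
    # Parse "YYYY-mm-dd HH:MM:SS" and convert to a POSIX timestamp.
    # time.mktime interprets the struct_time in the *local* timezone,
    # so the absolute value differs across machines.
    time_array = time.strptime(dt, "%Y-%m-%d %H:%M:%S")
    return time.mktime(time_array)

t1 = convert2timestamp("2023-01-01 10:00:00")
t2 = convert2timestamp("2023-01-01 11:00:00")
delta = t2 - t1  # one hour apart regardless of timezone
```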
70 changes: 70 additions & 0 deletions edustudio/datatpl/CD/DCDDataTPL.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
import os
from ..common.edu_datatpl import EduDataTPL
import json
from edustudio.datatpl.common.general_datatpl import DataTPLStatus
import torch


class DCDDataTPL(EduDataTPL):
default_cfg = {
'n_folds': 5,
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_FilterRecords4CD', 'M2C_ReMapId', 'M2C_RandomDataSplit4CD', 'M2C_GenQMat', 'M2C_BuildMissingQ', 'M2C_FillMissingQ'],
'cpt_relation_file_name': 'cpt_relation',
}

def __init__(self, cfg, df, df_train=None, df_valid=None, df_test=None, dict_cpt_relation=None, status=DataTPLStatus(), df_stu=None, df_exer=None):
self.dict_cpt_relation = dict_cpt_relation
super().__init__(cfg, df, df_train, df_valid, df_test, df_stu, df_exer, status)

def _check_params(self):
super()._check_params()
assert 0 <= self.datatpl_cfg['Q_delete_ratio'] < 1

@property
def common_str2df(self):
dic = super().common_str2df
dic['dict_cpt_relation'] = self.dict_cpt_relation
return dic


def process_data(self):
super().process_data()
dt_info = self.final_kwargs['dt_info']
user_count = dt_info['stu_count']
item_count = dt_info['exer_count']
self.interact_mat_list = []
for interact_df in self.final_kwargs['df_train_folds']:
interact_mat = torch.zeros((user_count, item_count), dtype=torch.int8)
idx = interact_df[interact_df['label:float'] == 1][['stu_id:token','exer_id:token']].to_numpy()
interact_mat[idx[:,0], idx[:,1]] = 1
idx = interact_df[interact_df['label:float'] != 1][['stu_id:token','exer_id:token']].to_numpy()
interact_mat[idx[:,0], idx[:,1]] = -1
self.interact_mat_list.append(interact_mat)

self.final_kwargs['interact_mat_list'] = self.interact_mat_list

if self.final_kwargs['dict_cpt_relation'] is None:
self.final_kwargs['dict_cpt_relation'] = {i: [i] for i in range(self.final_kwargs['dt_info']['cpt_count'])}
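The loop above encodes each training fold as a dense student-by-exercise matrix: +1 for a correct response, -1 for an incorrect one, 0 for unobserved. A toy sketch of the same indexing trick (NumPy stands in for `torch.int8` here to keep it self-contained; the interaction data is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical interaction log: 3 students, 4 exercises.
interact_df = pd.DataFrame({
    'stu_id:token':  [0, 0, 1, 2],
    'exer_id:token': [0, 2, 1, 3],
    'label:float':   [1.0, 0.0, 1.0, 0.0],
})

interact_mat = np.zeros((3, 4), dtype=np.int8)
# Correct responses -> +1
idx = interact_df[interact_df['label:float'] == 1][['stu_id:token', 'exer_id:token']].to_numpy()
interact_mat[idx[:, 0], idx[:, 1]] = 1
# Incorrect responses -> -1
idx = interact_df[interact_df['label:float'] != 1][['stu_id:token', 'exer_id:token']].to_numpy()
interact_mat[idx[:, 0], idx[:, 1]] = -1
```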

@classmethod
def load_data(cls, cfg):
kwargs = super().load_data(cfg)
fph = f"{cfg.frame_cfg.data_folder_path}/middata/{cfg.datatpl_cfg['cpt_relation_file_name']}.json"
if os.path.exists(fph):
with open(fph, 'r', encoding='utf-8') as f:
kwargs['dict_cpt_relation'] = json.load(f)
else:
cfg.logger.warning(f"Can't find concept-relation file: {fph}")
kwargs['dict_cpt_relation'] = None
return kwargs

def get_extra_data(self):
extra_dict = super().get_extra_data()
extra_dict['filling_Q_mat'] = self.filling_Q_mat
extra_dict['interact_mat'] = self.interact_mat
return extra_dict

def set_info_for_fold(self, fold_id):
super().set_info_for_fold(fold_id)
self.filling_Q_mat = self.final_kwargs['filling_Q_mat_list'][fold_id]
self.interact_mat = self.final_kwargs['interact_mat_list'][fold_id]
7 changes: 7 additions & 0 deletions edustudio/datatpl/CD/FAIRDataTPL.py
@@ -0,0 +1,7 @@
from ..common import EduDataTPL

class FAIRDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_FilteringRecordsByAttr', 'M2C_FilterRecords4CD', 'M2C_ReMapId', 'M2C_RandomDataSplit4CD', 'M2C_GenQMat'],
}

4 changes: 2 additions & 2 deletions edustudio/datatpl/CD/RCDDataTPL.py
@@ -7,10 +7,10 @@ class RCDDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': [
'M2C_Label2Int', 'M2C_FilterRecords4CD', 'M2C_ReMapId',
'M2C_RandomDataSplit4CD', 'M2C_BuildCptRelation',
'M2C_RandomDataSplit4CD', 'M2C_BuildKCRelation',
'M2C_GenQMat', 'M2C_RCD_OP'
],
'M2C_BuildCptRelation': {
'M2C_BuildKCRelation': {
'relation_type': 'rcd_transition',
'threshold': None
}
4 changes: 3 additions & 1 deletion edustudio/datatpl/CD/__init__.py
@@ -7,4 +7,6 @@
from .CNCDQDataTPL import CNCDQDataTPL
from .RCDDataTPL import RCDDataTPL
from .CDGKDataTPL import CDGKDataTPL
from.ECDDataTPL import ECDDataTPL
from .ECDDataTPL import ECDDataTPL
from .DCDDataTPL import DCDDataTPL
from .FAIRDataTPL import FAIRDataTPL
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/CL4KTDataTPL.py
@@ -4,7 +4,7 @@

class CL4KTDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_GenUnFoldCptSeq', 'M2C_CL4KT_OP'],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_GenUnFoldKCSeq', 'M2C_CL4KT_OP', 'M2C_RandomDataSplit4KT'],
'M2C_CL4KT_OP': {
'sequence_truncation': 'recent',
}
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/DIMKTDataTPL.py
@@ -6,7 +6,7 @@

class DIMKTDataTPL(KTInterExtendsQDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_GenUnFoldCptSeq', 'M2C_BuildSeqInterFeats', 'M2C_GenCptSeq', "M2C_DIMKT_OP"],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_GenUnFoldKCSeq', 'M2C_BuildSeqInterFeats', 'M2C_RandomDataSplit4KT', 'M2C_GenKCSeq', "M2C_DIMKT_OP"],
'M2C_BuildSeqInterFeats': {
# 'window_size': 200,
"extra_inter_feats": ['start_timestamp:float', 'cpt_unfold:token']
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/DKTDSCDataTPL.py
@@ -6,7 +6,7 @@

class DKTDSCDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ["M2C_CptAsExer", 'M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', "M2C_DKTDSC_OP"],
'mid2cache_op_seq': ["M2C_KCAsExer", 'M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats','M2C_RandomDataSplit4KT', "M2C_DKTDSC_OP"],
}

def __getitem__(self, index):
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/DKTForgetDataTPL.py
@@ -3,7 +3,7 @@

class DKTForgetDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', "M2C_DKTForget_OP"],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats','M2C_RandomDataSplit4KT', "M2C_DKTForget_OP"],
'M2C_BuildSeqInterFeats': {
"extra_inter_feats": ['start_timestamp:float']
}
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/EERNNDataTPL.py
@@ -5,7 +5,7 @@

class EERNNDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_EERNN_OP'],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats','M2C_RandomDataSplit4KT', 'M2C_EERNN_OP'],
}

def get_extra_data(self, **kwargs):
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/EKTDataTPL.py
@@ -5,7 +5,7 @@

class EKTDataTPL(EERNNDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_GenCptSeq', 'M2C_EERNN_OP'],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_RandomDataSplit4KT', 'M2C_GenKCSeq', 'M2C_EERNN_OP'],
}

def __getitem__(self, index):
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/GKTDataTPL.py
@@ -5,7 +5,7 @@

class GKTDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ["M2C_CptAsExer", 'M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats'],
'mid2cache_op_seq': ["M2C_KCAsExer", 'M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_RandomDataSplit4KT'],
}

def process_load_data_from_middata(self):
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/KTInterCptAsExerDataTPL.py
@@ -2,6 +2,6 @@

class KTInterCptAsExerDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ["M2C_CptAsExer", 'M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats'],
'mid2cache_op_seq': ["M2C_KCAsExer", 'M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_RandomDataSplit4KT'],
}

2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/KTInterCptUnfoldDataTPL.py
@@ -4,7 +4,7 @@

class KTInterCptUnfoldDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_GenUnFoldCptSeq', 'M2C_BuildSeqInterFeats'],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_GenUnFoldKCSeq', 'M2C_BuildSeqInterFeats', 'M2C_RandomDataSplit4KT'],
'M2C_BuildSeqInterFeats': {
"extra_inter_feats": ['start_timestamp:float', 'cpt_unfold:token']
}
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/KTInterDataTPL.py
@@ -2,6 +2,6 @@

class KTInterDataTPL(GeneralDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats'],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_RandomDataSplit4KT'],
}

2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/KTInterExtendsQDataTPL.py
@@ -4,7 +4,7 @@

class KTInterExtendsQDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_GenCptSeq'],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_RandomDataSplit4KT', 'M2C_GenKCSeq'],
}

def __getitem__(self, index):
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/LPKTDataTPL.py
@@ -3,7 +3,7 @@

class LPKTDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_LPKT_OP', "M2C_GenQMat"],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_RandomDataSplit4KT', 'M2C_LPKT_OP', "M2C_GenQMat"],
'M2C_BuildSeqInterFeats': {
"extra_inter_feats": ['start_timestamp:float', 'answer_time:float']
}
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/QDKTDataTPL.py
@@ -7,7 +7,7 @@

class QDKTDataTPL(KTInterExtendsQDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_GenCptSeq','M2C_GenQMat','M2C_QDKT_OP'],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId', 'M2C_BuildSeqInterFeats', 'M2C_RandomDataSplit4KT', 'M2C_GenKCSeq','M2C_GenQMat','M2C_QDKT_OP'],
}

def get_extra_data(self, **kwargs):
2 changes: 1 addition & 1 deletion edustudio/datatpl/KT/RKTDataTPL.py
@@ -7,7 +7,7 @@

class RKTDataTPL(EduDataTPL):
default_cfg = {
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId','M2C_GenQMat', 'M2C_BuildSeqInterFeats'],
'mid2cache_op_seq': ['M2C_Label2Int', 'M2C_ReMapId','M2C_GenQMat', 'M2C_BuildSeqInterFeats', 'M2C_RandomDataSplit4KT'],
'M2C_BuildSeqInterFeats': {
"extra_inter_feats": ['start_timestamp:float']
}
20 changes: 16 additions & 4 deletions edustudio/datatpl/common/base_datatpl.py
@@ -6,6 +6,7 @@
import yaml
import re
import os
import requests


class BaseDataTPL(Dataset):
@@ -73,15 +74,26 @@ def download_dataset(cls, cfg):
cfg (UnifyConfig):the global config object
"""
dt_name = cfg.dataset
cfg.logger.warning(f"Can't find dataset files of {dt_name} in local environment!")
cfg.logger.info(f"Prepare to download {dt_name} from Internet.")
cfg.logger.warning(f"Can't find dataset files of {dt_name} in local disk")

fph = cfg.frame_cfg['DT_INFO_FILE_PATH']
dataset_info = cls.read_yml_file(fph)
dataset_info_from_cfg: dict = cfg['frame_cfg']['DT_INFO_DICT']
dataset_info_from_cfg.update(dataset_info)
dataset_info.update(dataset_info_from_cfg)

if dt_name not in dataset_info:
raise Exception("Can't find dataset files from Local and Internet!")
cfg.logger.info(f"Preparing to download external datasets.yaml to find dataset: {dt_name}")
url = "https://huggingface.co/datasets/lmcRS/edustudio-datasets/raw/main/datasets.yaml"
cfg.logger.info(f"External datasets.yaml url: {url}")
resp = requests.get(url)
dataset_info_external = yaml.load(resp.text, Loader=cls._build_yaml_loader())
if dt_name not in dataset_info_external:
raise Exception("Can't find dataset files on local disk or online")
else:
dataset_info.update(dataset_info_external)

cfg.logger.info(f"Preparing to download {dt_name} dataset from the Internet")
cfg.logger.info(f"Download_url: {dataset_info[dt_name]['middata_url']}")

if not os.path.exists(cfg.frame_cfg.data_folder_path):
os.makedirs(cfg.frame_cfg.data_folder_path)
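The hunk above also flips the merge direction (`dataset_info.update(dataset_info_from_cfg)`) so that user-supplied entries override the bundled `datasets.yaml`. A toy illustration of why the `dict.update` order matters (the keys and URLs are hypothetical):

```python
# Bundled dataset registry (e.g. parsed from datasets.yaml).
dataset_info = {
    'ASSIST_0910': {'middata_url': 'https://example.com/bundled.zip'},
    'FrcSub': {'middata_url': 'https://example.com/frcsub.zip'},
}
# User overrides from the runtime config.
dataset_info_from_cfg = {
    'ASSIST_0910': {'middata_url': 'https://example.com/mirror.zip'},
}

# dict.update keeps the *argument's* value on key collision,
# so updating dataset_info with the config dict lets the config win
# while untouched entries survive.
dataset_info.update(dataset_info_from_cfg)
```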
44 changes: 31 additions & 13 deletions edustudio/datatpl/common/general_datatpl.py
Original file line number Diff line number Diff line change
@@ -36,7 +36,7 @@ class GeneralDataTPL(BaseDataTPL):
'cache_id': 'cache_default',
'load_data_from': 'middata', # ['rawdata', 'middata', 'cachedata']
'inter_exclude_feat_names': (),
'raw2mid_op': None,
'raw2mid_op': "None",
'mid2cache_op_seq': []
}

@@ -70,6 +70,7 @@ def __init__(
if self.datatpl_cfg['load_data_from'] == 'cachedata':
self.load_cache()
self.check_cache()
self.process_data()
self.logger.info(f"Load from cache successfully: {self.datatpl_cfg['cache_id']}")
self.logger.info(self.datatpl_cfg['dt_info'])
else:
@@ -90,8 +91,7 @@ def from_cfg(cls, cfg):
Returns:
BaseDataTPL
"""
if not os.path.exists(f'{cfg.frame_cfg.data_folder_path}'):
print(cfg.frame_cfg.data_folder_path)
if not os.path.exists(cfg.frame_cfg.data_folder_path) or len(os.listdir(cfg.frame_cfg.data_folder_path)) == 0:
cls.download_dataset(cfg)

load_data_from = cfg.datatpl_cfg['load_data_from']
@@ -141,8 +141,6 @@ def process_data(self):
load_data_from = self.datatpl_cfg['load_data_from']
if load_data_from != 'cachedata':
self.process_load_data_from_middata()
else:
raise ValueError(f"load_data_from={load_data_from} is not expected to appear here")

@classmethod
def load_data(cls, cfg): # 只在middata存在时调用
@@ -240,7 +238,7 @@ def save_cache(self):
self.save_pickle(final_kwargs_fph, self.final_kwargs)

with open(f"{self.cache_folder_path}/datatpl_cfg.json", 'w', encoding='utf-8') as f:
json.dump(json.loads(self.datatpl_cfg.dump_tpl()), fp=f, indent=2, ensure_ascii=False)
json.dump(json.loads(self.datatpl_cfg.dump_fmt()), fp=f, indent=2, ensure_ascii=False)

def check_cache(self):
"""check whether the cache data is consistent with current config
@@ -251,11 +249,15 @@ def check_cache(self):
temp_cache_datatpl_cfg = copy.deepcopy(cache_datatpl_cfg)
del temp_cache_datatpl_cfg['dt_info']
del temp_cache_datatpl_cfg['load_data_from']
if 'is_save_cache' in temp_cache_datatpl_cfg:
del temp_cache_datatpl_cfg['is_save_cache']
# del temp_cache_datatpl_cfg['raw2mid_op']
# del temp_cache_datatpl_cfg['mid2cache_op_seq']
curr_datatpl_cfg = copy.deepcopy(json.loads(self.datatpl_cfg.dump_tpl()))
curr_datatpl_cfg = copy.deepcopy(json.loads(self.datatpl_cfg.dump_fmt()))
del curr_datatpl_cfg['dt_info']
del curr_datatpl_cfg['load_data_from']
if 'is_save_cache' in curr_datatpl_cfg:
del curr_datatpl_cfg['is_save_cache']
# del curr_datatpl_cfg['raw2mid_op']
# del curr_datatpl_cfg['mid2cache_op_seq']
diff = DeepDiff(temp_cache_datatpl_cfg, curr_datatpl_cfg)
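`check_cache` strips keys that legitimately vary between runs (`dt_info`, `load_data_from`, `is_save_cache`) before comparing the cached and current configs. A plain-equality sketch of that idea (the real template uses `DeepDiff`, which additionally produces a readable diff report):

```python
import copy

def configs_match(cache_cfg: dict, curr_cfg: dict,
                  volatile=('dt_info', 'load_data_from', 'is_save_cache')) -> bool:
    # Drop run-specific keys, then compare what remains.
    a, b = copy.deepcopy(cache_cfg), copy.deepcopy(curr_cfg)
    for key in volatile:
        a.pop(key, None)
        b.pop(key, None)
    return a == b

same = configs_match(
    {'cache_id': 'cache_default', 'dt_info': {'stu_count': 10}},
    {'cache_id': 'cache_default', 'dt_info': {'stu_count': 99}},
)
```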
@@ -283,6 +285,13 @@ def load_cache(self):
self.dict_test_folds = self.load_pickle(test_folds_fph)
self.final_kwargs = self.load_pickle(final_kwargs_fph)

for k,v in self.final_kwargs.items():
if not hasattr(self, k):
setattr(self, k, v)
self.logger.info(f"[load cache] set {k} from final_kwargs to current data template")
else:
self.logger.info(f"[load cache] duplicated attribute in final_kwargs: {k}")

def build_datasets(self):
"""build datasets
"""
@@ -457,7 +466,7 @@ def _get_r2m_op(cls, cfg):
"""
from edustudio.atom_op.raw2mid import BaseRaw2Mid
r2m_op = cfg.datatpl_cfg['raw2mid_op']
assert r2m_op is not None
assert r2m_op is not None and r2m_op != "None"
if isinstance(r2m_op, str):
r2m_op = importlib.import_module('edustudio.atom_op.raw2mid').__getattribute__(r2m_op)
elif issubclass(r2m_op, BaseRaw2Mid):
@@ -542,16 +551,25 @@ def _preprocess_feat(df):
for col in df.columns:
col_name, col_type = col.split(":")
if col_type == 'token':
df[col] = df[col].astype('int64')
try:
df[col] = df[col].astype('int64')
except (ValueError, TypeError):
pass
elif col_type == 'float':
df[col] = df[col].astype('float32')
elif col_type == 'token_seq':
df[col] = df[col].astype(str).apply(lambda x: [int(i) for i in x.split(",")])
try:
df[col] = df[col].astype(str).apply(lambda x: [int(i) for i in x.split(",")])
except ValueError:  # fallback for list-literal strings like "[1, 2, 3]"
df[col] = df[col].astype(str).apply(lambda x: eval(x))
elif col_type == 'float_seq':
df[col] = df[col].astype(str).apply(lambda x: [float(i) for i in x.split(",")])
try:
df[col] = df[col].astype(str).apply(lambda x: [float(i) for i in x.split(",")])
except ValueError:  # fallback for list-literal strings like "[1.0, 2.0]"
df[col] = df[col].astype(str).apply(lambda x: eval(x))
else:
raise ValueError(f"unknown field type of {col_type}")

pass
@staticmethod
def _unwrap_feat(df:pd.DataFrame):
"""unwrap the type of field
2 changes: 1 addition & 1 deletion edustudio/datatpl/utils/common.py
@@ -7,7 +7,7 @@
class BigfileDownloader(object):
@staticmethod
def download(url, title, filepath, chunk_size=10240):
with closing(requests.get(url, stream=True)) as resp:
with closing(requests.get(url, stream=True, allow_redirects=True)) as resp:
if resp.status_code != 200:
raise Exception("[ERROR]: {} - {} -{}".format(str(resp.status_code), title, url))
chunk_size = chunk_size
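The `stream=True` / `allow_redirects=True` download above writes the response in fixed-size chunks instead of buffering the whole file in memory. The chunking loop can be sketched against any file-like pair, so no network is needed here:

```python
import io

def write_in_chunks(src, dst, chunk_size=10240):
    # Same pattern as iterating resp.iter_content(chunk_size=...):
    # copy fixed-size pieces so large files never sit fully in memory.
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)

src = io.BytesIO(b'x' * 25_000)   # stand-in for the HTTP response body
dst = io.BytesIO()                # stand-in for the output file
write_in_chunks(src, dst)
```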
13 changes: 9 additions & 4 deletions edustudio/datatpl/utils/pad_seq_util.py
@@ -53,10 +53,15 @@ def pad_sequence(

if return_idx:
return_idx = np.concatenate(return_idx_list).astype(np.int64)

is_dtype_str = np.issubdtype(dtype, np.str_) or np.issubdtype(
dtype, np.unicode_
)

version = np.__version__

if version.startswith('2.'):
is_dtype_str = np.issubdtype(dtype, np.str_)
else:
is_dtype_str = np.issubdtype(dtype, np.str_) or np.issubdtype(
dtype, np.unicode_
)
if isinstance(value, str) and dtype != object and not is_dtype_str:
raise ValueError(
f"`dtype` {dtype} is not compatible with `value`'s type: "
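The version guard above exists because `np.unicode_` (historically an alias of `np.str_`) was removed in NumPy 2.0, so referencing it there raises an `AttributeError`. A standalone sketch of the check:

```python
import numpy as np

def is_string_dtype(dtype) -> bool:
    # np.unicode_ is gone in NumPy >= 2.0, so only consult it on 1.x.
    if np.__version__.startswith('2.'):
        return np.issubdtype(dtype, np.str_)
    return np.issubdtype(dtype, np.str_) or np.issubdtype(dtype, np.unicode_)
```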