[FEATURE] Update D2V, AutoTokenizer, and pretraining scripts #155
Conversation
Codecov Report
Attention: Patch coverage is …

@@            Coverage Diff             @@
##              dev     #155      +/-   ##
==========================================
- Coverage   97.81%   97.31%   -0.51%
==========================================
  Files          80       84       +4
  Lines        4349     4650     +301
==========================================
+ Hits         4254     4525     +271
- Misses         95      125      +30

View full report in Codecov by Sentry.
EduNLP/ModelZoo/hf_model/hf_model.py (Outdated)

bert_config = AutoConfig.from_pretrained(pretrained_model_dir)
if init:
    logger.info(f'Load AutoModel from checkpoint: {pretrained_model_dir}')
    self.bert = AutoModel.from_pretrained(pretrained_model_dir)
Change this to something like self.model? AutoModel should not be constrained to BERT.
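A minimal sketch of the rename being suggested, assuming the wrapper takes pretrained_model_dir and an init flag as in the snippet above; the class name HfAutoModelWrapper and the from_config fallback branch are illustrative, not the code in this PR:

import logging

import torch.nn as nn
from transformers import AutoConfig, AutoModel

logger = logging.getLogger(__name__)


class HfAutoModelWrapper(nn.Module):  # illustrative class name
    def __init__(self, pretrained_model_dir: str, init: bool = True):
        super().__init__()
        self.config = AutoConfig.from_pretrained(pretrained_model_dir)
        if init:
            logger.info(f'Load AutoModel from checkpoint: {pretrained_model_dir}')
            # generic attribute name instead of self.bert: the backbone may be
            # BERT, RoBERTa, or any other encoder AutoModel can resolve
            self.model = AutoModel.from_pretrained(pretrained_model_dir)
        else:
            # build an untrained model of the same architecture from the config
            self.model = AutoModel.from_config(self.config)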
EduNLP/ModelZoo/hf_model/hf_model.py (Outdated)

bert_config = AutoConfig.from_pretrained(pretrained_model_dir)
if init:
    logger.info(f'Load AutoModel from checkpoint: {pretrained_model_dir}')
    self.bert = AutoModel.from_pretrained(pretrained_model_dir)
same here
EduNLP/Pretrain/auto_vec.py (Outdated)

    pass


def finetune_edu_auto_model(
Should it be something like pretrain_hf_auto_model? It is only used for HuggingFace models. Also, isn't this domain pretraining rather than fine-tuning?
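A sketch of what the proposed rename could look like; the signature, argument names, and defaults below are guesses at the intent of this function, not the actual EduNLP API:

from typing import List, Optional


def pretrain_hf_auto_model(items: List[dict], output_dir: str,
                           pretrained_model: str = "bert-base-chinese",
                           eval_items: Optional[List[dict]] = None,
                           **train_params):
    """Domain-adaptive pretraining of a HuggingFace AutoModel on educational
    items (continued language-model training), as opposed to fine-tuning on a
    downstream prediction task."""
    ...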
And should the file name be hf_auto_vec? It is not "auto" for our own educational models.
And should pretrain_bert and finetune_bert_for_xx behave the same as these auto functions? Maybe with these auto functions we can delete bert_vec; not sure whether that is better. Or we could keep that file but have it directly reuse the auto functions, as in the sketch below.
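One possible shape for that last option, with bert_vec.py reduced to thin wrappers; every name below (including finetune_hf_auto_model_for_property_prediction and the import path) is an assumption based on this discussion, not the code in this PR:

# EduNLP/Pretrain/bert_vec.py -- hypothetical thin-wrapper version
from EduNLP.Pretrain.auto_vec import (  # assumed module path and names
    pretrain_hf_auto_model,
    finetune_hf_auto_model_for_property_prediction,
)


def pretrain_bert(items, output_dir, pretrained_model="bert-base-chinese", **kwargs):
    # delegate to the generic HuggingFace auto function with a BERT default
    return pretrain_hf_auto_model(items, output_dir,
                                  pretrained_model=pretrained_model, **kwargs)


def finetune_bert_for_property_prediction(items, output_dir, **kwargs):
    return finetune_hf_auto_model_for_property_prediction(items, output_dir, **kwargs)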
EduNLP/Pretrain/elmo_vec.py (Outdated)

-__all__ = ["ElmoTokenizer", "ElmoDataset", "train_elmo", "train_elmo_for_property_prediction",
-           "train_elmo_for_knowledge_prediction"]
+__all__ = ["ElmoTokenizer", "ElmoDataset", "pretrain_elmo", "pretrain_elmo_for_property_prediction",
These should be finetune_elmo_for_xxx.
Maybe that's my task lol
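What the comment seems to be asking for, i.e. keeping pretrain_ only for the base language model and finetune_ for the downstream-task trainers; the exact names are my reading of the suggestion, not the merged code:

__all__ = ["ElmoTokenizer", "ElmoDataset", "pretrain_elmo",
           "finetune_elmo_for_property_prediction",
           "finetune_elmo_for_knowledge_prediction"]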
The test coverage seems to drop a lot. Try adding more tests for your new code.
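A minimal pytest-style sketch of the kind of test that could help recover coverage; it only exercises the tokenizer path and goes through transformers directly, whereas the real suite would call EduNLP's new auto tokenizer/model wrappers and use a small local fixture checkpoint:

from transformers import AutoTokenizer


def test_auto_tokenizer_produces_input_ids():
    # downloads a public checkpoint; a local fixture would be preferable in CI
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    encoded = tokenizer("已知三角形 ABC 的三边长分别为 3, 4, 5")
    assert len(encoded["input_ids"]) > 0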
Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.
Description
(Brief description of what this PR is about)
What does this implement/fix? Explain your changes.
...
Pull request type
Changes
Does this close any currently open issues?
N/A
Any relevant logs, error output, etc?
N/A
Checklist
Before you submit a pull request, please make sure you have the following:
Essentials
Comments