
UDA Example #343 (Open)

jrxk wants to merge 91 commits into master
Conversation

@jrxk (Collaborator) commented Dec 19, 2020

This PR fixes #293.

Description of changes
This PR adds an example that uses UDA (Unsupervised Data Augmentation) to train a text classifier on the IMDB text classification dataset. Please see the README for details.
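For readers unfamiliar with the technique: UDA combines an ordinary supervised loss on labeled data with a consistency loss that pushes the model's predictions on an augmented (e.g. back-translated) unlabeled example toward its predictions on the original. A minimal sketch of that objective with toy logits instead of a real model (all names here are illustrative, not from this PR's code):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def uda_loss(sup_logits, sup_labels, unsup_logits, aug_logits, weight=1.0):
    # Supervised term: cross-entropy on the labeled batch.
    sup = sum(cross_entropy(l, y)
              for l, y in zip(sup_logits, sup_labels)) / len(sup_labels)
    # Consistency term: KL between predictions on the original unlabeled
    # input (treated as a fixed target in the real method) and on its
    # augmented, e.g. back-translated, version.
    cons = sum(kl_divergence(softmax(o), softmax(a))
               for o, a in zip(unsup_logits, aug_logits)) / len(unsup_logits)
    return sup + weight * cons
```

When the augmented predictions match the originals, the consistency term vanishes and the loss reduces to the supervised cross-entropy.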

Test Conducted
Performed experiments with/without UDA, under supervised and semi-supervised settings.

@jrxk jrxk added the data_aug Features on data augmentation label Dec 19, 2020
@jrxk jrxk self-assigned this Dec 19, 2020
@hunterhector (Member) left a comment

The whole thing is still non-Forte, without clear documentation of how to use UDA.

If a user looks at this example, they wouldn't know how to incorporate UDA. There is no explanation in the README and no comments in main.py. All they can see is 500 lines of code, with no idea where to start or which parts are important.

Previous comments on the last PR are not addressed, such as calling subprocess from Python, and there is no use of Forte at all. Why would anyone need this framework instead of going to Google's code?

Review comments (outdated, resolved):
  examples/text_classification/download_imdb.py
  examples/text_classification/download_imdb.py
  examples/text_classification/utils/imdb_format.py
  examples/text_classification/utils/data_utils.py
@codecov bot commented Dec 24, 2020

Codecov Report

Merging #343 (4c1b6fb) into master (900cf93) will decrease coverage by 0.09%.
The diff coverage is 54.16%.


@@            Coverage Diff             @@
##           master     #343      +/-   ##
==========================================
- Coverage   81.87%   81.78%   -0.10%     
==========================================
  Files         187      187              
  Lines       11941    11940       -1     
==========================================
- Hits         9777     9765      -12     
- Misses       2164     2175      +11     
Impacted Files                                        Coverage Δ
forte/data/readers/largemovie_reader.py               77.35% <52.17%> (-20.69%) ⬇️
tests/forte/data/readers/largemovie_reader_test.py    97.29% <100.00%> (-0.21%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hunterhector (Member) left a comment

I think we'd better place all data aug examples in the same folder: https://github.com/asyml/forte/tree/master/examples/data_augmentation

Review comments (outdated, resolved):
  examples/text_classification/utils/data_utils.py
  examples/text_classification/utils/data_utils.py
@hunterhector (Member) left a comment

This PR is still at very low quality and cannot be merged in its current state. For example, where is the step to generate the back translation data? How can users do it on their own?

Here is a list of problems:

  1. There are no variable type annotations.
  2. There are no instructions on how to get the back translation data.
  3. A lot of code is copied directly from Google.
  4. None of the docstrings follow our standard.
  5. After mentioning using the training team's format, nothing has changed: there is still a lot of ad-hoc code such as InputFeatures and InputExample.

If that's the quality of this work, there's no reason a user would use this instead of Google's implementation.
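To make points 1 and 4 concrete, here is what a small helper with type annotations and a docstring in the project's standard (assumed here to be Google style) might look like; `pad_sequence` is a made-up illustration, not code from this PR:

```python
def pad_sequence(ids: list[int], max_len: int, pad_id: int = 0) -> list[int]:
    """Pads or truncates a token id sequence to a fixed length.

    Args:
        ids: Token ids for one example.
        max_len: Target sequence length.
        pad_id: Id used to fill padding positions.

    Returns:
        A list of exactly `max_len` token ids.
    """
    # Append padding then cut to length, which also handles truncation.
    return (ids + [pad_id] * max_len)[:max_len]
```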

Review comments:
  examples/data_augmentation/uda/config_data.py (outdated, resolved)
  examples/data_augmentation/uda/utils/data_utils.py (outdated, resolved)
  examples/data_augmentation/uda/main.py (resolved)
  examples/data_augmentation/uda/utils/imdb_format.py (outdated, resolved)
  examples/data_augmentation/uda/utils/data_utils.py (outdated, resolved)
  examples/data_augmentation/uda/main.py (outdated, resolved)
@hunterhector (Member) left a comment

I only see the back_trans folder with the requirements file; I guess users would still need to refer to Google's code?

An alternative is to document, step by step, how to get the back translation from Google, rather than copying the code over.

Again, most docstrings do not follow our standard.
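For reference, the round-trip idea behind back translation fits in a few lines; `translate` below is a placeholder for whatever translation model the documented steps end up using (e.g. a tensor2tensor checkpoint), not a real API:

```python
def back_translate(text: str, translate, pivot: str = "fr") -> str:
    """Paraphrases `text` by round-trip translation through a pivot language.

    `translate(text, src=..., tgt=...)` is a hypothetical callable standing
    in for an actual translation model.
    """
    # Translate into the pivot language, then back into English; the
    # round trip introduces lexical and syntactic variation.
    pivot_text = translate(text, src="en", tgt=pivot)
    return translate(pivot_text, src=pivot, tgt="en")
```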

Review comments (outdated, resolved):
  forte/data/readers/largemovie_reader.py
  examples/data_augmentation/uda/utils/model_utils.py
@jrxk (Collaborator, Author) commented Dec 29, 2020

> I only see the back_trans folder with the requirements file; I guess users would still need to refer to Google's code?
>
> An alternative is to document, step by step, how to get the back translation from Google, rather than copying the code over.
>
> Again, most docstrings do not follow our standard.

Sure. It's still a work in progress. I will add instructions on how to install tensor2tensor with the right versions of TensorFlow and how to use the pre-trained translation models. Thanks!
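Those install instructions might boil down to something like the following sketch; the version pins are an assumption (tensor2tensor's translation models target TensorFlow 1.x) and should be confirmed against the example's README once it lands:

```shell
# Illustrative pins only; verify against the example's README.
pip install "tensorflow==1.15.*"   # tensor2tensor targets TF 1.x
pip install tensor2tensor
```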

Labels
data_aug Features on data augmentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a text classifier for the IMDB large movie dataset
3 participants