
Bugfix/utils get and tokenize dataset #4

Open
wants to merge 4 commits into master

Conversation

sahilsharma05

Found some issues in the get_and_tokenize_dataset method in utils.py.
Code changes:

  • Invalid key "labels" when calling dataset_map['labels'] -> replaced with DATASETS_LABELS_URL[dataset_dir].
  • Invalid test.txt and test.labels.txt files for the IMDB dataset (they don't exist on S3) -> replaced with valid.txt and valid.labels.txt respectively.
  • Replaced the 'test' key with 'valid' in the global variables DATASETS_URL and DATASETS_LABELS_URL; the training logic loads the dataset using the 'valid' key.
  • Added a test case checking the get_and_tokenize_dataset fixes for the IMDB dataset.
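A minimal sketch of the fix described above. The dict names DATASETS_URL and DATASETS_LABELS_URL and the 'test' -> 'valid' key change come from this PR; the URL values, the S3 base, and the helper function are hypothetical placeholders, not the actual utils.py code.

```python
S3 = "https://s3.amazonaws.com/datasets"  # placeholder base URL, not the real bucket

DATASETS_URL = {
    "imdb": {
        "train": f"{S3}/imdb/train.txt",
        "valid": f"{S3}/imdb/valid.txt",  # was "test": .../test.txt, which doesn't exist on S3
    },
}

DATASETS_LABELS_URL = {
    "imdb": {
        "train": f"{S3}/imdb/train.labels.txt",
        "valid": f"{S3}/imdb/valid.labels.txt",  # was test.labels.txt
    },
}

def get_dataset_files(dataset_dir):
    """Return (text_urls, label_urls) for a dataset.

    Before the fix, labels were looked up as dataset_map['labels'],
    an invalid key; they live in the separate DATASETS_LABELS_URL dict.
    """
    texts = DATASETS_URL[dataset_dir]
    labels = DATASETS_LABELS_URL[dataset_dir]  # the corrected lookup
    return texts, labels
```

With this layout, the training logic can index both dicts with the same 'valid' key instead of the broken 'test' entries.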
