More data: Create two datasets optimized for Vader, and upload to datasets #14

Open
amirabdullah19852020 opened this issue Apr 7, 2024 · 0 comments
@amirabdullah19852020
Collaborator

The current IMDB dataset contains very few examples of tokens from the actual VADER lexicon. As such, let's create two new datasets that have high overlap with the VADER lexicon.

  1. A simple version that draws from OpenWebText and keeps sentences that have high overlap with the VADER lexicon.
  2. A "poisoned" version that flips the reward of 30 of the VADER tokens. This will give us a baseline to see if our IRMs can recover these tokens.

The columns of the dataset will be `text`, `lexicon_tokens`, `token_rewards_dict` and `poisoned`, which is a (usually empty) list of tokens. There will be 30 poisoned tokens in total.
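To make the proposed columns concrete, here is a minimal sketch of how one row could be built. It rests on several assumptions that are not settled in this issue: "flipping" a reward is taken to mean negating its sign, `poisoned` is taken to list the poisoned tokens that occur in that row, and `TOY_LEXICON`, `flip_rewards` and `make_row` are hypothetical names with made-up scores, not real VADER values.

```python
import re

# Toy stand-in for the VADER lexicon (token -> reward). Scores are made up.
TOY_LEXICON = {"good": 1.9, "bad": -2.5, "great": 3.1, "awful": -2.0}


def flip_rewards(lexicon, poisoned_tokens):
    """Return a copy of the lexicon with the sign negated for poisoned tokens.

    Assumption: "flipping the reward" means multiplying it by -1.
    """
    return {
        tok: (-score if tok in poisoned_tokens else score)
        for tok, score in lexicon.items()
    }


def make_row(text, lexicon, poisoned_tokens=frozenset()):
    """Build one dataset row with the four proposed columns."""
    words = re.findall(r"[a-z']+", text.lower())
    # Keep lexicon tokens in order of first appearance, without duplicates.
    seen = []
    for word in words:
        if word in lexicon and word not in seen:
            seen.append(word)
    return {
        "text": text,
        "lexicon_tokens": seen,
        "token_rewards_dict": {tok: lexicon[tok] for tok in seen},
        # Assumption: only the poisoned tokens present in this row are listed,
        # so the list is empty for most rows.
        "poisoned": [tok for tok in seen if tok in poisoned_tokens],
    }


poisoned = {"bad"}
poisoned_lexicon = flip_rewards(TOY_LEXICON, poisoned)
row = make_row("A good film with a bad ending.", poisoned_lexicon, poisoned)
clean_row = make_row("A good film.", TOY_LEXICON)
```

In the simple (unpoisoned) dataset, the same `make_row` call with no poisoned tokens leaves `poisoned` as an empty list.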

The VADER lexicon tokens will be ordered by their frequency in English, and the top 4000 will be picked, with 5 occurrences of each in the dataset.
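The selection step could be sketched roughly as follows. This is only an illustration under assumptions: `word_freq` is a hypothetical mapping from word to English frequency (its source is not specified here), whitespace splitting stands in for real tokenization, and "5 occurrences each" is read as a per-token quota on the sentences kept.

```python
from collections import defaultdict


def top_tokens_by_frequency(lexicon_tokens, word_freq, n):
    """Rank lexicon tokens by frequency (highest first) and keep the top n."""
    ranked = sorted(lexicon_tokens, key=lambda t: word_freq.get(t, 0), reverse=True)
    return ranked[:n]


def pick_sentences(sentences, tokens, per_token):
    """Keep a sentence if it contains at least one token still under quota.

    Only under-quota tokens are counted, so a token that has already reached
    its quota may still appear in sentences kept for other tokens.
    """
    wanted = set(tokens)
    counts = defaultdict(int)
    picked = []
    for sentence in sentences:
        words = set(sentence.lower().split())
        hits = [t for t in wanted & words if counts[t] < per_token]
        if hits:
            picked.append(sentence)
            for t in hits:
                counts[t] += 1
    return picked, dict(counts)


# Toy data; the issue proposes n=4000 and per_token=5.
freq = {"good": 100, "bad": 50, "awful": 5}
top = top_tokens_by_frequency(["awful", "good", "bad"], freq, n=2)
sents = ["good day", "good night", "good grief", "bad day", "nothing here"]
picked, counts = pick_sentences(sents, top, per_token=2)
```

With the numbers proposed above, the same calls would use `n=4000` and `per_token=5` over OpenWebText sentences.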
