More data: Create two datasets optimized for Vader, and upload to datasets #14

amirabdullah19852020 · 2024-04-07T20:14:21Z

The current IMDB dataset contains very few examples of tokens from the VADER lexicon. To address this, let's create two new datasets that have high overlap with the VADER lexicon:

A simple version that draws from openwebtext and keeps sentences that have high overlap with the VADER lexicon.
A "poisoned" version that flips the reward of 30 of the VADER tokens. This will give us a baseline to see whether our IRMs can recover these tokens.
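As a minimal sketch of the poisoning step, assuming that "flipping the reward" means negating the sign of a token's lexicon score: the function name `poison_lexicon`, the fixed seed, and the uniform random choice of which 30 tokens to flip are all illustrative assumptions, not something this issue specifies.

```python
import random


def poison_lexicon(token_rewards, n_poisoned=30, seed=0):
    """Negate the reward of n_poisoned randomly chosen tokens.

    Hypothetical helper: the selection strategy (uniform sampling with a
    fixed seed) is an assumption made for illustration.
    Returns (poisoned_rewards, sorted list of poisoned tokens).
    """
    rng = random.Random(seed)
    # A zero reward cannot be flipped, so skip those tokens.
    candidates = sorted(t for t, r in token_rewards.items() if r != 0)
    if len(candidates) < n_poisoned:
        raise ValueError("not enough tokens with a nonzero reward")
    chosen = set(rng.sample(candidates, n_poisoned))
    poisoned = {t: (-r if t in chosen else r) for t, r in token_rewards.items()}
    return poisoned, sorted(chosen)
```

Keeping the returned list of flipped tokens makes it possible to check later whether a model recovers exactly those tokens.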

The columns of the dataset will be text, lexicon_tokens, token_rewards_dict, and poisoned, which is a (usually empty) list of poisoned tokens. There will be 30 poisoned tokens in total.
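To make the schema concrete, here is a rough sketch of how one row might be assembled. The column names come from this issue; the helper `build_row`, the regex tokenization, and the reading that each column only covers tokens present in that row's text are assumptions.

```python
import re


def build_row(text, token_rewards, poisoned_tokens):
    """Assemble one dataset row (hypothetical helper).

    token_rewards: dict mapping lexicon token -> reward.
    poisoned_tokens: collection of tokens whose reward was flipped.
    """
    # Naive lowercase word tokenization; the real pipeline may differ.
    words = re.findall(r"[a-z']+", text.lower())
    # Lexicon tokens found in the text, de-duplicated, in order of appearance.
    lexicon_tokens = list(dict.fromkeys(w for w in words if w in token_rewards))
    return {
        "text": text,
        "lexicon_tokens": lexicon_tokens,
        "token_rewards_dict": {t: token_rewards[t] for t in lexicon_tokens},
        # Usually empty: only rows containing a poisoned token list it here.
        "poisoned": [t for t in lexicon_tokens if t in poisoned_tokens],
    }
```

A list of such dicts can then be turned into a dataset with whichever loader the project already uses.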

The VADER lexicon tokens will be ordered by their frequency in English, and the top 4000 will be picked, with 5 occurrences each.
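One possible way to implement this selection is sketched below. The `word_freq` mapping stands in for whatever English frequency list is eventually used (the issue does not name one), and both function names are hypothetical. The sentence pass is greedy: it keeps a sentence while any of its lexicon tokens is still below the quota, so some tokens may end up with more than 5 occurrences.

```python
from collections import defaultdict


def top_tokens_by_frequency(lexicon_tokens, word_freq, top_n=4000):
    """Return the top_n lexicon tokens, most frequent first (ties alphabetical)."""
    return sorted(lexicon_tokens, key=lambda t: (-word_freq.get(t, 0), t))[:top_n]


def sample_sentences(sentences, tokens, per_token=5):
    """Greedily keep sentences until each token appears in per_token of them."""
    wanted = set(tokens)
    counts = defaultdict(int)  # token -> number of kept sentences containing it
    kept = []
    for sentence in sentences:
        hits = set(sentence.lower().split()) & wanted
        # Keep the sentence only if it helps a token that is still under quota.
        if any(counts[w] < per_token for w in hits):
            kept.append(sentence)
            for w in hits:
                counts[w] += 1
    return kept, dict(counts)
```

Tokens that never reach the quota in the source corpus show up with a count below `per_token`, which makes coverage gaps easy to spot.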


amirabdullah19852020 self-assigned this Apr 7, 2024

amirabdullah19852020 added the data label Apr 7, 2024
