Removing samples containing profanity #51

vmurahari3 · 2019-06-29T19:27:45Z

Do you think it makes sense to remove samples containing profanity?

matthen · 2019-07-01T01:47:14Z

In general, there is a lot of questionable language in the Reddit dataset: it is completely unfiltered, and we include all subreddits, even the 'nsfw' ones. It is still natural language and a potentially useful learning signal, though of course we need to be careful about how the resulting model is used.

We could perhaps add flags to the pipeline for filtering based on the nsfw label etc. These would be off by default.
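
As a rough sketch of what such opt-in flags might look like: the flag names `--filter_nsfw` and `--filter_profanity`, the `nsfw` and `body` sample fields, and the tiny word list below are all hypothetical and not taken from the actual pipeline. Both filters are off unless explicitly requested.

```python
import argparse
import re

# Hypothetical placeholder word list; a real list would be far longer.
PROFANITY = {"darn", "heck"}
_WORD_RE = re.compile(r"[a-z']+")


def parse_args(argv=None):
    """Parse the (hypothetical) filtering flags; both default to off."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--filter_nsfw", action="store_true",
                        help="Drop samples labelled nsfw.")
    parser.add_argument("--filter_profanity", action="store_true",
                        help="Drop samples containing a listed word.")
    return parser.parse_args(argv)


def keep_sample(sample, args):
    """Return True if the sample passes every enabled filter."""
    # Assumes each sample is a dict with an optional boolean "nsfw" field.
    if args.filter_nsfw and sample.get("nsfw", False):
        return False
    if args.filter_profanity:
        # Lowercase and tokenize, then check for overlap with the word list.
        words = set(_WORD_RE.findall(sample.get("body", "").lower()))
        if words & PROFANITY:
            return False
    return True


def filter_samples(samples, args):
    """Apply the enabled filters to a list of samples."""
    return [s for s in samples if keep_sample(s, args)]
```

With no flags passed, `filter_samples` returns every sample unchanged, which matches the "off by default" behaviour suggested above. Whole-word matching against a fixed list is only one possible choice; it misses misspellings and obfuscated spellings.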
