In general, there is a lot of questionable language in the Reddit dataset, since it is completely unfiltered and includes all subreddits, including NSFW ones. It is still natural language and a potentially useful learning signal, though of course we need to be careful about how the resulting model is used.
We could perhaps add flags to the pipeline for filtering on the NSFW label and similar criteria. These would be off by default.
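As a rough illustration of what an off-by-default flag might look like, here is a minimal sketch. The flag name `--exclude_nsfw`, the `nsfw` key on each sample, and the helper functions are all hypothetical and not part of the existing pipeline; the real field name would depend on what the source data exposes.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags for the (hypothetical) filtering step."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag: store_true means it defaults to False,
    # so existing behaviour is unchanged unless the flag is passed.
    parser.add_argument(
        "--exclude_nsfw",
        action="store_true",
        help="Drop samples whose (assumed) 'nsfw' label is true.",
    )
    return parser.parse_args(argv)


def keep_sample(sample, exclude_nsfw):
    """Return True if the sample should stay in the dataset."""
    if not exclude_nsfw:
        return True
    # Assumption: each sample is a dict with an optional boolean 'nsfw' key.
    # Samples without the key are treated as not NSFW.
    return not sample.get("nsfw", False)


def filter_samples(samples, exclude_nsfw):
    """Apply the NSFW filter to a list of samples."""
    return [s for s in samples if keep_sample(s, exclude_nsfw)]
```

With no flag passed, `filter_samples` returns every sample unchanged; with `--exclude_nsfw`, only samples labelled NSFW are dropped. A profanity filter could follow the same pattern with its own flag.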
Do you think it makes sense to remove samples containing profanity?