From 2627f263bc363ee48a41164393f33d697d09bff9 Mon Sep 17 00:00:00 2001
From: ilyasu123
Date: Mon, 4 Jul 2016 10:57:32 -0700
Subject: [PATCH] Revert "Comedian neural network research request"
---
 _requests_for_research/funnybot.html  | 77 ---------------------------
 _requests_for_research/funnybot.html~ | 77 ---------------------------
 2 files changed, 154 deletions(-)
 delete mode 100644 _requests_for_research/funnybot.html
 delete mode 100644 _requests_for_research/funnybot.html~

diff --git a/_requests_for_research/funnybot.html b/_requests_for_research/funnybot.html
deleted file mode 100644
index f6bc7ff..0000000
--- a/_requests_for_research/funnybot.html
+++ /dev/null
@@ -1,77 +0,0 @@
---
title: Comedian language model
summary: ''
difficulty: 2 # out of 3
---

Train a language model capable of generating jokes from one of several predefined categories. This request can be solved as follows:

First, obtain a large corpus of jokes in raw text format that will later be used to train the language model. For initial tests you can use the following existing datasets:

Pun of the Day dataset: ~2,500 puns; 16000 One Liners dataset: ~16,000 one-line jokes; see [Yang, Diyi, et al. "Humor recognition and humor anchor extraction."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.696.1901&rep=rep1&type=pdf)

[Jester datasets](http://eigentaste.berkeley.edu/dataset/): contain around 150 jokes and ratings from a large number of users.

Train a [language model](https://arxiv.org/abs/1602.02410) on the jokes datasets, similarly to [this post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). See whether models trained on the above datasets produce reasonable results.
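
As a concrete starting point, here is a minimal character-level LSTM language model in PyTorch, in the spirit of char-rnn. This is an illustrative sketch only: the corpus file `jokes.txt`, the single-layer LSTM, and all hyperparameters are assumptions, not part of the original request.

```python
# Minimal character-level LSTM language model (illustrative sketch).
import torch
import torch.nn as nn

text = open('jokes.txt').read()                     # assumed: the joke corpus as plain text
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class CharLM(nn.Module):
    def __init__(self, vocab, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

model = CharLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
seq_len, batch = 128, 32

for step in range(10000):
    # sample random contiguous windows; the target is the next character at every position
    idx = torch.randint(0, len(data) - seq_len - 1, (batch,))
    x = torch.stack([data[i:i + seq_len] for i in idx])
    y = torch.stack([data[i + 1:i + seq_len + 1] for i in idx])
    logits, _ = model(x)
    loss = loss_fn(logits.reshape(-1, len(chars)), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

Word-level models and the larger architectures from the linked paper are natural variations; the point is simply to check whether samples from a model trained on the above datasets start to look joke-like.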

Most likely, more data than is available in the above datasets is required to obtain a reasonable language model. Collect such additional data by scraping sites like https://www.reddit.com/r/jokes, http://funtweets.com/, http://funnytweeter.com/, and similar sites. Please make sure to obey each website's policies with respect to web scraping! You can also download reddit comments (not only jokes) using [this torrent](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/).
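
For the /r/jokes data specifically, one lightweight option is reddit's official API via the `praw` package rather than raw HTML scraping. The sketch below is only an assumption about what such a collector could look like; the credentials, the post limit, and the output file name are placeholders.

```python
# Illustrative /r/Jokes collection via the official reddit API (praw).
import json
import praw

reddit = praw.Reddit(client_id='YOUR_ID',            # placeholder credentials
                     client_secret='YOUR_SECRET',
                     user_agent='joke-corpus-builder/0.1')

with open('reddit_jokes.jsonl', 'w') as f:
    for post in reddit.subreddit('Jokes').top(limit=1000):
        joke = {'setup': post.title,
                'punchline': post.selftext,
                'score': post.score,
                'flair': post.link_flair_text}        # coarse category label, when present
        f.write(json.dumps(joke) + '\n')
```

Storing the score and flair alongside the text is cheap and becomes useful for the category-conditioning and RL steps described later.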

To further increase the amount of text the language model is trained on, pretrain a language model on a large corpus of regular text, for example on the [One Billion Words dataset](http://arxiv.org/abs/1312.3005), which you can download [here](http://www.statmt.org/lm-benchmark/). This way your language model will already be initialized with knowledge of the general structure of the English language; then fine-tune the pretrained model on the jokes corpus. One of the outcomes of this research request is to determine whether pretraining helps with joke generation.
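
Fine-tuning here simply means continuing training of the pretrained weights on the much smaller jokes corpus, usually with a reduced learning rate. A minimal sketch, continuing from the earlier example (so `CharLM`, `chars`, `stoi`, and `loss_fn` are assumed to be defined) and with a placeholder checkpoint name:

```python
# Illustrative pretrain-then-fine-tune step; checkpoint and file names are placeholders.
# Note: chars/stoi must be the vocabulary used during pretraining, not one rebuilt from the jokes.
import torch

model = CharLM(len(chars))
model.load_state_dict(torch.load('lm1b_pretrained.pt'))   # weights from One Billion Words pretraining
opt = torch.optim.Adam(model.parameters(), lr=1e-4)        # lower learning rate for fine-tuning

jokes = open('jokes.txt').read()
jokes_data = torch.tensor([stoi[c] for c in jokes if c in stoi], dtype=torch.long)

for step in range(2000):
    idx = torch.randint(0, len(jokes_data) - 129, (32,))
    x = torch.stack([jokes_data[i:i + 128] for i in idx])
    y = torch.stack([jokes_data[i + 1:i + 129] for i in idx])
    logits, _ = model(x)
    loss = loss_fn(logits.reshape(-1, len(chars)), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

Comparing samples and held-out perplexity of this model against the jokes-only model from the previous step answers the question of whether pretraining helps.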

People are all different, and so are their tastes in jokes, so some might prefer a certain category of jokes over others. Modify the language model used in the previous steps so that it can be configured to generate jokes from a certain category only. To do so, train a language model on jokes from https://www.reddit.com/r/jokes using both the joke text and a one-hot encoded label of the joke, such that the language model can be made to generate jokes of a certain type by fixing the corresponding one-hot encoded input label. For the other datasets, jokes can be labeled using a text classifier trained to detect the reddit label from the joke text; a sketch of both steps is given below.
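
One simple way to add the category input (an illustrative design choice, not the only one) is to concatenate a one-hot category vector to the character embedding at every time step:

```python
# Illustrative category-conditioned variant of the character-level LSTM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalCharLM(nn.Module):
    def __init__(self, vocab, n_categories, hidden=512):
        super().__init__()
        self.n_categories = n_categories
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden + n_categories, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x, category, state=None):
        # category: (batch,) integer labels, repeated as a one-hot vector at every time step
        onehot = F.one_hot(category, self.n_categories).float()
        onehot = onehot.unsqueeze(1).expand(-1, x.size(1), -1)
        h, state = self.lstm(torch.cat([self.embed(x), onehot], dim=-1), state)
        return self.out(h), state

# At sampling time, fixing `category` to the desired label steers generation toward that joke type.
```

To propagate the reddit categories to the other corpora, a simple bag-of-words classifier is a reasonable first baseline; the variable names below are hypothetical and stand for lists built from the scraped data.

```python
# Illustrative labeling of the unlabeled jokes with a classifier trained on the reddit labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reddit_joke_texts, reddit_joke_labels)         # hypothetical: scraped text and flair labels
predicted_labels = clf.predict(other_dataset_jokes)    # e.g. Pun of the Day / 16000 One Liners text
```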

The expected outcome of this research request is to determine whether a reasonable language model, as described in the previous paragraphs, can be built with current language modelling approaches.

Related literature:
[Petrovic, Sasa, and David Matthews. "Unsupervised joke generation from big data." ACL (2). 2013.](http://homepages.inf.ed.ac.uk/s0894589/petrovic13unsupervised.pdf)

A potential follow-up, once this research request is solved, is to create a setup in which the resulting neural network can post the jokes it generates somewhere online, receive a reliable score as feedback, and improve itself accordingly using RL.
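
As a rough illustration of that follow-up (purely an assumed setup, reusing the hypothetical `ConditionalCharLM` from the earlier sketch), the online score can be treated as a scalar reward for a sampled joke and used with a REINFORCE-style update:

```python
# Illustrative REINFORCE update: the reward is the (normalized) score a posted joke received online.
import torch

def reinforce_step(model, opt, category, reward, max_len=200):
    """Sample one joke, then scale its log-likelihood gradient by the reward."""
    x = torch.zeros(1, 1, dtype=torch.long)             # assumed start-of-joke token id 0
    log_prob, state = 0.0, None
    for _ in range(max_len):
        logits, state = model(x, category, state)
        dist = torch.distributions.Categorical(logits=logits[:, -1])
        ch = dist.sample()
        log_prob = log_prob + dist.log_prob(ch)
        x = ch.unsqueeze(1)
    loss = -(reward * log_prob).sum()                    # higher-scored jokes become more likely
    opt.zero_grad(); loss.backward(); opt.step()
```

In practice one would average over many samples and subtract a baseline to reduce variance; obtaining reliable score feedback (e.g. upvotes) is itself the hard part of the setup.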
