Train a language model on a jokes corpus #37
Conversation
@pranoyr I’m really interested in your approach, especially because the model seems to be very simple (i.e. efficient) to train. Did you get any good loss metrics? Was it able to generate sentences, and even better, funny sentences on its own? Please note I’m not affiliated with OpenAI’s research; I’m just genuinely curious about it.
When I trained on a small dataset, the model fit well and the loss decreased gradually. When trained on a large dataset, the text generation was satisfactory. I am trying out deeper models that can produce much more accurate and funnier results.
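For readers following along, a character-level setup of the kind being discussed is typically just an LSTM over one-hot encoded characters. Here is a minimal sketch; the layer sizes, sequence length, and vocabulary size are illustrative assumptions, not the repository's actual values:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Illustrative hyperparameters; the repository's actual values may differ.
seq_length = 10   # characters of context fed to the model
vocab_size = 50   # number of distinct characters in the corpus

model = Sequential()
model.add(LSTM(75, input_shape=(seq_length, vocab_size)))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

Making this "deeper" would mean stacking LSTM layers, with `return_sequences=True` on every layer except the last.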
@pranoyr I made these very slight modifications in order to run it myself:

```diff
+import sys
 from pickle import load
 from keras.models import load_model
 from keras.utils import to_categorical
@@ -34,4 +35,6 @@ model = load_model('model.h5')
 # load the mapping
 mapping = load(open('mapping.pkl', 'rb'))
 # test not in original
-print(generate_seq(model, mapping, 10, 'hello worl', 20))
+seed_text = sys.argv[1]
+print("Seed text: ", seed_text)
+print(generate_seq(model, mapping, 10, seed_text, 140))
diff --git a/prepare_data.py b/prepare_data.py
index 0221669..bd9a632 100755
--- a/prepare_data.py
+++ b/prepare_data.py
@@ -1,7 +1,7 @@
 # load doc into memory
 def load_doc(filename):
     # open the file as read only
-    file = open(filename, 'r')
+    file = open(filename, 'r', encoding='utf8')
     # read all text
     text = file.read()
     # close the file
@@ -11,7 +11,7 @@ def load_doc(filename):
 # save tokens to file, one dialog per line
 def save_doc(lines, filename):
     data = '\n'.join(lines)
-    file = open(filename, 'w')
+    file = open(filename, 'w', encoding='utf8')
     file.write(data)
     file.close()
diff --git a/train.py b/train.py
index b6a571d..f964019 100755
--- a/train.py
+++ b/train.py
@@ -10,7 +10,7 @@ from keras.layers import LSTM
 # load doc into memory
 def load_doc(filename):
     # open the file as read only
-    file = open(filename, 'r')
+    file = open(filename, 'r', encoding='utf8')
     # read all text
     text = file.read()
     # close the file
```
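For context, `generate_seq` itself isn't shown in the diff. A minimal sketch of what such a greedy character-by-character generator typically looks like (an assumed reconstruction, not necessarily the repository's actual code) is:

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    # Hypothetical reconstruction of the generation loop called above.
    in_text = seed_text
    for _ in range(n_chars):
        # encode the characters seen so far as integers
        encoded = [mapping[char] for char in in_text]
        # keep only the last seq_length characters as the model's context
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one-hot encode to match the model's expected input shape
        x = to_categorical(encoded, num_classes=len(mapping))
        # greedy decoding: always take the single most likely next character
        yhat = int(np.argmax(model.predict(x), axis=-1)[0])
        # map the predicted integer back to a character and append it
        in_text += next(ch for ch, idx in mapping.items() if idx == yhat)
    return in_text
```

The greedy argmax step is worth keeping in mind when reading the results below.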
I got the following results, which are quite interesting:

```
do you know you want to the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I ha
how does a particular and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the ba
what do you call a stated to the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have
```

While it is definitely working, it seems to be hitting a very strong plateau where it gets stuck repeating particular sentences. This is especially challenging because some jokes rely heavily on the repetition of certain phrases or words (to preserve context).
Yes, this was my issue when training on the large dataset. The model is underfitting here, and I am trying to build a network large enough for this dataset.
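As an aside, the repetition loops above are also a known artifact of greedy argmax decoding, independent of model capacity. A common remedy (not part of this PR; sketched here only for illustration) is to sample the next character from the softmax output with a temperature:

```python
import numpy as np

def sample_next_char(probs, temperature=0.8):
    # Hypothetical helper: rescale the predicted distribution and sample
    # from it instead of taking the argmax; a higher temperature gives
    # more diverse (but riskier) output.
    probs = np.asarray(probs, dtype='float64')
    logits = np.log(probs + 1e-8) / temperature
    exp = np.exp(logits - np.max(logits))
    return int(np.random.choice(len(probs), p=exp / exp.sum()))
```

Swapping this in for the argmax step in `generate_seq` usually breaks loops like the repeated "the bar and said" above, at the cost of occasional gibberish.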
I have successfully created a character-level language model using the jokes dataset.