Train a language model on a jokes corpus #37

Open · wants to merge 3 commits into master
Conversation

pranoyr commented May 18, 2018

I have successfully created a character-level language model using the jokes dataset.
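For reference, a minimal character-level model in Keras might look like the sketch below (the 10-character window matches the generate_seq call later in this thread; the layer and vocabulary sizes are assumed placeholders, and the PR's actual train.py may differ):

from keras.models import Sequential
from keras.layers import Dense, LSTM

seq_length, vocab_size = 10, 50  # assumed sizes; vocab_size depends on the corpus

model = Sequential()
# a single LSTM reads one-hot encoded character windows
model.add(LSTM(75, input_shape=(seq_length, vocab_size)))
# softmax over the character vocabulary predicts the next character
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')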

AlphaGit commented

@pranoyr I'm really interested in your approach, especially because the model seems to be very simple (i.e. efficient) with regard to training. Did you have any good loss metrics? Was it able to generate sentences, and even better, funny sentences on its own?

Please note I'm not affiliated with OpenAI's research; I'm just genuinely curious about it.

pranoyr (Author) commented May 19, 2018

When I trained on a small dataset, the model fit well and the loss decreased gradually. When trained on a large dataset, the text generation was satisfactory. I am trying out deeper models that can produce more accurate and funnier results.

AlphaGit commented

@pranoyr I made these very slight modifications in order to run it myself:

@@ -1,3 +1,4 @@
+import sys
 from pickle import load
 from keras.models import load_model
 from keras.utils import to_categorical
@@ -34,4 +35,6 @@ model = load_model('model.h5')
 # load the mapping
 mapping = load(open('mapping.pkl', 'rb'))
 # test not in original
-print(generate_seq(model, mapping, 10, 'hello worl', 20))
+seed_text = sys.argv[1]
+print("Seed text: ", seed_text)
+print(generate_seq(model, mapping, 10, seed_text, 140))
diff --git a/prepare_data.py b/prepare_data.py
index 0221669..bd9a632 100755
--- a/prepare_data.py
+++ b/prepare_data.py
@@ -1,7 +1,7 @@
 # load doc into memory
 def load_doc(filename):
        # open the file as read only
-       file = open(filename, 'r')
+       file = open(filename, 'r', encoding='utf8')
        # read all text
        text = file.read()
        # close the file
@@ -11,7 +11,7 @@ def load_doc(filename):
 # save tokens to file, one dialog per line
 def save_doc(lines, filename):
        data = '\n'.join(lines)
-       file = open(filename, 'w')
+       file = open(filename, 'w', encoding='utf8')
        file.write(data)
        file.close()

diff --git a/train.py b/train.py
index b6a571d..f964019 100755
--- a/train.py
+++ b/train.py
@@ -10,7 +10,7 @@ from keras.layers import LSTM
 # load doc into memory
 def load_doc(filename):
     # open the file as read only
-    file = open(filename, 'r')
+    file = open(filename, 'r', encoding='utf8')
     # read all text
     text = file.read()
     # close the file
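With these changes the seed text comes from the command line. Assuming the first modified file is the generation script (its name is not shown in the diff above, so generate.py here is a guess), a run looks like:

python generate.py "do you know"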

I got the following results, which are quite interesting:


Seed text: do you know

do you know you want to the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I ha


Seed text: how does a

how does a particular and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the ba


Seed text: what do you call a

what do you call a stated to the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have


While it is definitely working, it seems to hit a very strong plateau where it gets stuck repeating particular sentences. This is especially challenging because repetition of certain phrases or words (to preserve context) is an important component of some jokes.
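Loops like this are typical of greedy decoding (always taking the argmax of the softmax output at each step). One common mitigation, not attempted in this thread, is to sample the next character with a temperature; a minimal sketch:

import numpy as np

def sample_char(preds, temperature=0.8):
    # rescale the softmax output: temperature < 1 sharpens the distribution,
    # temperature > 1 flattens it
    preds = np.asarray(preds, dtype='float64')
    preds = np.log(preds + 1e-8) / temperature
    probs = np.exp(preds) / np.sum(np.exp(preds))
    # draw the next character index instead of taking the argmax
    return np.random.choice(len(probs), p=probs)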

pranoyr (Author) commented May 19, 2018

Yes, this was my issue when training on the large dataset. The model is underfitting, and I am trying to build a network large enough for this dataset.
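A straightforward way to add capacity (a sketch only, not the code in this PR; layer sizes are assumptions) is to stack and widen the LSTM layers:

from keras.models import Sequential
from keras.layers import Dense, LSTM

seq_length, vocab_size = 10, 50  # placeholder sizes, as in the earlier sketch

model = Sequential()
# return_sequences=True passes the full hidden-state sequence to the next LSTM
model.add(LSTM(256, return_sequences=True, input_shape=(seq_length, vocab_size)))
model.add(LSTM(256))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')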
