Train a language model on a jokes corpus #37
Conversation
@pranoyr I’m really interested in your approach, especially because the model seems to be very simple (i.e. efficient) to train. Did you get any good loss metrics? Was it able to generate sentences, and even better, funny sentences on its own? Please note I’m not affiliated with OpenAI’s research; I’m just genuinely curious about it.
When I trained on a small dataset, the model fit well and the loss decreased gradually. When trained on a large dataset, the text generation was satisfactory. I am trying out deeper models that can produce much more accurate and funnier results.
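For readers following along, a character-level setup of the kind being discussed is typically just an LSTM over one-hot encoded characters. Here is a minimal sketch; the layer sizes, sequence length, and vocabulary size are illustrative assumptions, not the repository's actual values:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Illustrative hyperparameters; the repository's actual values may differ.
seq_length = 10   # characters of context fed to the model
vocab_size = 50   # number of distinct characters in the corpus

model = Sequential()
model.add(LSTM(75, input_shape=(seq_length, vocab_size)))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

Making this "deeper" would mean stacking LSTM layers, with `return_sequences=True` on every layer except the last.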
@pranoyr I made these very slight modifications in order to run it myself:

```diff
+import sys
 from pickle import load
 from keras.models import load_model
 from keras.utils import to_categorical
@@ -34,4 +35,6 @@ model = load_model('model.h5')
 # load the mapping
 mapping = load(open('mapping.pkl', 'rb'))
 # test not in original
-print(generate_seq(model, mapping, 10, 'hello worl', 20))
+seed_text = sys.argv[1]
+print("Seed text: ", seed_text)
+print(generate_seq(model, mapping, 10, seed_text, 140))
diff --git a/prepare_data.py b/prepare_data.py
index 0221669..bd9a632 100755
--- a/prepare_data.py
+++ b/prepare_data.py
@@ -1,7 +1,7 @@
 # load doc into memory
 def load_doc(filename):
     # open the file as read only
-    file = open(filename, 'r')
+    file = open(filename, 'r', encoding='utf8')
     # read all text
     text = file.read()
     # close the file
@@ -11,7 +11,7 @@ def load_doc(filename):
 # save tokens to file, one dialog per line
 def save_doc(lines, filename):
     data = '\n'.join(lines)
-    file = open(filename, 'w')
+    file = open(filename, 'w', encoding='utf8')
     file.write(data)
     file.close()
diff --git a/train.py b/train.py
index b6a571d..f964019 100755
--- a/train.py
+++ b/train.py
@@ -10,7 +10,7 @@ from keras.layers import LSTM
 # load doc into memory
 def load_doc(filename):
     # open the file as read only
-    file = open(filename, 'r')
+    file = open(filename, 'r', encoding='utf8')
     # read all text
     text = file.read()
     # close the file
```
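For context, `generate_seq` itself isn't shown in the diff. A minimal sketch of what such a greedy character-by-character generator typically looks like (an assumed reconstruction, not necessarily the repository's actual code) is:

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    # Hypothetical reconstruction of the generation loop called above.
    in_text = seed_text
    for _ in range(n_chars):
        # encode the characters seen so far as integers
        encoded = [mapping[char] for char in in_text]
        # keep only the last seq_length characters as the model's context
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one-hot encode to match the model's expected input shape
        x = to_categorical(encoded, num_classes=len(mapping))
        # greedy decoding: always take the single most likely next character
        yhat = int(np.argmax(model.predict(x), axis=-1)[0])
        # map the predicted integer back to a character and append it
        in_text += next(ch for ch, idx in mapping.items() if idx == yhat)
    return in_text
```

The greedy argmax step is worth keeping in mind when reading the results below.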
I got the following results, which are quite interesting:

```
do you know you want to the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I ha
how does a particular and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the ba
what do you call a stated to the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have
```

While it is definitely working, it seems to be hitting a very strong plateau where it gets stuck repeating particular sentences. This is especially challenging because some jokes rely heavily on the repetition of certain phrases or words (to preserve context).
Yes, this was my issue when training on the large dataset. The model is underfitting here, and I am trying to build a network large enough for this dataset.
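As an aside, the repetition loops above are also a known artifact of greedy argmax decoding, independent of model capacity. A common remedy (not part of this PR; sketched here only for illustration) is to sample the next character from the softmax output with a temperature:

```python
import numpy as np

def sample_next_char(probs, temperature=0.8):
    # Hypothetical helper: rescale the predicted distribution and sample
    # from it instead of taking the argmax; a higher temperature gives
    # more diverse (but riskier) output.
    probs = np.asarray(probs, dtype='float64')
    logits = np.log(probs + 1e-8) / temperature
    exp = np.exp(logits - np.max(logits))
    return int(np.random.choice(len(probs), p=exp / exp.sum()))
```

Swapping this in for the argmax step in `generate_seq` usually breaks loops like the repeated "the bar and said" above, at the cost of occasional gibberish.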
I have successfully created a character-level language model using the jokes dataset.