pick_lstm_model parameters are too complicated to call #10

Open · FrankYFTang opened this issue Jan 22, 2021 · 9 comments

Comments

FrankYFTang (Collaborator)

I have the following simple program to try running all the different models under

https://github.com/unicode-org/lstm_word_segmentation/tree/master/Models

It currently works for Thai_codepoints_exclusive_model4_heavy, but I can't figure out what values need to be passed in for the other models.

# Lint as: python3
"""Read a file and output segmented results."""

import sys
import getopt

from lstm_word_segmentation.word_segmenter import pick_lstm_model


def main(argv):
    inputfile = ''
    outputfile = ''
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print('test.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('test.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    print('Input file is', inputfile)
    print('Output file is', outputfile)

    with open(inputfile, 'r') as file1:
        lines = file1.readlines()

    word_segmenter = pick_lstm_model(model_name="Thai_codepoints_exclusive_model4_heavy",
                                     embedding="codepoints",
                                     train_data="exclusive BEST",
                                     eval_data="exclusive BEST")

    # Strip the newline character from each line before segmenting
    for line in lines:
        line = line.strip()
        print(line)
        print(word_segmenter.segment_arbitrary_line(line))


if __name__ == "__main__":
    main(sys.argv[1:])

Could you specify what values should be used for embedding, train_data and eval_data for the other models?

Burmese_codepoints_exclusive_model4_heavy
Burmese_codepoints_exclusive_model5_heavy
Burmese_codepoints_exclusive_model7_heavy
Burmese_genvec1235_model4_heavy
Burmese_graphclust_model4_heavy
Burmese_graphclust_model5_heavy
Burmese_graphclust_model7_heavy
Thai_codepoints_exclusive_model4_heavy
Thai_codepoints_exclusive_model5_heavy
Thai_codepoints_exclusive_model7_heavy
Thai_genvec123_model5_heavy
Thai_graphclust_model4_heavy
Thai_graphclust_model5_heavy
Thai_graphclust_model7_heavy

Or is there a simpler way? Could we add a simple function

get_lstm_model(model_name) on top of pick_lstm_model() that fills in the necessary parameters and calls pick_lstm_model()?

SahandFarhoodi (Collaborator)

In this document, under "input_name", I explain the relationship between the names of the models and their hyperparameters. For pick_lstm_model it's actually much simpler: embedding should be the embedding that appears in the name of the model. For example, if the model name contains codepoints, use embedding="codepoints"; if it contains graphclust, use embedding="grapheme_clusters_tf". The choice of train_data and eval_data shouldn't matter if you are only segmenting arbitrary lines (by calling the segment_arbitrary_line function), which is what I see in your code. However, if you want to train and evaluate using the BEST data or the my.txt file, you need to set train_data and eval_data to the appropriate values explained in the link above.
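Based on that naming convention, the get_lstm_model(model_name) helper suggested above could be sketched roughly as follows. This is only a sketch: infer_embedding is a hypothetical helper name, and the genvec family is left unhandled because the thread doesn't spell out which embedding value those models use.

```python
def infer_embedding(model_name):
    # Map the substring in the model name to the embedding argument,
    # per the convention described above. "genvec" models are not
    # covered here because their embedding value isn't stated in this
    # thread, so we fail loudly rather than guess.
    if "codepoints" in model_name:
        return "codepoints"
    if "graphclust" in model_name:
        return "grapheme_clusters_tf"
    raise ValueError("Cannot infer embedding from model name: " + model_name)


def get_lstm_model(model_name):
    # train_data/eval_data are irrelevant when only calling
    # segment_arbitrary_line, so placeholder values are passed here.
    from lstm_word_segmentation.word_segmenter import pick_lstm_model
    return pick_lstm_model(model_name=model_name,
                           embedding=infer_embedding(model_name),
                           train_data="exclusive BEST",
                           eval_data="exclusive BEST")
```

With something like this, a caller would only write get_lstm_model("Thai_graphclust_model4_heavy") instead of spelling out all four parameters.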

SahandFarhoodi (Collaborator)

In fact, it would be possible to get rid of the embedding parameter of pick_lstm_model entirely if it were guaranteed that every model trained in the future follows the naming convention I explained in this link; I only left it in because I wasn't sure that would be the case.

FrankYFTang (Collaborator, Author)

FrankYFTang commented Jan 27, 2021 via email

FrankYFTang (Collaborator, Author)

FrankYFTang commented Jan 27, 2021 via email

FrankYFTang (Collaborator, Author)

FrankYFTang commented Jan 27, 2021 via email

SahandFarhoodi (Collaborator)

SahandFarhoodi commented Jan 27, 2021

Yes, apparently an old version of the dictionaries was on our shared Google Drive and I didn't notice it. Sorry if it wasted some of your time. I updated the *.ratio files on our drive. I checked the updated file "Thai_graphclust_ratio.npy" and it seems to give the same numbers that you mentioned above.

So the Python code that you ran was flawed (and I guess you got lower accuracy there), but whatever we had in the JSON files was up to date.

@sffc this should not affect our model performance in Rust, that's probably why we didn't spot it sooner.

FrankYFTang (Collaborator, Author)

FrankYFTang commented Jan 27, 2021 via email

SahandFarhoodi (Collaborator)

SahandFarhoodi commented Jan 28, 2021

I made a commit that does this and left a comment for you there. I forgot to submit a PR, but I basically just changed the files and the lines of code that read/write the dictionaries. Please see my commit.

I also updated our Google drive accordingly.

FrankYFTang (Collaborator, Author)

FrankYFTang commented Jan 29, 2021 via email
