We need to preprocess the text in these files: remove punctuation, remove stopwords, and stem each word. To do this, we need three Python functions:
stem - takes in a word, stems it with the NLTK library (https://www.nltk.org/), and returns the stemmed version of the word. Two common stemmers are the Snowball and Porter stemmers. It doesn't really matter which one you use, but it would be great if the choice were a parameter that can be passed into the function. More information about stemming with the library is here: https://pythonspot.com/nltk-stemming/. The idea of stemming is that all forms of a word that share a meaning, regardless of tense and similar inflections, are counted as the same entry in your representation. For example, "run" and "running" would be one vocabulary word, not two separate ones.
remove_stopwords - takes in a string (i.e., the sentence) and a Python list of words, removes the words in the list from the sentence string, and returns the result.
remove_punctuation - takes in a string, applies a regex (or something equivalent) to remove punctuation and special symbols, and returns the result.
from build_datafiles import remove_stopwords, remove_num_punct, stem
remove_stopwords(list_of_tokens, list_of_stopwords) # tokens are strings representing single words
remove_num_punct(list_of_tokens) # tokens are strings representing single words
stem(list_of_tokens, stemming_algorithm) # tokens are strings representing single words; 'porter' OR 'snowball'
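A minimal sketch of the three helpers, following the token-list signatures in the usage lines above. The regex, the case-insensitive stopword comparison, and the choice to drop tokens that end up empty are assumptions for illustration, not necessarily what build_datafiles.py does. NLTK is imported inside stem so the other two helpers work without it.

```python
import re

# Assumption: keep only ASCII letters; digits, punctuation and symbols are dropped.
_NON_LETTERS = re.compile(r"[^A-Za-z]+")


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword list (case-insensitive)."""
    stop = {word.lower() for word in stopwords}
    return [tok for tok in tokens if tok.lower() not in stop]


def remove_num_punct(tokens):
    """Strip digits, punctuation and symbols from each token; drop empty results."""
    cleaned = (_NON_LETTERS.sub("", tok) for tok in tokens)
    return [tok for tok in cleaned if tok]


def stem(tokens, algorithm="porter"):
    """Stem each token with NLTK's Porter or Snowball (English) stemmer."""
    # Imported here so the helpers above do not require NLTK.
    from nltk.stem import PorterStemmer, SnowballStemmer

    if algorithm == "porter":
        stemmer = PorterStemmer()
    elif algorithm == "snowball":
        stemmer = SnowballStemmer("english")
    else:
        raise ValueError(f"unknown stemming algorithm: {algorithm!r}")
    return [stemmer.stem(tok) for tok in tokens]
```

For example, remove_stopwords(["the", "system", "shall", "run"], ["the", "shall"]) returns ["system", "run"]. Making the stemmer a string parameter keeps the choice configurable from the calling script.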
We need to combine the utterances in the files into a list of all possible combinations of high- and low-level requirements. To do this, we should use a Python script (or a function) that concatenates every combination of file contents and writes the result to disk as two new data files. The first new file contains, on each line, the names of the two files that were concatenated. The second new file contains, on the corresponding line, the concatenated contents of that pair. This is the same format as the HighLevelRequirements and HighLevelRequirementNames files, except that the new files cover all possible combinations of file names. We also need to create a new oracle file for our upcoming experiments: a third file that has either a 0 or a 1 on each line. If the combination of file names on the corresponding line of the names file appears in the original oracle file (i.e., the files are related), the line should contain a 1. Otherwise, it should contain a 0. The result is a file that is mostly zeros, with an occasional 1 marking a combination of files that are related to one another.
# from within virtual env
# must pass folder paths as arguments
python3 build_datafiles.py path_to_cc path_to_uc path_to_oracle dataset_name
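The pairing and labeling step described above can be sketched as follows. This is an illustration under assumptions, not the actual contents of build_datafiles.py: build_pairs and write_lines are hypothetical names, the inputs are assumed to be already loaded into dicts of file name to text, and the oracle is assumed to be already parsed into a set of (high_name, low_name) tuples.

```python
from itertools import product


def build_pairs(high, low, related):
    """Pair every high-level entry with every low-level entry.

    high, low: dicts mapping a file name to its text (hypothetical inputs).
    related:   set of (high_name, low_name) tuples taken from the oracle.
    Returns three parallel lists: name lines, text lines, and 0/1 labels.
    """
    names, texts, labels = [], [], []
    for (h_name, h_text), (l_name, l_text) in product(high.items(), low.items()):
        names.append(f"{h_name} {l_name}")
        texts.append(f"{h_text} {l_text}")
        labels.append(1 if (h_name, l_name) in related else 0)
    return names, texts, labels


def write_lines(path, lines):
    """Write one item per line so the three output files stay aligned."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(f"{line}\n" for line in lines)


# Made-up example data: two high-level and two low-level files, one related pair.
high = {"H1": "user logs in", "H2": "user logs out"}
low = {"L1": "login handler", "L2": "session store"}
names, texts, labels = build_pairs(high, low, {("H1", "L1")})
# labels is [1, 0, 0, 0]
```

Because the three lists are built in the same loop, line i of the names file, the contents file, and the oracle file always describe the same pair.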
We need to build a configurable vocabulary from the files. That can be done with another function: build_vocabulary - takes in two things: a list of strings, and a number we will call "k". The function should go through all of the sentences, split them into words, and count the number of times each word appears across the list of sentences. It should then order the words by frequency and build a dictionary that has the word "UNK" as the first key with value 0. Each word is then added in descending order of frequency until k words have been added to the dictionary. The function returns that dictionary.
from utils import build_vocabulary
build_vocabulary(list_of_strs, top_k_value)
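A minimal sketch of build_vocabulary as described above. Two details are assumptions because the description does not fix them: sentences are split on whitespace, and each word's dictionary value is its position (1 to k) after "UNK".

```python
from collections import Counter


def build_vocabulary(sentences, k):
    """Map "UNK" to 0 and the k most frequent words to 1..k."""
    # Assumption: words are separated by whitespace.
    counts = Counter(word for sentence in sentences for word in sentence.split())
    vocabulary = {"UNK": 0}
    # most_common(k) yields (word, count) pairs in descending order of count.
    for word, _count in counts.most_common(k):
        vocabulary[word] = len(vocabulary)
    return vocabulary
```

For example, build_vocabulary(["a b a", "b a c"], 2) returns {"UNK": 0, "a": 1, "b": 2}; "c" is left out because only the two most frequent words are kept.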
We need to build an encode and a decode function that use the vocabulary to convert between a sentence and its encoded form:
encode - accepts a sentence string and the vocabulary and returns a Python list of size k in which each element is a 1 or a 0, depending on whether the vocabulary word at that index appears in the sentence.
decode - accepts an encoded sentence list and the vocabulary and returns the words of the original sentence (potentially out of order).
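One possible shape for encode and decode, assuming the vocabulary maps each word to its index with "UNK" at index 0. Under that assumption the encoded list has one slot per dictionary entry, so its length is k + 1 rather than k; whether "UNK" should get its own slot is a design choice this sketch makes, not something the description settles.

```python
def encode(sentence, vocabulary):
    """Return a 0/1 list with one slot per vocabulary entry."""
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        # Words outside the vocabulary fall back to the "UNK" slot.
        vector[vocabulary.get(word, vocabulary["UNK"])] = 1
    return vector


def decode(vector, vocabulary):
    """Return the vocabulary words whose slots are set, in index order."""
    index_to_word = {index: word for word, index in vocabulary.items()}
    return " ".join(index_to_word[i] for i, flag in enumerate(vector) if flag)


# Made-up vocabulary for illustration.
vocabulary = {"UNK": 0, "user": 1, "logs": 2, "in": 3}
encoded = encode("in logs", vocabulary)   # [0, 0, 1, 1]
decoded = decode(encoded, vocabulary)     # "logs in"
```

The round trip loses word order and repeated words, which is why decode can only return the sentence "potentially out of order".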
We need to extend the encode function so that each element holds a value computed by another function instead of just a 1. I'm not sure off the top of my head how to do this in Python, but in C# or JS you would use a function pointer to supply an arbitrary function that computes the value. This will let us implement term weighting, and the choice of weighting function becomes an additional hyperparameter for our model. I'm not sure of everyone's comfort level with Python, but I have to believe it supports this. If no one has an idea how to do it, we can collaborate on a solution.
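One way this could work: Python functions are ordinary objects, so a function can be passed as an argument and called inside encode, much like a function pointer or callback. The sketch below is only an illustration. The names encode_weighted, binary_weight, and term_frequency are hypothetical, as is the (count, total) signature of the weighting function.

```python
from collections import Counter


def binary_weight(count, total):
    """Reproduce the plain 0/1 encoding: any word that occurs gets a 1."""
    return 1


def term_frequency(count, total):
    """Weight a word by its share of the words in the sentence."""
    return count / total


def encode_weighted(sentence, vocabulary, weight_fn=binary_weight):
    """Encode a sentence, letting weight_fn decide each non-zero value."""
    # Replace out-of-vocabulary words with "UNK" before counting.
    words = [w if w in vocabulary else "UNK" for w in sentence.split()]
    counts = Counter(words)
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        vector[vocabulary[word]] = weight_fn(count, len(words))
    return vector


# Made-up vocabulary for illustration.
vocabulary = {"UNK": 0, "a": 1, "b": 2}
weighted = encode_weighted("a a b z", vocabulary, term_frequency)
# weighted is [0.25, 0.5, 0.25]
```

Any callable with the same two parameters works, including a lambda such as lambda count, total: count for raw counts, so the weighting scheme can be swapped from a configuration value.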
cd path_to_NLPResearch/NLPResearch/Project_2/
pipenv install # installs dependencies specified in Pipfile
pipenv shell # launches virtual environment
# you only need to run this command once within the virtual env
python -m ipykernel install --user --name=my-virtualenv-name