item['fields'] in json format #22

minottic · 2022-07-08T14:41:26Z

If I understand the logic correctly, whatever is in item['fields'] is converted to a string, cleaned, and split into its component words, which are returned as a list.

panosc-search-scoring/app/ml/preprocessItemsText.py

Lines 96 to 119 in 5d35342

    
    def preprocessItemText(item):
        """
        extract the meaningful fields from the item (which is passed in as a pandas dataframe row)
        Convert them in a string, using json.dumps
        and run all the preprocess steps as highlighted in the PaNOSC search scoring report
        """
        # check if input item is a string
        # if it is not, we assume that it is a panda dataframe row
        outstring = item if isinstance(item,str) else json.dumps(item['fields'])
        outstring = outstring.lower()
        outstring = removePunctuation(outstring,punctuation_symbols)
        outstring = removeStopWords(outstring)
        outstring = removeApostrophy(outstring)
        outstring = removeUnneededSpaces(outstring)
        outstring = convertSentence2Numbers(outstring)
        outstring = removeStopWords(outstring)
        outstring = stemmatize(outstring,stemmer)
        outstring = removePunctuation(outstring,punctuation_symbols)
        outstring = removeUnneededSpaces(outstring)
        outstring = removeShortWords(outstring)
        return outstring.split(' ')

If this is correct (I am not sure I have understood it properly), I don't see the value of allowing item['fields'] to be a dictionary rather than simply restricting it to a list.
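To make the question concrete, here is a minimal sketch of what serializing with json.dumps implies for a dictionary versus a list. It is not the project's code: `simple_tokens` is a hypothetical, heavily simplified stand-in for preprocessItemText (it only lowercases and splits on non-alphanumeric characters, with no stop-word removal or stemming), and the sample field names and values are invented for illustration.

```python
import json
import re


def simple_tokens(fields):
    """Hypothetical stand-in for preprocessItemText: serialize, lowercase, split."""
    # Same first step as the quoted function: strings pass through,
    # anything else is serialized with json.dumps.
    text = fields if isinstance(fields, str) else json.dumps(fields)
    # Split on anything that is not a lowercase letter or digit, drop empties.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


# Invented sample data, for illustration only.
as_dict = {"title": "Neutron scattering", "technique": "SANS"}
as_list = ["Neutron scattering", "SANS"]

dict_tokens = simple_tokens(as_dict)
list_tokens = simple_tokens(as_list)

print(dict_tokens)
print(list_tokens)
```

In this simplified version the dictionary keys ("title", "technique") end up in the token list alongside the values, while the list form yields only the values. Whether the keys survive the real pipeline depends on the stop-word and short-word filters, which are not reproduced here.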

nitrosx · 2022-07-08T15:04:29Z

@minottic: you are correct. The "fields" field can be a string or a dictionary.
I decided to leave it up to the user how to provide it to the system.
My thinking is that some users (like you) would like to pass in a string (maybe preprocessed or filtered in some way), while others (like me) would like to maintain the structure of the information that is used for scoring.
The system accepts both. I would like to keep it that way, but I understand that it might be confusing when reading the documentation.
If you have any suggestions for changes that would clarify how it works, please do let me know.
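As a rough illustration of the two accepted forms, the sketch below mirrors only the first line of the quoted preprocessItemText. Plain dicts stand in for the pandas dataframe row, and the `id` key, the helper name `fields_as_text`, and the sample values are assumptions made for this example, not part of the project.

```python
import json


def fields_as_text(item):
    """Hypothetical helper mirroring the first step of the quoted function."""
    # A bare string is used as-is; otherwise item['fields'] is serialized.
    return item if isinstance(item, str) else json.dumps(item["fields"])


# Invented items: one with a preprocessed string, one with structured fields.
item_with_string = {"id": "a1", "fields": "neutron scattering sans"}
item_with_dict = {
    "id": "a2",
    "fields": {"title": "Neutron scattering", "technique": "SANS"},
}

text_from_string = fields_as_text(item_with_string)
text_from_dict = fields_as_text(item_with_dict)

print(text_from_string)
print(text_from_dict)
```

Both forms go through the same json.dumps call. A string value comes out wrapped in double quotes, and a dictionary comes out as JSON text, so the later punctuation-removal steps are what reduce either one to plain words.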
