item['fields'] in json format #22

Open
minottic opened this issue Jul 8, 2022 · 1 comment

Comments

minottic commented Jul 8, 2022

If I understand the logic correctly, whatever is in item['fields'] is converted to a string, cleaned, and split into its individual words, which are returned as a list.

def preprocessItemText(item):
    """
    extract the meaningful fields from the item (which is passed in as a pandas dataframe row)
    Convert them in a string, using json.dumps
    and run all the preprocess steps as highlighted in the PaNOSC search scoring report
    """
    # check if input item is a string
    # if it is not, we assume that it is a panda dataframe row
    outstring = item if isinstance(item,str) else json.dumps(item['fields'])
    outstring = outstring.lower()
    outstring = removePunctuation(outstring,punctuation_symbols)
    outstring = removeStopWords(outstring)
    outstring = removeApostrophy(outstring)
    outstring = removeUnneededSpaces(outstring)
    outstring = convertSentence2Numbers(outstring)
    outstring = removeStopWords(outstring)
    outstring = stemmatize(outstring,stemmer)
    outstring = removePunctuation(outstring,punctuation_symbols)
    outstring = removeUnneededSpaces(outstring)
    outstring = removeShortWords(outstring)
    return outstring.split(' ')

If this is correct (not too sure if I understood correctly though), I don't see the value of allowing item['fields'] to be a dictionary and not simply restricting it to a list.
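To make the question concrete, here is a rough sketch of what the json.dumps step does to a dictionary compared with a list. The sample values and the `tokens` helper are assumptions made up for illustration: the helper only lowercases and splits on non-alphanumeric characters, so it is a simplified stand-in and not the actual pipeline (no stop-word removal, stemming, or short-word filtering).

```python
import json
import re

def tokens(fields):
    """Simplified stand-in for preprocessItemText (hypothetical helper):
    serialize non-string input with json.dumps, lowercase, split into words."""
    text = fields if isinstance(fields, str) else json.dumps(fields)
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical sample data, not taken from the project
as_dict = {"title": "Neutron diffraction", "technique": "powder"}
as_list = ["Neutron diffraction", "powder"]

dict_tokens = tokens(as_dict)   # the keys "title" and "technique" become words too
list_tokens = tokens(as_list)   # only the values contribute words
```

With a dictionary, json.dumps serializes the keys along with the values, so the key names would enter the word list unless a later cleaning step happens to drop them. With a list, only the values are serialized.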

Collaborator

nitrosx commented Jul 8, 2022

@minottic: you are correct. The field "fields" can be a string or a dictionary.
I decided to leave it up to the user how to provide it to the system.
My thinking is that some users (like you) would like to pass in a string (maybe preprocessed or filtered in some way), while others (like me) would like to maintain the structure of the information that is used for scoring.
The system accepts both. I would like to keep it that way, but I understand that it might be confusing when reading the documentation.
If you have any suggestions for changes that would clarify how it works, please do let me know.
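For readers who prefer the string route, the following is one possible way to turn structured fields into a filtered string before handing it to the system. The `fields_to_text` helper, its `keep` parameter, and the sample data are all hypothetical and not part of the project. It is only a sketch of the "preprocessed or filtered" option mentioned above.

```python
def fields_to_text(fields, keep=None):
    """Hypothetical helper: join the values of selected keys into one string.
    Strings are passed through unchanged; dictionary keys are dropped."""
    if isinstance(fields, str):
        return fields
    # Keep every key by default, otherwise only the requested ones that exist
    keys = list(fields) if keep is None else [k for k in keep if k in fields]
    return " ".join(str(fields[k]) for k in keys)

# Hypothetical sample data
item_fields = {"title": "Neutron diffraction", "technique": "powder", "internal_id": 42}
text = fields_to_text(item_fields, keep=["title", "technique"])
```

Passing the resulting string keeps the key names and any unwanted fields out of the scored text, while passing the dictionary directly preserves the full structure.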
