IndexError: list index out of range #68

Open
Jibril-Frej opened this issue Mar 24, 2023 · 7 comments

@Jibril-Frej

Some strings make the annotate function crash:

import spacy
from spacy.matcher import PhraseMatcher

# load default skills data base
from skillNer.general_params import SKILL_DB
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor

# init params of skill extractor
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)

skill_extractor.annotate("Learn how to become a professional wedding makeup artist")

If you run the code above you should get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[69], line 1
----> 1 skill_extractor.annotate("Learn how to become a professional wedding makeup artist")

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/skill_extractor_class.py:129, in SkillExtractor.annotate(self, text, tresh)
    123 skills_abv, text_obj = self.skill_getters.get_abv_match_skills(
    124     text_obj, self.matchers['abv_matcher'])
    126 skills_uni_full, text_obj = self.skill_getters.get_full_uni_match_skills(
    127     text_obj, self.matchers['full_uni_matcher'])
--> 129 skills_low_form, text_obj = self.skill_getters.get_low_match_skills(
    130     text_obj, self.matchers['low_form_matcher'])
    132 skills_on_token = self.skill_getters.get_token_match_skills(
    133     text_obj, self.matchers['token_matcher'])
    134 full_sk = skills_full + skills_abv

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/matcher_class.py:332, in SkillsGetter.get_low_match_skills(self, text_obj, matcher)
    329 for match_id, start, end in matcher(doc):
    330     id_ = matcher.vocab.strings[match_id]
--> 332     if text_obj[start].is_matchable:
    333         skills.append({'skill_id': id_+'_lowSurf',
    334                        'doc_node_value': str(doc[start:end]),
    335                        'doc_node_id': list(range(start, end)),
    336                        'type': 'lw_surf'})
    338 return skills, text_obj

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/text_class.py:304, in Text.__getitem__(self, index)
    277 def __getitem__(
    278     self,
    279     index: int
    280 ) -> Word:
    281     """To get the word at the specified position by index
    282 
    283     Parameters
   (...)
    302     english
    303     """
--> 304     return self.list_words[index]

IndexError: list index out of range
@ManalIrfan

Running into the same problem. Is there a way to sanitize the string to avoid it?

@ManalIrfan

ManalIrfan commented Mar 29, 2023

Seems to be a problem with some Unicode characters. Encoding to ASCII and then decoding back to UTF-8 works.

import unicodedata
...

text = "My Random Character text"
text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
annotations = skill_extractor.annotate(text)

@Jibril-Frej
Author

I am still running into the same issue with the encoding/decoding workaround:

import spacy
from spacy.matcher import PhraseMatcher
import unicodedata

# load default skills data base
from skillNer.general_params import SKILL_DB
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor

# init params of skill extractor
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)

text = "Learn how to become a professional wedding makeup artist"
text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
annotations = skill_extractor.annotate(text)

I still get the same error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[2], line 4
      2 text = "Learn how to become a professional wedding makeup artist"
      3 text = unicodedata.normalize('NFKD', text ).encode('ascii', 'ignore').decode("utf-8")
----> 4 annotations = skill_extractor.annotate(text )

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/skill_extractor_class.py:129, in SkillExtractor.annotate(self, text, tresh)
    123 skills_abv, text_obj = self.skill_getters.get_abv_match_skills(
    124     text_obj, self.matchers['abv_matcher'])
    126 skills_uni_full, text_obj = self.skill_getters.get_full_uni_match_skills(
    127     text_obj, self.matchers['full_uni_matcher'])
--> 129 skills_low_form, text_obj = self.skill_getters.get_low_match_skills(
    130     text_obj, self.matchers['low_form_matcher'])
    132 skills_on_token = self.skill_getters.get_token_match_skills(
    133     text_obj, self.matchers['token_matcher'])
    134 full_sk = skills_full + skills_abv

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/matcher_class.py:332, in SkillsGetter.get_low_match_skills(self, text_obj, matcher)
    329 for match_id, start, end in matcher(doc):
    330     id_ = matcher.vocab.strings[match_id]
--> 332     if text_obj[start].is_matchable:
    333         skills.append({'skill_id': id_+'_lowSurf',
    334                        'doc_node_value': str(doc[start:end]),
    335                        'doc_node_id': list(range(start, end)),
    336                        'type': 'lw_surf'})
    338 return skills, text_obj

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/text_class.py:304, in Text.__getitem__(self, index)
    277 def __getitem__(
    278     self,
    279     index: int
    280 ) -> Word:
    281     """To get the word at the specified position by index
    282 
    283     Parameters
   (...)
    302     english
    303     """
--> 304     return self.list_words[index]

IndexError: list index out of range
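
A quick check that does not involve SkillNER at all may explain why the workaround makes no difference for this input: the sentence is already pure ASCII, so the normalization returns it unchanged. The helper name `to_ascii` is made up for this sketch; it only wraps the one-liner from the comment above.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose characters (NFKD), then drop everything that is not ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")


text = "Learn how to become a professional wedding makeup artist"

print(text.isascii())          # True: there is nothing for the workaround to strip
print(to_ascii(text) == text)  # True: the string reaches annotate() unchanged
print(to_ascii("café"))        # cafe: the workaround only changes non-ASCII input
```

So the sanitizing step can only help when the crash is triggered by non-ASCII characters, which does not seem to be the case for this sentence.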

@chrisho51

chrisho51 commented Apr 24, 2023

Facing this issue as well. Did you ever find a solution, @Jibril-Frej?

@Jibril-Frej
Author

Jibril-Frej commented Apr 25, 2023

No real fix. I just wrap the call in a try/except.

try:
    annotations = skill_extractor.annotate(target_text)
except (IndexError, ValueError):
    # skip texts that make annotate crash
    annotations = None

@AJeschor

I am also encountering this error. I would really like to use SkillNER but this issue is really preventing me from being able to do so.

@Yongwoo-Eg-Kim

Yongwoo-Eg-Kim commented Nov 8, 2024

Hello, I found a solution. I think the installed package has not been updated.

First, find 'matcher_class.py' in your package directory:
"YOUR ENVIRONMENT and PACKAGE PATH/skillNer/matcher_class.py"

Then modify the function "get_low_match_skills" so that it matches the version at https://github.com/AnasAito/SkillNER/blob/master/skillNer/matcher_class.py

Add these lines:
# handle skill in the end of phrase
start = start if start < len(text_obj) else start - 1

Complete function:

def get_low_match_skills(
    self,
    text_obj: Text,
    matcher
):

    skills = []
    doc = self.nlp(text_obj.stemmed())

    for match_id, start, end in matcher(doc):
        id_ = matcher.vocab.strings[match_id]
        # handle skill in the end of phrase
        start = start if start < len(text_obj) else start - 1
        if text_obj[start].is_matchable:
            skills.append({'skill_id': id_+'_lowSurf',
                           'doc_node_value': str(doc[start:end]),
                           'doc_node_id': list(range(start, end)),
                           'type': 'lw_surf'})

    return skills, text_obj

Alternatively, you can reinstall the package from the Git repository.
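
To make the effect of the added line concrete, here is a toy illustration that uses a plain list in place of SkillNER's Text object. The function name `clamp_start` and the sample words are invented for this sketch; only the conditional expression is taken from the patch.

```python
def clamp_start(start: int, n_words: int) -> int:
    """Same expression as the patched line: step back one position when out of range."""
    return start if start < n_words else start - 1


words = ["wedding", "makeup", "artist"]  # stand-in for the words held by text_obj

print(clamp_start(1, len(words)))         # 1: in-range indices are left alone
print(clamp_start(3, len(words)))         # 2: an index one past the end is pulled back
print(words[clamp_start(3, len(words))])  # artist: the lookup no longer raises
```

The expression moves the index back by exactly one, so it covers a match that starts right at the end of the text, which appears to be the situation in the traceback above.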

5 participants