IndexError: list index out of range #68

Jibril-Frej · 2023-03-24T14:01:17Z

Some strings make the annotate function crash:

import spacy
from spacy.matcher import PhraseMatcher

# load default skills data base
from skillNer.general_params import SKILL_DB
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor

# init params of skill extractor
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)

skill_extractor.annotate("Learn how to become a professional wedding makeup artist")

If you run the code above you should get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[69], line 1
----> 1 skill_extractor.annotate("Learn how to become a professional wedding makeup artist")

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/skill_extractor_class.py:129, in SkillExtractor.annotate(self, text, tresh)
    123 skills_abv, text_obj = self.skill_getters.get_abv_match_skills(
    124     text_obj, self.matchers['abv_matcher'])
    126 skills_uni_full, text_obj = self.skill_getters.get_full_uni_match_skills(
    127     text_obj, self.matchers['full_uni_matcher'])
--> 129 skills_low_form, text_obj = self.skill_getters.get_low_match_skills(
    130     text_obj, self.matchers['low_form_matcher'])
    132 skills_on_token = self.skill_getters.get_token_match_skills(
    133     text_obj, self.matchers['token_matcher'])
    134 full_sk = skills_full + skills_abv

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/matcher_class.py:332, in SkillsGetter.get_low_match_skills(self, text_obj, matcher)
    329 for match_id, start, end in matcher(doc):
    330     id_ = matcher.vocab.strings[match_id]
--> 332     if text_obj[start].is_matchable:
    333         skills.append({'skill_id': id_+'_lowSurf',
    334                        'doc_node_value': str(doc[start:end]),
    335                        'doc_node_id': list(range(start, end)),
    336                        'type': 'lw_surf'})
    338 return skills, text_obj

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/text_class.py:304, in Text.__getitem__(self, index)
    277 def __getitem__(
    278     self,
    279     index: int
    280 ) -> Word:
    281     """To get the word at the specified position by index
    282 
    283     Parameters
   (...)
    302     english
    303     """
--> 304     return self.list_words[index]

IndexError: list index out of range

ManalIrfan · 2023-03-28T23:54:45Z

Running into the same problem. Is there a way to sanitize the string to avoid it?

ManalIrfan · 2023-03-29T00:22:41Z

Seems to be a problem with some Unicode characters. Encoding to ASCII and then decoding back to UTF-8 works.

import unicodedata
...

text = "My Random Character text"
text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
annotations = skill_extractor.annotate(text)

Jibril-Frej · 2023-03-30T06:28:38Z

I am still running into the same issue when using the encoding/decoding workaround:

import spacy
from spacy.matcher import PhraseMatcher
import unicodedata

# load default skills data base
from skillNer.general_params import SKILL_DB
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor

# init params of skill extractor
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)

text = "Learn how to become a professional wedding makeup artist"
text = unicodedata.normalize('NFKD', text ).encode('ascii', 'ignore').decode("utf-8")
annotations = skill_extractor.annotate(text )

I still get the same error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[2], line 4
      2 text = "Learn how to become a professional wedding makeup artist"
      3 text = unicodedata.normalize('NFKD', text ).encode('ascii', 'ignore').decode("utf-8")
----> 4 annotations = skill_extractor.annotate(text )

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/skill_extractor_class.py:129, in SkillExtractor.annotate(self, text, tresh)
    123 skills_abv, text_obj = self.skill_getters.get_abv_match_skills(
    124     text_obj, self.matchers['abv_matcher'])
    126 skills_uni_full, text_obj = self.skill_getters.get_full_uni_match_skills(
    127     text_obj, self.matchers['full_uni_matcher'])
--> 129 skills_low_form, text_obj = self.skill_getters.get_low_match_skills(
    130     text_obj, self.matchers['low_form_matcher'])
    132 skills_on_token = self.skill_getters.get_token_match_skills(
    133     text_obj, self.matchers['token_matcher'])
    134 full_sk = skills_full + skills_abv

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/matcher_class.py:332, in SkillsGetter.get_low_match_skills(self, text_obj, matcher)
    329 for match_id, start, end in matcher(doc):
    330     id_ = matcher.vocab.strings[match_id]
--> 332     if text_obj[start].is_matchable:
    333         skills.append({'skill_id': id_+'_lowSurf',
    334                        'doc_node_value': str(doc[start:end]),
    335                        'doc_node_id': list(range(start, end)),
    336                        'type': 'lw_surf'})
    338 return skills, text_obj

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/text_class.py:304, in Text.__getitem__(self, index)
    277 def __getitem__(
    278     self,
    279     index: int
    280 ) -> Word:
    281     """To get the word at the specified position by index
    282 
    283     Parameters
   (...)
    302     english
    303     """
--> 304     return self.list_words[index]

IndexError: list index out of range
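
One thing worth checking, as a minimal sketch: the normalization above only changes strings that contain non-ASCII characters. The sample sentence from this issue is already pure ASCII, so the workaround leaves it unchanged, which would be consistent with the error persisting. The helper name to_ascii is hypothetical and used only for illustration.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose characters (NFKD), then drop anything that is not ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")


sample = "Learn how to become a professional wedding makeup artist"

# An accented string is changed by the normalization...
print(to_ascii("caf\u00e9"))       # cafe
# ...but the sentence from this issue is already ASCII, so it comes back as-is.
print(sample.isascii())            # True
print(to_ascii(sample) == sample)  # True
```

So for this particular input the crash is probably not caused by Unicode characters, even if the workaround helps for other strings.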

chrisho51 · 2023-04-24T20:11:52Z

Facing this issue as well. Did you ever find a solution, @Jibril-Frej?

Jibril-Frej · 2023-04-25T06:25:08Z

No real fix. I just wrap the call in a try/except and skip the texts that fail.

try:
    skill_extractor.annotate(target_text)
except (IndexError, ValueError):
    # skip texts that make annotate crash
    pass

AJeschor · 2024-03-12T21:06:45Z

I am also encountering this error. I would really like to use SkillNER but this issue is really preventing me from being able to do so.

Yongwoo-Eg-Kim · 2024-11-08T14:49:58Z

Hello, I found a solution. I think the released package has not been updated with the fix.

First, find 'matcher_class.py' in your installed package directory:
"YOUR ENVIRONMENT and PACKAGE PATH/skillNer/matcher_class.py"
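
If you are not sure where that file lives, Python can report it. This is a small sketch using only the standard library; module_path is a hypothetical helper name, and the module name skillNer.matcher_class is taken from the traceback paths earlier in this thread.

```python
import importlib.util
from typing import Optional


def module_path(module_name: str) -> Optional[str]:
    """Return the file path of an importable module, or None if it is not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when the parent package (here: skillNer) is not installed.
        return None
    return spec.origin if spec is not None else None


# Prints the path to matcher_class.py, or None if skillNer is not installed
# in the current environment.
print(module_path("skillNer.matcher_class"))
```

Run it with the same interpreter or environment that you use for SkillNER, otherwise it may point to a different installation or print None.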

Then modify the function "get_low_match_skills" to match the version at https://github.com/AnasAito/SkillNER/blob/master/skillNer/matcher_class.py :

Add these lines inside the loop:
# handle skill in the end of phrase
start = start if start < len(text_obj) else start - 1

Complete function:

def get_low_match_skills(
    self,
    text_obj: Text,
    matcher
):

    skills = []
    doc = self.nlp(text_obj.stemmed())

    for match_id, start, end in matcher(doc):
        id_ = matcher.vocab.strings[match_id]
        # handle skill in the end of phrase
        start = start if start < len(text_obj) else start - 1
        if text_obj[start].is_matchable:
            skills.append({'skill_id': id_+'_lowSurf',
                           'doc_node_value': str(doc[start:end]),
                           'doc_node_id': list(range(start, end)),
                           'type': 'lw_surf'})

    return skills, text_obj

Alternatively, you can reinstall the package directly from the Git repository.
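
To see what the added line does, here is a minimal sketch on plain integers. In the function above, the match indices come from the doc built from the stemmed text but are then used to index text_obj, so start can apparently land one past the last word; the patched line steps back by one in that case. The function name clamp_start is hypothetical; only the expression inside it is taken from the fix.

```python
def clamp_start(start: int, n_words: int) -> int:
    """Mirror the patched line: step back one position when start is past the end."""
    return start if start < n_words else start - 1


words = ["learn", "makeup", "artist"]  # toy stand-in for text_obj

print(clamp_start(1, len(words)))  # 1 (in range, unchanged)
print(clamp_start(3, len(words)))  # 2 (one past the end, moved to the last word)
```

Note that the expression only moves the index back by one position, so a start that is two or more past the end would still be out of range. I have not checked whether that can happen in practice.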
