Training scispacy pipelines require recreating the vocab file #440

Hammad-NobleAI · 2022-07-14T22:36:42Z

I'm attempting to use your "en_core_sci_lg" pipeline to extract chemical entities from documents, and then use those entities as a basis for training spaCy's Entity Linker (as shown in this document). Here are the relevant portions of my code:

import spacy
import scispacy
from spacy.kb import KnowledgeBase  # needed by create_kb below
nlp = spacy.load("en_core_sci_lg")

# ... prepare TRAIN_DOCS in the form spaCy specifies: a list of tuples (text, {"links": {(span.start, span.end): {qID: probability}}}) ...

entity_linker = nlp.create_pipe("entity_linker", config={"incl_prior": False})

def create_kb(vocab):
    # use the vocab passed in by set_kb rather than the global nlp.vocab
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=200)

    for qid, desc in desc_dict.items():
        desc_doc = nlp(desc)
        desc_enc = desc_doc.vector
        kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)
    return kb

entity_linker.set_kb(create_kb)
nlp.add_pipe("entity_linker", last=True)

import random
from spacy.util import minibatch, compounding
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):   # train only the entity_linker
        optimizer = nlp.begin_training() ## ERROR HERE
        for itn in range(500):   # 500 iterations takes about a minute to train on this small dataset
            random.shuffle(TRAIN_DOCS)
            batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))   # increasing batch size
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,
                    annotations,
                    drop=0.2,   # prevent overfitting
                    losses=losses,
                    sgd=optimizer,
                )
            if itn % 50 == 0:
                print(itn, "Losses", losses)   # print the training loss
print(itn, "Losses", losses)

When I get to the error line (commented towards the end of the code block), I get the following error:

RegistryError: [E893] Could not find function 'replace_tokenizer' in function registry 'callbacks'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.

Available names: spacy.copy_from_base_model.v1, spacy.models_and_pipes_with_nvtx_range.v1, spacy.models_with_nvtx_range.v1

I'm running on Mac OS 12.4, M1 Pro, 16 GB unified memory. Scispacy==0.5.0, spacy==3.2.4. Are Scispacy models compatible with this workflow, or is that something that hasn't/won't be implemented? Thanks in advance!
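
The training-data shape described in the post can be sketched in plain Python. This is a minimal illustration, not code from the thread: the helper `make_train_item`, the type aliases, and the IDs `Q_GOLD` and `Q_OTHER` are hypothetical, and keying the links by character offsets is an assumption (the post writes `span.start` and `span.end`).

```python
from typing import Dict, List, Tuple

# Hypothetical aliases for illustration only; they are not part of spaCy.
Links = Dict[Tuple[int, int], Dict[str, float]]
TrainItem = Tuple[str, Dict[str, Links]]


def make_train_item(
    text: str, mention: str, gold_qid: str, candidate_qids: List[str]
) -> TrainItem:
    """Build one (text, {"links": {(start, end): {qID: probability}}}) tuple."""
    start = text.find(mention)  # assumption: offsets are character positions
    if start == -1:
        raise ValueError(f"Mention {mention!r} not found in text")
    end = start + len(mention)
    # The gold candidate gets probability 1.0, every other candidate 0.0.
    probabilities = {
        qid: (1.0 if qid == gold_qid else 0.0) for qid in candidate_qids
    }
    return text, {"links": {(start, end): probabilities}}


TRAIN_DOCS = [
    make_train_item(
        "Aspirin inhibits cyclooxygenase.",
        "Aspirin",
        "Q_GOLD",  # placeholder ID, not a real knowledge-base entry
        ["Q_GOLD", "Q_OTHER"],
    ),
]

# The training loop in the post unpacks each batch the same way.
texts, annotations = zip(*TRAIN_DOCS)
print(annotations[0])
```

Building the tuples through one helper keeps the offsets and the candidate probabilities consistent across examples, which makes malformed items easier to spot before training starts.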


dakinggg · 2022-07-16T20:32:29Z

Can you try adding the line from scispacy.base_project_code import * to the top of your file?

Hammad-NobleAI · 2022-07-18T15:58:32Z

Thanks for getting back to me. I tried that, and it seems to have got beyond that issue now, but has led into this:

File ~/.pyenv/versions/3.10.5/envs/el-demo/lib/python3.10/site-packages/spacy/language.py:1249, in Language.begin_training(self, get_examples, sgd)
   1242 def begin_training(
   1243     self,
   1244     get_examples: Optional[Callable[[], Iterable[Example]]] = None,
   1245     *,
   1246     sgd: Optional[Optimizer] = None,
   1247 ) -> Optimizer:
   1248     warnings.warn(Warnings.W089, DeprecationWarning)
-> 1249     return self.initialize(get_examples, sgd=sgd)

File ~/.pyenv/versions/3.10.5/envs/el-demo/lib/python3.10/site-packages/spacy/language.py:1286, in Language.initialize(self, get_examples, sgd)
   1284     before_init(self)
   1285 try:
-> 1286     init_vocab(
   1287         self, data=I["vocab_data"], lookups=I["lookups"], vectors=I["vectors"]
   1288     )
...
     23 if require_exists and not location.exists():
---> 24     raise ValueError(f"Can't read file: {location}")
     25 return location

ValueError: Can't read file: project_data/vocab_lg.jsonl
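
The traceback shows `Language.initialize` reading `vocab_data` from the pipeline's initialize settings, so one way to find out which files it expects is to walk that settings block and report path-like values that are missing on disk. The sketch below is a rough diagnostic under stated assumptions: `missing_paths` is a hypothetical helper, the "contains a slash" test is a crude heuristic for spotting paths, and `example_initialize_cfg` is a stand-in dict modeled on the error message, not the actual scispacy config.

```python
from pathlib import Path
from typing import Any, Dict, List


def missing_paths(cfg: Dict[str, Any], base_dir: Path = Path(".")) -> List[str]:
    """List 'key -> value' entries whose path-like value does not exist on disk."""
    missing: List[str] = []
    for key, value in cfg.items():
        if isinstance(value, dict):
            # Recurse into nested blocks and prefix results with the parent key.
            missing.extend(f"{key}.{sub}" for sub in missing_paths(value, base_dir))
        elif isinstance(value, str) and "/" in value:
            # Heuristic (assumption): a string containing a slash is a file path.
            if not (base_dir / value).exists():
                missing.append(f"{key} -> {value}")
    return missing


# Stand-in for the initialize settings; the keys mirror names seen in the
# traceback and error messages, the structure itself is an assumption.
example_initialize_cfg = {
    "vocab_data": "project_data/vocab_lg.jsonl",
    "lookups": None,
    "before_init": {"@callbacks": "replace_tokenizer"},
}

for entry in missing_paths(example_initialize_cfg):
    print("missing:", entry)
```

With a loaded pipeline, the same function could be pointed at `nlp.config["initialize"]`, which would show whether the vocab file is the only missing input or just the first one hit.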

dakinggg · 2022-07-18T18:03:23Z

OK, I think you are working from an outdated example: the begin_training function is deprecated in favor of initialize (https://spacy.io/api/language#initialize). If you want to write your own training loop, you will probably need to look more closely at how spaCy does it in the train CLI. That said, you should use their config system and CLI for training as much as possible; check out project.yml and the configs here: https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson. I also think this is a question about spaCy rather than scispacy, since I expect you will get similar errors if you run your script with en_core_web_md, so further questions are probably better directed to the spaCy folks. Feel free to reopen if it ends up being scispacy-specific.

dakinggg · 2022-07-18T18:10:09Z

Edit: looks like the base spacy models don't have this issue, so it is something more specific. I think it might still be a question for the spacy folks, but first you should try using the config system and CLI.

dakinggg · 2022-07-18T18:30:38Z

If it turns out you do just need that vocab file to continue, you can probably recreate it from the en_core_sci_lg model somehow, but you can definitely also just create it the same way that we do. See the convert-lg command in our project.yml.

dakinggg · 2022-09-16T05:49:14Z

see #450 for a workaround

dakinggg closed this as completed Jul 18, 2022

dakinggg reopened this Jul 18, 2022

dakinggg added the bug label Sep 7, 2022

dakinggg changed the title ~~Training custom EL through Spacy's default approach~~ Training scispacy pipelines require recreating the vocab file Sep 16, 2022
