Newer versions of spaCy transformer model backends failing #246

Open
Louis-Paul-Bowman opened this issue Aug 6, 2024 · 4 comments

@Louis-Paul-Bowman

I use spaCy's transformer model for other purposes (such as NER), so reusing the same model made sense.
It looks like spaCy made some changes to its API that are breaking KeyBERT's spaCy backend.

Sample code:

from keybert import KeyBERT
from spacy import load

nlp = load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
kw_model = KeyBERT(model=nlp)

text = "This is a test sentence."

keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=1, use_mmr=True)
print(keywords)

Expected behavior:
prints [("test", ...)]

Observed behavior:

Traceback (most recent call last):
  File "...\anaconda3\envs\env\lib\site-packages\keybert\backend\_spacy.py", line 84, in embed
    self.embedding_model(doc)._.trf_data.tensors[-1][0].tolist()
AttributeError: 'DocTransformerOutput' object has no attribute 'tensors'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "...\test.py", line 9, in <module>
    keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=1, use_mmr=True)  
  File "...\envs\env\lib\site-packages\keybert\_model.py", line 195, in extract_keywords
    doc_embeddings = self.model.embed(docs)
  File "...\envs\env\lib\site-packages\keybert\backend\_spacy.py", line 88, in embed
    self.embedding_model("An empty document")
AttributeError: 'DocTransformerOutput' object has no attribute 'tensors'

Package versions:
cupy-cuda11x 12.3.0
curated-tokenizers 0.0.9
curated-transformers 0.1.1
en-core-web-trf 3.7.3
keybert 0.8.5
keyphrase-vectorizers 0.0.13
safetensors 0.4.4
scikit-learn 1.5.1
scipy 1.13.1
sentence-transformers 3.0.1
spacy 3.7.5
spacy-alignments 0.9.1
spacy-curated-transformers 0.2.2
spacy-legacy 3.0.12
spacy-loggers 1.0.5
spacy-transformers 1.3.5
thinc 8.2.5
tokenizers 0.15.2
transformers 4.36.2

@MaartenGr
Owner

Thank you for sharing this issue! If I'm not mistaken, this is a result of an updated version of spaCy. I believe there should be an additional check here to see which version of spaCy is being used and, for newer versions, to read from DocTransformerOutput instead. If you are interested, a PR would be great. If you do not have the time, I can start working on it.
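
For example, something along these lines (a minimal sketch; it assumes that spaCy >= 3.7 implies the curated-transformers output, which matches the versions reported above):

```python
from packaging import version
from spacy import __version__ as spacy_version

# en_core_web_trf on spaCy >= 3.7 is built on spacy-curated-transformers,
# whose Doc._.trf_data is a DocTransformerOutput rather than a TransformerData.
uses_curated_transformers = version.parse(spacy_version) >= version.parse("3.7.0")
```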

@Louis-Paul-Bowman
Author

Louis-Paul-Bowman commented Aug 7, 2024

I'll give it a look.
From what I can tell, the easiest solution is probably just to check the spaCy version / the existence of curated-transformers, set a flag, and then in embed replace ._.trf_data.tensors[...] with ._.trf_data.last_hidden_layer_state (see the snippet below).

https://spacy.io/api/curatedtransformer#doctransformeroutput-lasthiddenlayerstate
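
For reference, the attribute access changes roughly like this (an untested sketch, based on the traceback above and the linked docs):

```python
# spaCy < 3.7 (spacy-transformers; Doc._.trf_data is a TransformerData):
embedding = doc._.trf_data.tensors[-1][0]

# spaCy >= 3.7 (spacy-curated-transformers; Doc._.trf_data is a DocTransformerOutput):
embedding = doc._.trf_data.last_hidden_layer_state.data[0]
```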

@Louis-Paul-Bowman
Author

Hello again.
I tinkered with it for as long as I had time for today, but didn't make a PR: while the code runs, I don't think it's functioning as intended. I may be missing some of your original logic, or maybe the new curated transformers got rid of a special '<s>' token that was being used as the document embedding in the final layer; in my minimal example the document embedding (the first token of the final layer) is all zeroes (perhaps because my text starts with a line break? Unknown).

Some notes:
As indicated in your linked issue, spaCy moved to curated-transformers, which changed the properties of the transformer output. The new output has a few key properties (see the inspection snippet after this list):
- .last_hidden_layer_state: per-token tensors of the final layer, always present
- .all_hidden_layer_states: the full model activations, only available when all_layer_outputs=True is set (at init, or on the pipe returned by nlp.get_pipe("transformer"))
- .embedding_layer: unclear how this differs from last_hidden_layer_state; the first token in my case was also all zeroes (and it is likewise only available when all_layer_outputs is True)
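
A quick way to inspect these properties (a minimal sketch; assumes en_core_web_trf 3.7.x with spacy-curated-transformers, as in the package versions listed above):

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("This is a test sentence.")

trf_out = doc._.trf_data  # a DocTransformerOutput under curated-transformers
print(type(trf_out).__name__)

# Per-wordpiece activations of the final layer; .data is a 2D array
# of shape (n_pieces, hidden_dim).
hidden = trf_out.last_hidden_layer_state
print(hidden.data.shape)
print(hidden.data[0])  # first piece; all zeroes in my runs
```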

My updated _spacy.py:

import numpy as np
from tqdm import tqdm
from typing import List
from packaging import version
from spacy import __version__ as spacy_version
from keybert.backend import BaseEmbedder


class SpacyBackend(BaseEmbedder):
    """Spacy embedding model

    The Spacy embedding model used for generating document and
    word embeddings.

    Arguments:
        embedding_model: A spacy embedding model

    Usage:

    To create a Spacy backend, you need to create an nlp object and
    pass it through this backend:

    ```python
    import spacy
    from keybert.backend import SpacyBackend

    nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    spacy_model = SpacyBackend(nlp)
    ```

    To load in a transformer model use the following:

    ```python
    import spacy
    from thinc.api import set_gpu_allocator, require_gpu
    from keybert.backend import SpacyBackend

    nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    set_gpu_allocator("pytorch")
    require_gpu(0)
    spacy_model = SpacyBackend(nlp)
    ```

    If you run into gpu/memory-issues, please use:

    ```python
    import spacy
    from keybert.backend import SpacyBackend

    spacy.prefer_gpu()
    nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    spacy_model = SpacyBackend(nlp)
    ```
    """

    def __init__(self, embedding_model):
        super().__init__()

        # spaCy >= 3.7 ships transformer pipelines built on
        # spacy-curated-transformers, which expose a different output API
        # (DocTransformerOutput instead of TransformerData).
        self.curated_transformers = False

        if "spacy" in str(type(embedding_model)):
            self.embedding_model = embedding_model
            if "transformer" in self.embedding_model.component_names:
                if version.parse(spacy_version) >= version.parse("3.7.0"):
                    self.curated_transformers = True
        else:
            raise ValueError(
                "Please select a correct Spacy model by either using a string such as 'en_core_web_md' "
                "or by creating an nlp model using: `nlp = spacy.load('en_core_web_md')`"
            )

    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings

        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process

        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embeddings size of `m`
        """

        # Extract embeddings from a transformer model
        if "transformer" in self.embedding_model.component_names:
            embeddings = []
            for doc in tqdm(documents, position=0, leave=True, disable=not verbose):
                try:
                    if self.curated_transformers:
                        # DocTransformerOutput: first wordpiece of the final layer
                        embedding = (
                            self.embedding_model(doc)._.trf_data.last_hidden_layer_state.data[0].tolist()
                        )
                    else:
                        # TransformerData: first token of the final tensor
                        embedding = (
                            self.embedding_model(doc)._.trf_data.tensors[-1][0].tolist()
                        )
                except Exception:
                    if self.curated_transformers:
                        embedding = (
                            self.embedding_model("An empty document")
                            ._.trf_data.last_hidden_layer_state.data[0]
                            .tolist()
                        )
                    else:
                        embedding = (
                            self.embedding_model("An empty document")
                            ._.trf_data.tensors[-1][0]
                            .tolist()
                        )
                embeddings.append(embedding)
            embeddings = np.array(embeddings)

        # Extract embeddings from a general spacy model
        else:
            embeddings = []
            for doc in tqdm(documents, position=0, leave=True, disable=not verbose):
                try:
                    vector = self.embedding_model(doc).vector
                except ValueError:
                    vector = self.embedding_model("An empty document").vector
                embeddings.append(vector)
            embeddings = np.array(embeddings)

        return embeddings

My test file:

from keybert import KeyBERT
from spacy import load, require_gpu


require_gpu()
nlp = load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
kw_model = KeyBERT(model=nlp)


test_text = """
Sherlock Holmes (/ˈʃɜːrlɒk ˈhoʊmz/) is a fictional detective created by British author Arthur Conan Doyle. Referring to himself as a "consulting detective" in his stories, Holmes is known for his proficiency with observation, deduction, forensic science and logical reasoning that borders on the fantastic, which he employs when investigating cases for a wide variety of clients, including Scotland Yard.
The character Sherlock Holmes first appeared in print in 1887's A Study in Scarlet. His popularity became widespread with the first series of short stories in The Strand Magazine, beginning with "A Scandal in Bohemia" in 1891; additional tales appeared from then until 1927, eventually totalling four novels and 56 short stories. All but one are set in the Victorian or Edwardian eras between 1880 and 1914. Most are narrated by the character of Holmes's friend and biographer, Dr. John H. Watson, who usually accompanies Holmes during his investigations and often shares quarters with him at the address of 221B Baker Street, London, where many of the stories begin.
Though not the first fictional detective, Sherlock Holmes is arguably the best-known. By the 1990s, over 25,000 stage adaptations, films, television productions, and publications were already featuring the detective, and Guinness World Records lists him as the most portrayed human literary character in film and television history. Holmes's popularity and fame are such that many have believed him to be not a fictional character but an actual individual; numerous literary and fan societies have been founded on this pretence. Avid readers of the Holmes stories helped create the modern practice of fandom. The character and stories have had a profound and lasting effect on mystery writing and popular culture as a whole, with the original tales, as well as thousands written by authors other than Conan Doyle, being adapted into stage and radio plays, television, films, video games, and other media for over one hundred years.
Edgar Allan Poe's C. Auguste Dupin is generally acknowledged as the first detective in fiction and served as the prototype for many later characters, including Holmes. Conan Doyle once wrote, "Each [of Poe's detective stories] is a root from which a whole literature has developed ... Where was the detective story until Poe breathed the breath of life into it?" Similarly, the stories of Émile Gaboriau's Monsieur Lecoq were extremely popular at the time Conan Doyle began writing Holmes, and Holmes's speech and behaviour sometimes follow those of Lecoq. Doyle has his main characters discuss these literary antecedents near the beginning of A Study in Scarlet, which is set soon after Watson is first introduced to Holmes. Watson attempts to compliment Holmes by comparing him to Dupin, to which Holmes replies that he found Dupin to be "a very inferior fellow" and Lecoq to be "a miserable bungler".
Conan Doyle repeatedly said that Holmes was inspired by the real-life figure of Joseph Bell, a surgeon at the Royal Infirmary of Edinburgh, whom Conan Doyle met in 1877 and had worked for as a clerk. Like Holmes, Bell was noted for drawing broad conclusions from minute observations. However, he later wrote to Conan Doyle: "You are yourself Sherlock Holmes and well you know it". Sir Henry Littlejohn, Chair of Medical Jurisprudence at the University of Edinburgh Medical School, is also cited as an inspiration for Holmes. Littlejohn, who was also Police Surgeon and Medical Officer of Health in Edinburgh, provided Conan Doyle with a link between medical investigation and the detection of crime.
Other possible inspirations have been proposed, though never acknowledged by Doyle, such as Maximilien Heller, by French author Henry Cauvain. In this 1871 novel (sixteen years before the first appearance of Sherlock Holmes), Henry Cauvain imagined a depressed, anti-social, opium-smoking polymath detective, operating in Paris. It is not known if Conan Doyle read the novel, but he was fluent in French. Similarly, Michael Harrison suggested that a German self-styled "consulting detective" named Walter Scherer may have been the model for Holmes.
Details of Sherlock Holmes' life in Conan Doyle's stories are scarce and often vague. Nevertheless, mentions of his early life and extended family paint a loose biographical picture of the detective.
A statement of Holmes' age in "His Last Bow" places his year of birth at 1854; the story, set in August 1914, describes him as sixty years of age. His parents are not mentioned, although Holmes mentions that his "ancestors" were "country squires". In "The Adventure of the Greek Interpreter", he claims that his grandmother was sister to the French artist Vernet, without clarifying whether this was Claude Joseph, Carle, or Horace Vernet. Holmes' brother Mycroft, seven years his senior, is a government official. Mycroft has a unique civil service position as a kind of human database for all aspects of government policy. Sherlock describes his brother as the more intelligent of the two, but notes that Mycroft lacks any interest in physical investigation, preferring to spend his time at the Diogenes Club.
Holmes says that he first developed his methods of deduction as an undergraduate; his earliest cases, which he pursued as an amateur, came from his fellow university students. A meeting with a classmate's father led him to adopt detection as a profession.
In the first Holmes tale, A Study in Scarlet, financial difficulties lead Holmes and Dr. Watson to share rooms together at 221B Baker Street, London. Their residence is maintained by their landlady, Mrs. Hudson. Holmes works as a detective for twenty-three years, with Watson assisting him for seventeen of those years. Most of the stories are frame narratives written from Watson's point of view, as summaries of the detective's most interesting cases. Holmes frequently calls Watson's records of Holmes's cases sensational and populist, suggesting that they fail to accurately and objectively report the "science" of his craft:
Detection is, or ought to be, an exact science and should be treated in the same cold and unemotional manner. You have attempted to tinge it [A Study in Scarlet] with romanticism, which produces much the same effect as if you worked a love-story or an elopement into the fifth proposition of Euclid. ... Some facts should be suppressed, or, at least, a just sense of proportion should be observed in treating them. The only point in the case which deserved mention was the curious analytical reasoning from effects to causes, by which I succeeded in unravelling it.
Nevertheless, when Holmes recorded a case himself, he was forced to concede that he could more easily understand the need to write it in a manner that would appeal to the public rather than his intention to focus on his own technical skill.
Holmes's friendship with Watson is his most significant relationship. When Watson is injured by a bullet, although the wound turns out to be "quite superficial", Watson is moved by Holmes's reaction:
"""

keywords = kw_model.extract_keywords(test_text, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=100, use_mmr=True)
print(*keywords, sep="\n")

Outputs from test file:
('000', 0.0)
('generally', 0.0)
('ˈʃɜːrlɒk', 0.0)
('earliest', 0.0)
('investigation', 0.0)
('helped', 0.0)
('including', 0.0)
('official', 0.0)
('1990s', 0.0)
('know', 0.0)
('describes', 0.0)
('auguste', 0.0)
('like', 0.0)
('later', 0.0)
('assisting', 0.0)
('thousands', 0.0)
('dr', 0.0)
('deserved', 0.0)
('monsieur', 0.0)
('fifth', 0.0)
('exact', 0.0)
('british', 0.0)
('year', 0.0)
('wrote', 0.0)
('literary', 0.0)
('lead', 0.0)
('self', 0.0)
('mrs', 0.0)
('kind', 0.0)
('years', 0.0)
('1880', 0.0)
('edinburgh', 0.0)
('senior', 0.0)
('cited', 0.0)
('tales', 0.0)
('lasting', 0.0)
...
all with a score of 0.0

@MaartenGr
Owner

Checking the documentation, it seems that you can access the embedding layer as follows: https://spacy.io/api/curatedtransformer#doctransformeroutput-embeddinglayer. We could then perhaps average all tokens in order to create an embedding for the entire document. Having said that, it would be preferable to find the [CLS] token to use, but I cannot seem to find it in the documentation.
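
A minimal sketch of that mean-pooling idea (an untested assumption; it pools the last hidden layer rather than the embedding layer, since the latter reportedly requires all_layer_outputs=True):

```python
import numpy as np

def pooled_embedding(nlp, text: str) -> np.ndarray:
    """Mean-pool all wordpiece activations of the final transformer layer."""
    doc = nlp(text)
    ragged = doc._.trf_data.last_hidden_layer_state  # (n_pieces, hidden_dim)
    data = ragged.data
    if hasattr(data, "get"):  # a CuPy array when running on GPU
        data = data.get()
    return np.asarray(data).mean(axis=0)
```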
