style: apply pre-commit hooks
afuetterer committed Jul 11, 2024
1 parent ba39b72 commit 6e76327
Showing 38 changed files with 225 additions and 332 deletions.
6 changes: 3 additions & 3 deletions docs/changelog.md
@@ -59,7 +59,7 @@ kw_model = KeyLLM(llm)

* Use `KeyLLM` to leverage LLMs for extracting keywords
* Use it either with or without candidate keywords generated through `KeyBERT`
-* Multiple LLMs are integrated: OpenAI, Cohere, LangChain, HF, and LiteLLM
+* Multiple LLMs are integrated: OpenAI, Cohere, LangChain, HF, and LiteLLM

```python
import openai
@@ -101,7 +101,7 @@ doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```

-Do note that the parameters passed to `.extract_embeddings` for creating the vectorizer should be exactly the same as those in `.extract_keywords`.
+Do note that the parameters passed to `.extract_embeddings` for creating the vectorizer should be exactly the same as those in `.extract_keywords`.

**Fixes**:

@@ -137,7 +137,7 @@ kw_model = KeyBERT(model=hf_model)

**NOTE**: Although highlighting for Chinese texts is improved, since I am not familiar with the Chinese language there is a good chance it is not yet as optimized as for other languages. Any feedback with respect to this is highly appreciated!

-**Fixes**:
+**Fixes**:

* Fix typo in ReadMe by [@priyanshul-govil](https://github.com/priyanshul-govil) in [#117](https://github.com/MaartenGr/KeyBERT/pull/117)
* Add missing optional dependencies (gensim, use, and spacy) by [@yusuke1997](https://github.com/yusuke1997)
6 changes: 3 additions & 3 deletions docs/faq.md
@@ -21,11 +21,11 @@ topic modeling to HTML-code to extract topics of code, then it becomes important


## **How can I speed up the model?**
-Since KeyBERT uses large language models as its backend, a GPU is typically preferred when using this package.
+Since KeyBERT uses large language models as its backend, a GPU is typically preferred when using this package.
Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.

-A second method for speeding up KeyBERT is by passing it multiple documents at once. By doing this, words
-only need to be embedded a single time, which can result in a major speed-up.
+A second method for speeding up KeyBERT is by passing it multiple documents at once. By doing this, words
+only need to be embedded a single time, which can result in a major speed-up.
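
A rough sketch of why batching helps, under loud assumptions: `embed_words` below is a hypothetical stand-in for an expensive embedding model and is not part of KeyBERT's API. It only counts how many word embeddings would be computed in each setup.

```python
# Toy illustration only: `embed_words` is a hypothetical stand-in for an
# expensive embedding model; it is not part of KeyBERT's API.
calls = {"count": 0}


def embed_words(words):
    """Pretend to embed each word and count how many embeddings were computed."""
    calls["count"] += len(words)
    return {word: [float(len(word))] for word in words}


docs = [
    "supervised learning uses labeled data",
    "unsupervised learning uses unlabeled data",
]

# One document at a time: shared words are embedded again for every document.
calls["count"] = 0
for doc in docs:
    embed_words(sorted(set(doc.split())))
per_document = calls["count"]

# All documents at once: the vocabulary is built once and embedded once.
calls["count"] = 0
vocabulary = sorted({word for doc in docs for word in doc.split()})
embed_words(vocabulary)
batched = calls["count"]

print(per_document, batched)
```

In this toy setup the batched path embeds fewer words because `learning`, `uses`, and `data` are shared between the two documents; the real saving depends on how much vocabulary the documents have in common.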

This is **faster**:

6 changes: 3 additions & 3 deletions docs/guides/embeddings.md
@@ -21,7 +21,7 @@ kw_model = KeyBERT(model=sentence_model)
```

### 🤗 **Hugging Face Transformers**
-To use a Hugging Face transformers model, load in a pipeline and point
+To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models):

```python
@@ -32,8 +32,8 @@ kw_model = KeyBERT(model=hf_model)
```

!!! tip "Tip!"
-These transformers also work quite well using `sentence-transformers`, which has a number of
-optimization tricks that make using it a bit faster.
+These transformers also work quite well using `sentence-transformers`, which has a number of
+optimization tricks that make using it a bit faster.

### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
6 changes: 3 additions & 3 deletions docs/guides/keyllm.md
@@ -16,7 +16,7 @@ documents = [

This data was chosen to show the different use cases and techniques. As you might have noticed, documents 1 and 2 are quite similar, whereas document 3 is about an entirely different subject. This similarity will be taken into account when using `KeyBERT` together with `KeyLLM`.

-Let's start with `KeyLLM` only.
+Let's start with `KeyLLM` only.


# Use Cases
@@ -180,7 +180,7 @@ If you have embeddings of your documents, you could use those to find documents
</div>

!!! Tip
-Before you get started, it might be worthwhile to uninstall sentence-transformers and re-install it from the main branch.
+Before you get started, it might be worthwhile to uninstall sentence-transformers and re-install it from the main branch.
There is an issue with community detection (cluster) that might make the model run without finishing. It is as straightforward as:
`pip uninstall sentence-transformers`
`pip install --upgrade git+https://github.com/UKPLab/sentence-transformers`
@@ -231,7 +231,7 @@ This is the best of both worlds. We use `KeyBERT` to generate a first pass of ke
</div>

!!! Tip
-Before you get started, it might be worthwhile to uninstall sentence-transformers and re-install it from the main branch.
+Before you get started, it might be worthwhile to uninstall sentence-transformers and re-install it from the main branch.
There is an issue with community detection (cluster) that might make the model run without finishing. It is as straightforward as:
`pip uninstall sentence-transformers`
`pip install --upgrade git+https://github.com/UKPLab/sentence-transformers`
14 changes: 7 additions & 7 deletions docs/guides/llms.md
@@ -3,7 +3,7 @@ In this tutorial we will be going through the Large Language Models (LLM) that c
Having the option to choose the LLM allows you to leverage the model that suits your use case.

### **OpenAI**
-To use OpenAI's external API, we need to define our key and use the `keybert.llm.OpenAI` model.
+To use OpenAI's external API, we need to define our key and use the `keybert.llm.OpenAI` model.

We install the package first:

@@ -98,7 +98,7 @@ kw_model = KeyLLM(llm)
```

### 🤗 **Hugging Face Transformers**
-To use a Hugging Face transformers model, load in a pipeline and point
+To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models). Let's use Llama 2 as an example:

```python
@@ -109,8 +109,8 @@ model_id = 'meta-llama/Llama-2-7b-chat-hf'

# 4-bit Quantization to load Llama 2 with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
-load_in_4bit=True,
-bnb_4bit_quant_type='nf4',
+load_in_4bit=True,
+bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=bfloat16
)
@@ -152,15 +152,15 @@ I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.
Please give me the keywords that are present in this document and separate them with commas.
-Make sure you only return the keywords and say nothing else. For example, don't say:
+Make sure you only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken [INST]
I have the following document:
- [DOCUMENT]
Please give me the keywords that are present in this document and separate them with commas.
-Make sure you only return the keywords and say nothing else. For example, don't say:
+Make sure you only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""
@@ -200,4 +200,4 @@ llm = LangChain(chain)

# Load it in KeyLLM
kw_model = KeyLLM(llm)
-```
+```
24 changes: 12 additions & 12 deletions docs/guides/quickstart.md
@@ -78,9 +78,9 @@ keywords = kw_model.extract_keywords(doc, highlight=True)

## **Fine-tuning**

-As a default, KeyBERT simply compares the documents and candidate keywords/keyphrases based on their cosine similarity. However, this might lead
-to very similar words ending up in the list of most accurate keywords/keyphrases. To make sure they are a bit more diversified, there are two
-approaches that we can take in order to fine-tune our output, **Max Sum Distance** and **Maximal Marginal Relevance**.
+As a default, KeyBERT simply compares the documents and candidate keywords/keyphrases based on their cosine similarity. However, this might lead
+to very similar words ending up in the list of most accurate keywords/keyphrases. To make sure they are a bit more diversified, there are two
+approaches that we can take in order to fine-tune our output, **Max Sum Distance** and **Maximal Marginal Relevance**.
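
To make the diversification idea concrete, here is a minimal pure-Python sketch of greedy Maximal Marginal Relevance. It is an illustration under stated assumptions, not KeyBERT's implementation: the helper names `cosine` and `mmr`, the two-dimensional toy vectors, and the exact scoring formula are all chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr(doc_vec, candidates, top_n=2, diversity=0.5):
    """Greedy MMR over a {keyword: vector} mapping (hypothetical helper)."""
    remaining = dict(candidates)
    # Start with the candidate most similar to the document.
    first = max(remaining, key=lambda w: cosine(remaining[w], doc_vec))
    selected = [first]
    del remaining[first]

    while remaining and len(selected) < top_n:

        def score(word):
            # Reward similarity to the document, penalize similarity to
            # keywords that were already selected.
            relevance = cosine(remaining[word], doc_vec)
            redundancy = max(cosine(remaining[word], candidates[s]) for s in selected)
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy vectors: "learning" and "learn" are near-duplicates, "data" is different.
doc_vec = (1.0, 0.0)
candidates = {"learning": (1.0, 0.1), "learn": (1.0, 0.2), "data": (0.6, 0.8)}

low_diversity = mmr(doc_vec, candidates, top_n=2, diversity=0.0)
high_diversity = mmr(doc_vec, candidates, top_n=2, diversity=0.7)
print(low_diversity, high_diversity)
```

With `diversity=0.0` the sketch simply ranks by similarity and returns the two near-duplicates; with a higher value the redundancy penalty pushes `data` ahead of `learn`.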

### **Max Sum Distance**

@@ -165,8 +165,8 @@ keywords = kw_model.extract_keywords(doc, seed_keywords=seed_keywords)

## **Prepare embeddings**

-When you have a large dataset and you want to fine-tune parameters such as `diversity` it can take quite a while to re-calculate the document and
-word embeddings each time you change a parameter. Instead, we can pre-calculate these embeddings and pass them to `.extract_keywords` such that
+When you have a large dataset and you want to fine-tune parameters such as `diversity` it can take quite a while to re-calculate the document and
+word embeddings each time you change a parameter. Instead, we can pre-calculate these embeddings and pass them to `.extract_keywords` such that
we only have to calculate it once:


@@ -183,15 +183,15 @@ You can then use these embeddings and pass them to `.extract_keywords` to speed
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```

-There are several parameters in `.extract_embeddings` that define how the list of candidate keywords/keyphrases is generated:
+There are several parameters in `.extract_embeddings` that define how the list of candidate keywords/keyphrases is generated:

* `candidates`
* `keyphrase_ngram_range`
-* `stop_words`
+* `stop_words`
* `min_df`
* `vectorizer`

-The values of these parameters need to be exactly the same in `.extract_embeddings` as they are in `.extract_keywords`.
+The values of these parameters need to be exactly the same in `.extract_embeddings` as they are in `.extract_keywords`.

In other words, the following will work as they use the same parameter subset:

@@ -200,8 +200,8 @@ from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=1, stop_words="english")
-keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
-doc_embeddings=doc_embeddings,
+keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
+doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
```

@@ -212,7 +212,7 @@ from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=3, stop_words="dutch")
-keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
-doc_embeddings=doc_embeddings,
+keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
+doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
```
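
One way to see why the parameters must match is that they decide which candidate words exist at all. The sketch below is a simplified assumption, not KeyBERT's vectorizer: `candidate_words` and the tiny `STOP_WORDS` table are hypothetical, and only `min_df` and `stop_words` are modeled.

```python
from collections import Counter

# Hypothetical, simplified stand-in for the vectorizer step; not KeyBERT's code.
STOP_WORDS = {"english": {"the", "is", "a"}, "dutch": {"de", "het", "een"}}


def candidate_words(docs, min_df=1, stop_words="english"):
    """Return the sorted vocabulary after stop-word and document-frequency filtering."""
    doc_freq = Counter(word for doc in docs for word in set(doc.lower().split()))
    return sorted(
        word
        for word, freq in doc_freq.items()
        if freq >= min_df and word not in STOP_WORDS[stop_words]
    )


docs = ["the model is fast", "the model is a transformer", "the tokenizer is fast"]

same_a = candidate_words(docs, min_df=1, stop_words="english")
same_b = candidate_words(docs, min_df=1, stop_words="english")
different = candidate_words(docs, min_df=3, stop_words="dutch")

print(same_a == same_b)
print(same_a, different)
```

If word embeddings were computed for one candidate list and then paired with another, the vectors would no longer line up with the words they are supposed to represent, which is the situation the second example above runs into.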
4 changes: 2 additions & 2 deletions docs/images/guided.svg
4 changes: 2 additions & 2 deletions docs/images/pipeline.svg
2 changes: 1 addition & 1 deletion docs/index.md
@@ -99,4 +99,4 @@ of words you would like in the resulting keyphrases:
```

!!! note "NOTE"
-You can also pass multiple documents at once if you are looking for a major speed-up!
+You can also pass multiple documents at once if you are looking for a major speed-up!
2 changes: 1 addition & 1 deletion docs/stylesheets/extra.css
@@ -6,7 +6,7 @@
--md-typeset-a-color: #0277BD;
}

-body[data-md-color-primary="black"] .excalidraw svg {
+body[data-md-color-primary="black"] .excalidraw svg {
filter: invert(100%) hue-rotate(180deg);
}

5 changes: 5 additions & 0 deletions keybert/__init__.py
@@ -4,3 +4,8 @@
 from keybert._model import KeyBERT

 __version__ = version("keybert")
+
+__all__ = [
+    "KeyBERT",
+    "KeyLLM",
+]
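
For readers unfamiliar with `__all__`, the following sketch shows its effect on star imports. It builds a throwaway in-memory module; the module name `toy_pkg` and the `helper` function are hypothetical and only the two exported names mirror the diff.

```python
import sys
import types

# Build a throwaway module in memory; `toy_pkg` and its contents are hypothetical.
toy = types.ModuleType("toy_pkg")
exec(
    "__all__ = ['KeyBERT', 'KeyLLM']\n"
    "class KeyBERT: pass\n"
    "class KeyLLM: pass\n"
    "def helper(): pass\n",
    toy.__dict__,
)
sys.modules["toy_pkg"] = toy

# A star import only pulls in the names listed in `__all__`.
namespace = {}
exec("from toy_pkg import *", namespace)

exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)  # ['KeyBERT', 'KeyLLM']
```

Declaring `__all__` also tells linters that the imports in `__init__.py` are intentional re-exports, which is a plausible reason for a formatting and linting commit to add it.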
31 changes: 9 additions & 22 deletions keybert/_highlight.py
@@ -11,10 +11,8 @@ class NullHighlighter(RegexHighlighter):
     highlights = [r""]


-def highlight_document(
-    doc: str, keywords: List[Tuple[str, float]], vectorizer: CountVectorizer
-):
-    """Highlight keywords in a document
+def highlight_document(doc: str, keywords: List[Tuple[str, float]], vectorizer: CountVectorizer):
+    """Highlight keywords in a document.

     Arguments:
         doc: The document for which to extract keywords/keyphrases.
@@ -38,10 +36,8 @@ def highlight_document(
     console.print(highlighted_text)


-def _highlight_one_gram(
-    doc: str, keywords: List[str], vectorizer: CountVectorizer
-) -> str:
-    """Highlight 1-gram keywords in a document
+def _highlight_one_gram(doc: str, keywords: List[str], vectorizer: CountVectorizer) -> str:
+    """Highlight 1-gram keywords in a document.

     Arguments:
         doc: The document for which to extract keywords/keyphrases.
@@ -57,18 +53,13 @@ def _highlight_one_gram(
     separator = "" if "zh" in str(tokenizer) else " "

     highlighted_text = separator.join(
-        [
-            f"[black on #FFFF00]{token}[/]" if token.lower() in keywords else f"{token}"
-            for token in tokens
-        ]
+        [f"[black on #FFFF00]{token}[/]" if token.lower() in keywords else f"{token}" for token in tokens]
     ).strip()
     return highlighted_text


-def _highlight_n_gram(
-    doc: str, keywords: List[str], vectorizer: CountVectorizer
-) -> str:
-    """Highlight n-gram keywords in a document
+def _highlight_n_gram(doc: str, keywords: List[str], vectorizer: CountVectorizer) -> str:
+    """Highlight n-gram keywords in a document.

     Arguments:
         doc: The document for which to extract keywords/keyphrases.
@@ -85,8 +76,7 @@ def _highlight_n_gram(
     separator = "" if "zh" in str(tokenizer) else " "

     n_gram_tokens = [
-        [separator.join(tokens[i : i + max_len][0 : j + 1]) for j in range(max_len)]
-        for i, _ in enumerate(tokens)
+        [separator.join(tokens[i : i + max_len][0 : j + 1]) for j in range(max_len)] for i, _ in enumerate(tokens)
     ]
     highlighted_text = []
     skip = False
@@ -96,11 +86,8 @@ def _highlight_n_gram(

         if not skip:
             for index, n_gram in enumerate(n_grams):
-
                 if n_gram.lower() in keywords:
-                    candidate = (
-                        f"[black on #FFFF00]{n_gram}[/]" + n_grams[-1].split(n_gram)[-1]
-                    )
+                    candidate = f"[black on #FFFF00]{n_gram}[/]" + n_grams[-1].split(n_gram)[-1]
                     skip = index + 1

             if not candidate:
24 changes: 10 additions & 14 deletions keybert/_llm.py
@@ -2,21 +2,21 @@

 try:
     from sentence_transformers import util
+
     HAS_SBERT = True
 except ModuleNotFoundError:
     HAS_SBERT = False


 class KeyLLM:
-    """
-    A minimal method for keyword extraction with Large Language Models (LLM)
+    """A minimal method for keyword extraction with Large Language Models (LLM).

     The keyword extraction is done by simply asking the LLM to extract a
     number of keywords from a single piece of text.
     """

     def __init__(self, llm):
-        """KeyBERT initialization
+        """KeyBERT initialization.

         Arguments:
             llm: The Large Language Model to use
@@ -29,9 +29,9 @@ def extract_keywords(
         check_vocab: bool = False,
         candidate_keywords: List[List[str]] = None,
         threshold: float = None,
-        embeddings=None
+        embeddings=None,
     ) -> Union[List[str], List[List[str]]]:
-        """Extract keywords and/or keyphrases
+        """Extract keywords and/or keyphrases.

         To get the biggest speed-up, make sure to pass multiple documents
         at once instead of iterating over a single document.
@@ -44,6 +44,8 @@ def extract_keywords(
             docs: The document(s) for which to extract keywords/keyphrases
             check_vocab: Only return keywords that appear exactly in the documents
             candidate_keywords: Candidate keywords for each document
+            threshold: TODO
+            embeddings: TODO

         Returns:
             keywords: The top n keywords for a document with their respective distances
@@ -78,7 +80,6 @@ def extract_keywords(
                 return []

         if HAS_SBERT and threshold is not None and embeddings is not None:
-
             # Find similar documents
             clusters = util.community_detection(embeddings, min_community_size=2, threshold=threshold)
             in_cluster = set([cluster for cluster_set in clusters for cluster in cluster_set])
@@ -97,21 +98,16 @@ def extract_keywords(
                 )
                 out_cluster_keywords = {index: words for words, index in zip(out_cluster_keywords, out_cluster)}

-            # Extract keywords for only the first document in a cluster
+            # Extract keywords for only the first document in a cluster
             if in_cluster:
                 selected_docs = [docs[cluster[0]] for cluster in clusters]
                 if candidate_keywords is not None:
                     selected_keywords = [candidate_keywords[cluster[0]] for cluster in clusters]
                 else:
                     selected_keywords = None
-                in_cluster_keywords = self.llm.extract_keywords(
-                    selected_docs,
-                    selected_keywords
-                )
+                in_cluster_keywords = self.llm.extract_keywords(selected_docs, selected_keywords)
                 in_cluster_keywords = {
-                    doc_id: in_cluster_keywords[index]
-                    for index, cluster in enumerate(clusters)
-                    for doc_id in cluster
+                    doc_id: in_cluster_keywords[index] for index, cluster in enumerate(clusters) for doc_id in cluster
                 }

             # Update out cluster keywords with in cluster keywords
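
The cluster handling in this hunk can be sketched in isolation. The example below is a toy under stated assumptions: `fake_llm` is a hypothetical stand-in for the LLM call, and `clusters` is written by hand instead of coming from `util.community_detection`, so it only illustrates how one LLM call per cluster is shared across that cluster's documents.

```python
# Toy sketch: `fake_llm` and the hand-written `clusters` are hypothetical stand-ins.
llm_calls = []


def fake_llm(doc):
    """Pretend LLM: records the call and returns the two longest words."""
    llm_calls.append(doc)
    return sorted(doc.split(), key=len, reverse=True)[:2]


docs = [
    "delivery of my package is late",
    "my package delivery is very late",
    "the website crashes at checkout",
]
# Assume a similarity step grouped documents 0 and 1; document 2 stands alone.
clusters = [[0, 1]]
in_cluster = {doc_id for cluster in clusters for doc_id in cluster}
out_cluster = [i for i in range(len(docs)) if i not in in_cluster]

# Documents outside any cluster each get their own LLM call.
keywords = {index: fake_llm(docs[index]) for index in out_cluster}

# Only the first document of each cluster is sent to the LLM ...
cluster_keywords = [fake_llm(docs[cluster[0]]) for cluster in clusters]
# ... and its keywords are shared with every document in that cluster.
keywords.update(
    {doc_id: cluster_keywords[index] for index, cluster in enumerate(clusters) for doc_id in cluster}
)

result = [keywords[index] for index in range(len(docs))]
print(len(llm_calls), result[0] == result[1])
```

Three documents need only two calls here, at the cost of the second document inheriting keywords that were extracted from the first.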