feat: remove autoreach
stephantul committed Jul 14, 2024
1 parent 5a9e3eb commit abe1fb4
Showing 7 changed files with 2 additions and 285 deletions.
56 changes: 0 additions & 56 deletions README.md
@@ -113,62 +113,6 @@ On my machine (a 2022 M1 macbook pro), we get the following times for [`COW BIG`

`reach` has a special fast format, which is useful if you want to reload your word vectors often. The fast format can be created using the `save_fast_format` function, and loaded using the `load_fast_format` function. In terms of loading speed, this is roughly equivalent to saving word vectors in `gensim`'s own format.
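A minimal sketch of the round trip is shown below. `save_fast_format` and `load_fast_format` are the names used above, but the exact call signatures (the path argument, and `load_fast_format` being a classmethod on `Reach`) are assumptions here:

```python
from reach import Reach

# Load vectors from a regular word2vec-style text file (the slow path).
r = Reach.load("glove.6B.100d.txt")

# Write the vectors out in reach's fast format.
r.save_fast_format("glove_fast")

# Reloading from the fast format avoids re-parsing the text file.
r = Reach.load_fast_format("glove_fast")
```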

# autoreach

Reach also has a way of automatically inferring words from strings without using a pre-defined tokenizer, i.e., without splitting the string into words. This is useful because there might be mismatches between the tokenizer you happen to have on hand and the word vectors you use. For example, if your vector space contains an embedding for the word `"it's"`, and your tokenizer splits this string into two tokens, `["it", "'s"]`, the embedding for `"it's"` will never be found.
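To see the mismatch in action, here is a small illustration using `nltk`'s default Penn Treebank-style tokenizer, which splits contractions:

```python
from nltk import word_tokenize  # requires nltk.download("punkt")

# The tokenizer splits the contraction, so a vocabulary entry for "it's"
# can never be looked up from these tokens.
print(word_tokenize("it's"))  # ['it', "'s"]
```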

autoreach solves this problem by searching the string only for words that are in your pre-defined vocabulary, thus removing the need for any tokenization. We use the [Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm), which allows us to find substrings in linear time. The downside is that Aho-Corasick also finds substrings of regular words: for example, the word `the` will be found as a substring of `these`. To circumvent this, we perform a regex-based clean-up step (sketched below).

**Warning! The clean-up step involves checking for surrounding spaces and punctuation marks. Hence, if the language for which you use Reach does not actually use spaces and/or punctuation marks to designate word boundaries, the entire process might not work.**
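To make the matching strategy concrete, here is a minimal sketch using the `pyahocorasick` package that autoreach depended on. The boundary regex and the helper function are illustrative assumptions, not Reach's actual clean-up code:

```python
import re

import ahocorasick  # the pyahocorasick package

# Build an automaton over the vocabulary.
vocab = ["the", "dog", "walked", "home"]
automaton = ahocorasick.Automaton()
for word in vocab:
    automaton.add_word(word, word)
automaton.make_automaton()

# Hypothetical clean-up: a match only counts if it is flanked by
# whitespace, punctuation, or the edge of the string.
boundary = re.compile(r"[\s.,!?;:]")

def find_words(text: str) -> list[str]:
    found = []
    for end, word in automaton.iter(text):
        start = end - len(word) + 1
        before = text[start - 1] if start > 0 else " "
        after = text[end + 1] if end + 1 < len(text) else " "
        # "the" inside "these" fails the right-boundary check and is dropped.
        if boundary.match(before) and boundary.match(after):
            found.append(word)
    return found

print(find_words("the dog, walked, home"))
# ['the', 'dog', 'walked', 'home']
```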

### Example

```python
import numpy as np

from reach import AutoReach

words = ["dog", "walked", "home"]
vectors = np.random.randn(3, 32)

r = AutoReach(vectors, words)

sentence = "The dog, walked, home"
# bow holds the indices of the vocabulary words found in the sentence.
bow = r.bow(sentence)

# Map the indices back to the words themselves.
found_words = [r.indices[index] for index in bow]
```

### Benchmark

Because we no longer need to tokenize, `AutoReach` can be many times faster. In this benchmark, we compare it to plain whitespace splitting and to `nltk`'s `word_tokenize` function.

We will use the entirety of Mary Shelley's Frankenstein, which you can find [here](https://www.gutenberg.org/cache/epub/42324/pg42324.txt), and the glove.6B.100d vectors, which you can find [here](https://nlp.stanford.edu/data/glove.6B.zip).

```python
from pathlib import Path

from nltk import word_tokenize

from reach import AutoReach, Reach


txt = Path("pg42324.txt").read_text().lower()
normal_reach = Reach.load("glove.6B.100d.txt")
auto_reach = AutoReach.load("glove.6B.100d.txt")

# Ipython magic commands
%timeit normal_reach.vectorize(word_tokenize(txt), remove_oov=True)
# 345 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit normal_reach.vectorize(txt.split(), remove_oov=True)
# 25.4 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit auto_reach.vectorize(txt)
# 69.9 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

As you can see, the tokenizer introduces significant overhead compared to plain splitting, while matching with the Aho-Corasick algorithm is still reasonably fast.

# License

MIT
1 change: 0 additions & 1 deletion pyproject.toml
```diff
@@ -13,7 +13,6 @@ target-version = "py311"
 [[tool.mypy.overrides]]
 module = [
     "tqdm.*",
-    "ahocorasick.*",
     "setuptools.*",
 ]
 ignore_missing_imports = true
```
124 changes: 0 additions & 124 deletions reach/autoreach.py

This file was deleted.

3 changes: 2 additions & 1 deletion reach/reach.py
```diff
@@ -1006,8 +1006,9 @@ def save_fast_format(
         metadata = {
             "unk_token": self.unk_token,
             "name": self.name,
-            **(additional_metadata or {}),
         }
+        if additional_metadata is not None:
+            metadata.update(additional_metadata)

         items = self.sorted_items
         items_dict = {
```
1 change: 0 additions & 1 deletion requirements.txt
```diff
@@ -1,3 +1,2 @@
 numpy
 tqdm
-pyahocorasick
```
1 change: 0 additions & 1 deletion setup.py
```diff
@@ -15,7 +15,6 @@
     license="MIT",
     packages=find_packages(include=["reach"]),
     install_requires=["numpy", "tqdm"],
-    extras_require={"auto": ["pyahocorasick"]},
     project_urls={
         "Source Code": "https://github.com/stephantul/reach",
         "Issue Tracker": "https://github.com/stephantul/reach/issues",
```
101 changes: 0 additions & 101 deletions tests/test_auto.py

This file was deleted.
