update README and setup
jararap committed Jan 16, 2025
1 parent 83164dd commit 8cdc33f
Showing 2 changed files with 19 additions and 3 deletions.
12 changes: 11 additions & 1 deletion README.md
@@ -2,6 +2,8 @@
In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm **GreedTok**.
Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation greedy algorithm.

+To do: Huggingface AutoTokenizer interface
+
### GreedTok
1. If using python wrapper

@@ -59,4 +61,12 @@ Our formulation naturally relaxes to the well-studied weighted maximum coverage
Evaluations in [eval_notebook.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_notebook.ipynb)
### Citation
-TBD
+```
+@article{lim2025partition,
+title={A partition cover approach to tokenization},
+author={Lim, Jia Peng and Choo, Davin and Lauw, Hady W.},
+year={2025},
+journal={arXiv preprint arXiv:2501.06246},
+url={https://arxiv.org/abs/2501.06246},
+}
+```
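
Note on the README text in the first hunk: it mentions that the formulation relaxes to weighted maximum coverage, which has a simple $(1 - 1/e)$-approximation greedy algorithm. For readers unfamiliar with it, here is a minimal sketch of that textbook greedy rule. It is not GreedTok and is not taken from this repository; the function name `greedy_max_coverage` and the toy data are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(
    sets: Dict[str, Set[Hashable]],
    weights: Dict[Hashable, float],
    k: int,
) -> List[str]:
    """Pick up to k sets, each time taking the one that adds the most uncovered weight.

    Generic textbook greedy for weighted maximum coverage (illustrative only).
    """
    covered: Set[Hashable] = set()
    chosen: List[str] = []
    for _ in range(k):
        best_name, best_gain = None, 0.0
        for name, elems in sets.items():
            if name in chosen:
                continue
            # Marginal gain: total weight of elements not covered yet.
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no remaining set adds any weight
            break
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen


if __name__ == "__main__":
    # Hypothetical toy instance: element 4 is worth more than the others.
    toy_sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}}
    toy_weights = {1: 1, 2: 1, 3: 1, 4: 5, 5: 1, 6: 1}
    print(greedy_max_coverage(toy_sets, toy_weights, k=2))  # ['C', 'A']
```

With `k=2` the sketch first takes `C` (gain 7), then `A` (gain 3), covering all of the weight in this toy instance.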
10 changes: 8 additions & 2 deletions setup.py
@@ -1,5 +1,6 @@
from sysconfig import get_path
from setuptools import setup, Extension
+from pathlib import Path

PATH_PREFIX = get_path('data')
module1 = Extension(f'greedy_builder',
@@ -13,15 +14,20 @@
libraries = ['tbb'],
sources = ['pcatt/greedy_builder.cpp'])

+this_directory = Path(__file__).parent
+long_description = (this_directory / "README.md").read_text()
+
setup(
name="greedtok",
-version="0.1",
+version="0.13",
description="Partition Cover Approach to Tokenization",
author="JP Lim",
author_email="[email protected]",
license = "MIT",
setup_requires=['pybind11', 'tbb-devel'],
url = "https://github.com/PreferredAI/pcatt/",
download_url = "https://github.com/PreferredAI/pcatt/archive/refs/tags/v0.13.tar.gz",
-ext_modules = [module1]
+ext_modules = [module1],
+long_description=long_description,
+long_description_content_type='text/markdown'
)
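
Note on the `long_description` change: the new lines read `README.md` with the platform's default text encoding and assume the file is always present next to `setup.py`. A slightly more defensive variant could look like the sketch below. The helper `read_long_description` is hypothetical and is not part of this commit; it only uses `pathlib.Path.is_file` and `Path.read_text`.

```python
from pathlib import Path


def read_long_description(directory: Path, filename: str = "README.md") -> str:
    """Return the README text, or an empty string if the file is missing."""
    readme = directory / filename
    if not readme.is_file():
        # Assumption: an empty description is preferable to a failed build.
        return ""
    # Explicit encoding avoids depending on the platform's default locale.
    return readme.read_text(encoding="utf-8")
```

In `setup.py` this would be called as `read_long_description(Path(__file__).parent)` and the result passed to `setup(long_description=...)`.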
