
Merge pull request #68 from bact/add-license-spdx-in-bindings
Add license header to bindings
bact authored Nov 9, 2024
2 parents c5b81ca + ef6af04 commit 96d9c27
Showing 21 changed files with 146 additions and 57 deletions.
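The change itself is mechanical: each binding source file gains a two-line REUSE-style SPDX header at the top, in that file's comment syntax. For the Rust sources in this commit the added header is exactly:

```rust
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0
```

The Python and TypeScript sources get the same two lines with `#` and `//` comment markers, and the Markdown files carry them in a front-matter block, as the diffs below show.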
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -26,5 +26,5 @@ keywords:
- "Thai language"
- "Thai NLP"
license: Apache-2.0
version: v1.3.2
date-released: "2021-07-17"
version: v1.4.0
date-released: "2024-11-09"
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "nlpo3"
version = "1.3.2"
version = "1.4.0"
edition = "2018"
license = "Apache-2.0"
authors = ["Thanathip Suntorntip Gorlph"]
4 changes: 2 additions & 2 deletions README.md
@@ -44,7 +44,7 @@ pip install nlpo3
or from `Vec<String>`

[tcc]: https://dl.acm.org/doi/10.1145/355214.355225
[benchmark]: https://github.com/PyThaiNLP/nlpo3/blob/main/nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb
[benchmark]: ./nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb

## Dictionary file

@@ -110,7 +110,7 @@ In `Cargo.toml`:
```toml
[dependencies]
# ...
nlpo3 = "1.3.2"
nlpo3 = "1.4.0"
```

#### Example
4 changes: 2 additions & 2 deletions nlpo3-cli/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "nlpo3-cli"
version = "0.2.0"
version = "0.2.1-dev"
edition = "2018"
authors = ["Vee Satayamas <[email protected]>"]
description = "Command line interface for nlpO3, a Thai natural language processing library"
@@ -18,4 +18,4 @@ path = "src/main.rs"

[dependencies]
clap = { version = "3.0.0-beta.2" }
nlpo3 = { version = "1.2.0" }
nlpo3 = { version = "1.4.0" }
9 changes: 7 additions & 2 deletions nlpo3-cli/README.md
@@ -1,8 +1,13 @@
<a href="https://crates.io/crates/nlpo3-cli/"><img alt="crates.io" src="https://img.shields.io/crates/v/nlpo3-cli.svg"/></a>
<a href="https://opensource.org/licenses/Apache-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue.svg"/></a>
---
SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0
---

# nlpo3-cli

[![crates.io](https://img.shields.io/crates/v/nlpo3-cli.svg "crates.io")](https://crates.io/crates/nlpo3-cli/)
[![Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg "Apache-2.0")](https://opensource.org/licenses/Apache-2.0)

Command line interface for nlpO3, a Thai natural language processing library.

## Install
3 changes: 3 additions & 0 deletions nlpo3-cli/src/main.rs
@@ -1,3 +1,6 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

use clap::Clap;
use nlpo3::tokenizer::newmm_custom::Newmm;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;
10 changes: 5 additions & 5 deletions nlpo3-nodejs/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "nlpo3-nodejs"
version = "0.3.0"
version = "1.0.0"
edition = "2018"
license = "Apache-2.0"
authors = ["Thanathip Suntorntip Gorlph"]
@@ -10,11 +10,11 @@ exclude = ["index.node"]
crate-type = ["cdylib"]

[dependencies]
ahash = "0.7.6"
lazy_static = "1.4.0"
nlpo3 = "1.3.2"
ahash = "0.8.6"
lazy_static = "1.5.0"
nlpo3 = "1.4.0"

[dependencies.neon]
version = "0.8"
version = "0.22.6"
default-features = false
features = ["napi-6"]
32 changes: 22 additions & 10 deletions nlpo3-nodejs/README.md
@@ -1,15 +1,23 @@
<a href="https://opensource.org/licenses/Apache-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue.svg"/></a>
---
SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0
---

# nlpO3 Node.js binding

[![Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg "Apache-2.0")](https://opensource.org/licenses/Apache-2.0)

Node.js binding for nlpO3, a Thai natural language processing library in Rust.

## Features

- Thai word tokenizer
- use maximal-matching dictionary-based tokenization algorithm and honor Thai Character Cluster boundaries
- fast backend in Rust
- support custom dictionary
- Use maximal-matching dictionary-based tokenization algorithm
and honor [Thai Character Cluster][tcc] boundaries
- Fast backend in Rust
- Support custom dictionary

[tcc]: https://dl.acm.org/doi/10.1145/355214.355225

## Build

@@ -26,14 +34,16 @@ npm run release
```

Before build, your `nlpo3/` directory should look like this:
```

```text
- nlpo3/
- index.ts
- rust_mod.d.ts
```

After build:
```

```text
- nlpo3/
- index.js
- index.ts
@@ -47,15 +57,16 @@ For now, copy the whole `nlpo3/` directory after build to your project.

### npm (experitmental)

npm is still experimental and may not work on all platforms. Please report issues at https://github.com/PyThaiNLP/nlpo3/issues
npm is still experimental and may not work on all platforms. Please report issues at <https://github.com/PyThaiNLP/nlpo3/issues>

```bash
```shell
npm i nlpo3
```

## Usage

In JavaScript:

```javascript
const nlpO3 = require(`${path_to_nlpo3}`)

@@ -65,6 +76,7 @@ nloO3.segment("สวัสดีครับ", "dict_name")
```

In TypeScript:

```typescript
import {segment, loadDict} from `${path_to_nlpo3}/index`

@@ -75,8 +87,8 @@ segment("สวัสดีครับ", "dict_name")

## Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues
Please report issues at <https://github.com/PyThaiNLP/nlpo3/issues>

# TODO
## TODO

- Find a way to build binaries and publish on npm.
3 changes: 3 additions & 0 deletions nlpo3-nodejs/nlpo3/index.ts
@@ -1,3 +1,6 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

import * as nativeModule from './rust_mod'
/**
* Load dict from dictionary file and store in hash map with key = dictName for ***segment*** function to use.
3 changes: 3 additions & 0 deletions nlpo3-nodejs/nlpo3/rust_mod.d.ts
@@ -1,3 +1,6 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

export function segment(text: string, dict_name: string, safe: boolean, parallel: boolean): string[];
/** file_path is an absolute path */
export function loadDict(file_path: string, dict_name: string): string;
2 changes: 1 addition & 1 deletion nlpo3-nodejs/package.json
@@ -45,4 +45,4 @@
"files": [
"nlpo3"
]
}
}
6 changes: 5 additions & 1 deletion nlpo3-nodejs/src/lib.rs
@@ -1,3 +1,6 @@
// SPDX-FileCopyrightText: 2024 PyThaiNLP Project
// SPDX-License-Identifier: Apache-2.0

use std::sync::Mutex;

use ahash::AHashMap as HashMap;
@@ -6,7 +9,8 @@ use neon::prelude::*;
use nlpo3::tokenizer::{newmm::NewmmTokenizer, tokenizer_trait::Tokenizer};

lazy_static! {
static ref DICT_COLLECTION:Mutex<HashMap<String,Box<NewmmTokenizer>>> = Mutex::new(HashMap::new());
static ref DICT_COLLECTION: Mutex<HashMap<String, Box<NewmmTokenizer>>> =
Mutex::new(HashMap::new());
}

fn load_dict(mut cx: FunctionContext) -> JsResult<JsString> {
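The `lazy_static` block reformatted above is the bindings' shared-state pattern: a process-wide registry mapping a dictionary name to its tokenizer, behind a `Mutex`. The following is only a self-contained sketch of that pattern with a stand-in `Tok` type, not the crate's actual `NewmmTokenizer` or its API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

use lazy_static::lazy_static;

// Stand-in for nlpo3::tokenizer::newmm::NewmmTokenizer (hypothetical here).
struct Tok;

lazy_static! {
    // One registry shared by the whole process, keyed by dictionary name.
    static ref DICT_COLLECTION: Mutex<HashMap<String, Box<Tok>>> =
        Mutex::new(HashMap::new());
}

fn register(name: &str, tok: Tok) {
    DICT_COLLECTION
        .lock()
        .expect("registry mutex poisoned")
        .insert(name.to_string(), Box::new(tok));
}

fn main() {
    register("dict_name", Tok);
    assert_eq!(DICT_COLLECTION.lock().unwrap().len(), 1);
}
```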
11 changes: 6 additions & 5 deletions nlpo3-python/Cargo.toml
@@ -1,22 +1,23 @@
[package]
name = "nlpo3-python"
version = "1.3.0"
version = "1.3.1-dev"
edition = "2018"
license = "Apache-2.0"
authors = ["Thanathip Suntorntip Gorlph"]
description = "Python binding for nlpO3 Thai language processing library"
exclude = ["notebooks"]
keywords = ["thai", "tokenizer", "nlp", "word-segmentation", "python"]

[lib]
name = "_nlpo3_python_backend"
path = "src/lib.rs"
crate-type = ["cdylib", "rlib"]

[dependencies]
ahash = "0.7.6"
lazy_static = "1.4.0"
nlpo3 = "1.3.2"
ahash = "0.8.6"
lazy_static = "1.5.0"
nlpo3 = "1.4.0"

[dependencies.pyo3]
version = "0.15.0"
version = "0.22.6"
features = ["extension-module"]
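The largest jump in this file is pyo3 0.15.0 → 0.22.6, which moves module initialization to the `Bound` API. What follows is not this crate's `lib.rs`; it is a minimal, hypothetical sketch of a pyo3 0.22 extension module using the `[lib]` name above, with a placeholder `segment` body instead of the real tokenizer:

```rust
use pyo3::prelude::*;

// Placeholder logic only; the real binding delegates to nlpo3's tokenizer.
#[pyfunction]
fn segment(text: &str) -> PyResult<Vec<String>> {
    Ok(text.split_whitespace().map(str::to_string).collect())
}

// In pyo3 0.22, the module init function receives a Bound<'_, PyModule>.
#[pymodule]
fn _nlpo3_python_backend(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(segment, m)?)?;
    Ok(())
}
```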
52 changes: 38 additions & 14 deletions nlpo3-python/README.md
@@ -1,28 +1,46 @@
<a href="https://pypi.python.org/pypi/nlpo3"><img alt="pypi" src="https://img.shields.io/pypi/v/nlpo3.svg"/></a>
<a href="https://www.python.org/downloads/release/python-360/"><img alt="Python 3.6" src="https://img.shields.io/badge/python-3.6-blue.svg"/></a>
<a href="https://opensource.org/licenses/Apache-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue.svg"/></a>
<a href="https://pepy.tech/project/nlpo3"><img alt="Downloads" src="https://pepy.tech/badge/nlpo3/month"/></a>
---
SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0
---

# nlpO3 Python binding

[![PyPI](https://img.shields.io/pypi/v/nlpo3.svg "PyPI")](https://pypi.python.org/pypi/nlpo3)
[![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg "Python 3.6")](https://www.python.org/downloads/)
[![Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg "Apache-2.0")](https://opensource.org/license/apache-2-0)

Python binding for nlpO3, a Thai natural language processing library in Rust.

## Features

- Thai word tokenizer
- `segment()` - use maximal-matching dictionary-based tokenization algorithm and honor Thai Character Cluster boundaries
- [2.5x faster](notebooks/nlpo3_segment_benchmarks.ipynb) than similar pure Python implementation (PyThaiNLP's newmm)
- `load_dict()` - load a dictionary from plain text file (one word per line)
- `segment()` - use maximal-matching dictionary-based tokenization algorithm
and honor [Thai Character Cluster][tcc] boundaries
- [2.5x faster][benchmark]
than similar pure Python implementation (PyThaiNLP's newmm)
- `load_dict()` - load a dictionary from a plain text file
(one word per line)

[tcc]: https://dl.acm.org/doi/10.1145/355214.355225
[benchmark]: ./notebooks/nlpo3_segment_benchmarks.ipynb

## Dictionary file

- For the interest of library size, nlpO3 does not assume what dictionary the developer would like to use.
It does not come with a dictionary. A dictionary is needed for the dictionary-based word tokenizer.
- For the interest of library size, nlpO3 does not assume what dictionary the
user would like to use, and it does not come with a dictionary.
- A dictionary is needed for the dictionary-based word tokenizer.
- For tokenization dictionary, try
- [words_th.tx](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt) from [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp/) - around 62,000 words (CC0)
- [word break dictionary](https://github.com/tlwg/libthai/tree/master/data) from [libthai](https://github.com/tlwg/libthai/) - consists of dictionaries in different categories, with make script (LGPL-2.1)

- [words_th.tx][dict-pythainlp] from [PyThaiNLP][pythainlp]
- ~62,000 words
- CC0-1.0
- [word break dictionary][dict-libthai] from [libthai][libthai]
- consists of dictionaries in different categories, with a make script
- LGPL-2.1

[pythainlp]: https://github.com/PyThaiNLP/pythainlp
[libthai]: https://github.com/tlwg/libthai/
[dict-pythainlp]: https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt
[dict-libthai]: https://github.com/tlwg/libthai/tree/master/data

## Install

@@ -34,6 +52,7 @@ pip install nlpo3

Load file `path/to/dict.file` to memory and assign a name `dict_name` to it.
Then tokenize a text with the `dict_name` dictionary:

```python
from nlpo3 import load_dict, segment

@@ -42,18 +61,22 @@ segment("สวัสดีครับ", "dict_name")
```

it will return a list of strings:

```python
['สวัสดี', 'ครับ']
```

(result depends on words included in the dictionary)

Use multithread mode, also use the `dict_name` dictionary:

```python
segment("สวัสดีครับ", dict_name="dict_name", parallel=True)
```

Use safe mode to avoid long waiting time in some edge cases
for text with lots of ambiguous word boundaries:

```python
segment("สวัสดีครับ", dict_name="dict_name", safe=True)
```
@@ -77,8 +100,9 @@ python -m pip install --upgrade build
python -m build
```

This should generate a wheel file, in `dist/` directory, which can be installed by pip.
This should generate a wheel file, in `dist/` directory,
which can be installed by pip.

## Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues
Please report issues at <https://github.com/PyThaiNLP/nlpo3/issues>
3 changes: 3 additions & 0 deletions nlpo3-python/nlpo3/__init__.py
@@ -1,3 +1,6 @@
# SPDX-FileCopyrightText: 2024 PyThaiNLP Project
# SPDX-License-Identifier: Apache-2.0

# Python-binding for nlpO3, an natural language process library.
#
# Provides a tokenizer.
19 changes: 16 additions & 3 deletions nlpo3-python/pyproject.toml
@@ -4,11 +4,11 @@ build-backend = "setuptools.build_meta"

[project]
name = "nlpo3"
version = "1.3.0"
version = "1.3.1-dev"
description = "Python binding for nlpO3 Thai language processing library in Rust"
readme = "README.md"
requires-python = ">=3.6"
license = {text = "Apache-2.0"}
license = { text = "Apache-2.0" }
keywords = ["thai", "tokenizer", "nlp", "word-segmentation", "pythainlp"]
authors = [
{ name = "Thanathip Suntorntip" },
@@ -22,6 +22,10 @@ classifiers = [
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Intended Audience :: Developers",
"License :: OSI Approved :: Apache Software License",
"Natural Language :: Thai",
@@ -43,7 +47,16 @@ wheel = "*"

[tool.black]
line-length = 79
target_version = ['py36', 'py37', 'py38', 'py39']
target_version = [
'py36',
'py37',
'py38',
'py39',
'py310',
'py311',
'py312',
'py313',
]
experimental_string_processing = true
exclude = '''
/(