RemoveDup

A fast, memory-efficient Python module to remove duplicates from parallel text corpora.

It's useful for cleaning up language model training datasets that contain duplicate entries.

Installation

pip install removedup

Usage

from removedup import rdup

# Returns the paths of the two deduplicated files and the number of lines removed
src, tgt, removed = rdup("source.txt", "target.txt")
print(src, tgt, removed)
# source.txt.dedup target.txt.dedup <num lines removed>

Notes

Source and target must have the same number of lines. No validation checks are made.
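
Since no validation is performed, it may be worth checking the line counts yourself before calling rdup. The following is a minimal sketch using only the standard library; `count_lines` and `same_line_count` are hypothetical helpers, not part of removedup, and they assume a final line without a trailing newline still counts as a line.

```python
def count_lines(path):
    """Count lines by streaming the file, so large corpora are not loaded into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def same_line_count(source_path, target_path):
    """Return True if both files contain the same number of lines."""
    return count_lines(source_path) == count_lines(target_path)
```

Call `same_line_count("source.txt", "target.txt")` first and only run rdup when it returns True.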

Duplicates are detected on the source content only. To check for duplicates on the target instead, switch the order of the parameters.
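
To illustrate what source-only deduplication means for a parallel corpus, here is a small pure-Python sketch. It is not RemoveDup's implementation: `dedup_on_source` is a hypothetical function, and it assumes the first occurrence of each source line is kept and that the aligned target line is dropped together with its duplicate source line.

```python
def dedup_on_source(src_lines, tgt_lines):
    """Drop every pair whose source line was already seen (assumed: first occurrence wins)."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")

    seen = set()
    kept_src, kept_tgt = [], []
    for s, t in zip(src_lines, tgt_lines):
        if s in seen:
            continue  # duplicate source line: skip the whole pair
        seen.add(s)
        kept_src.append(s)
        kept_tgt.append(t)

    removed = len(src_lines) - len(kept_src)
    return kept_src, kept_tgt, removed


src = ["hello", "world", "hello"]
tgt = ["ciao", "mondo", "salve"]

# Deduplicate on the source: the third pair is dropped even though its target is unique.
print(dedup_on_source(src, tgt))

# Deduplicate on the target by swapping the arguments: nothing is dropped here.
print(dedup_on_source(tgt, src))
```

Note that in the swapped call the first returned list holds the target lines, because the function always returns the deduplicated "source" argument first.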

Build

git clone https://github.com/LibreTranslate/RemoveDup
cd RemoveDup
python setup.py build

Standalone Binary

You can also use removedup as a standalone Windows, macOS or Linux application. We don't currently provide binaries, so you need to build it from source:

mkdir build
cd build && cmake .. && make -j4
./rdup source.txt target.txt

Contributing

We welcome pull requests!

License

AGPLv3
