A memory-efficient Python module for quickly shuffling parallel text corpora. Its main advantage is that it never loads the entire dataset into memory; it works with memory-mapped file offsets instead.
pip install fastshuffle
from fastshuffle import file_shuffle
src, tgt = file_shuffle("source.txt", "target.txt")
print(src, tgt)
# source.txt.shuffled target.txt.shuffled
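To give a feel for the offset-based idea, here is a rough pure-Python sketch. It is not fastshuffle's implementation: `line_spans`, `write_in_order` and `shuffle_pair_sketch` are hypothetical names made up for illustration, and the sketch assumes both files are non-empty and use `\n` line endings. Only the byte offsets of each line and one shared permutation are held in memory; the text itself stays on disk behind the memory map.

```python
import mmap
import random


def line_spans(mm):
    """Return a list of (start, end) byte offsets, one pair per line."""
    spans, start, size = [], 0, len(mm)
    while start < size:
        newline = mm.find(b"\n", start)
        end = size if newline == -1 else newline + 1
        spans.append((start, end))
        start = end
    return spans


def write_in_order(mm, spans, order, out_path):
    """Copy lines to out_path in the given order, one line at a time."""
    with open(out_path, "wb") as out:
        for i in order:
            start, end = spans[i]
            line = mm[start:end]
            # A final line without a newline must not merge with the next one.
            out.write(line if line.endswith(b"\n") else line + b"\n")


def shuffle_pair_sketch(src_path, tgt_path, seed=None):
    """Hypothetical helper: shuffle two line-aligned files with one shared permutation."""
    src_out, tgt_out = src_path + ".shuffled", tgt_path + ".shuffled"
    with open(src_path, "rb") as fs, open(tgt_path, "rb") as ft:
        # Note: mmap raises ValueError for empty files.
        with mmap.mmap(fs.fileno(), 0, access=mmap.ACCESS_READ) as ms, \
             mmap.mmap(ft.fileno(), 0, access=mmap.ACCESS_READ) as mt:
            src_spans, tgt_spans = line_spans(ms), line_spans(mt)
            if len(src_spans) != len(tgt_spans):
                raise ValueError("source and target line counts differ")
            # One permutation applied to both files keeps sentence pairs aligned.
            order = list(range(len(src_spans)))
            random.Random(seed).shuffle(order)
            write_in_order(ms, src_spans, order, src_out)
            write_in_order(mt, tgt_spans, order, tgt_out)
    return src_out, tgt_out
```

Applying the same permutation to both files is what keeps line N of the shuffled source matched with line N of the shuffled target.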
You can also sample a fixed number of sentence pairs while shuffling. The sampled pairs are written to separate files and removed from the shuffled result.
from fastshuffle import file_shuffle_sample
src, tgt, src_sample, tgt_sample = file_shuffle_sample("source.txt", "target.txt", 5) # Sample 5 sentences
print(src, tgt, src_sample, tgt_sample)
# source.txt.shuffled target.txt.shuffled source.txt.shuffled.sample target.txt.shuffled.sample
Source and target must have the same number of lines. No validation checks are made.
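Since no validation is performed, it may be worth checking the line counts yourself before shuffling. A minimal sketch follows; `count_lines` and `check_parallel` are hypothetical helpers written for this example and are not part of fastshuffle. The files are read in fixed-size chunks, so memory use stays flat even for large corpora.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count lines by scanning the file in fixed-size binary chunks."""
    count = 0
    last = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
            last = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last and last != b"\n":
        count += 1
    return count


def check_parallel(src_path, tgt_path):
    """Raise ValueError if the two files differ in line count; return the count otherwise."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            f"line count mismatch: {src_path} has {n_src}, {tgt_path} has {n_tgt}"
        )
    return n_src
```

Calling `check_parallel("source.txt", "target.txt")` before the shuffle surfaces a mismatch early instead of producing misaligned output.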
git clone https://github.com/LibreTranslate/FastShuffle
cd FastShuffle
python setup.py build
You can also use fastshuffle as a standalone Windows, macOS or Linux application. We don't currently provide binaries, so you need to build it from source.
mkdir build
cd build && cmake .. && make -j4
./shuffle source.txt target.txt
We welcome pull requests!
AGPLv3