OSCAR2Parquet

This CLI tool converts OSCAR's JSONL files into Parquet. It takes Ungoliant's output as input and writes the Parquet files to the destination folder. It is intended to replace the splitting and compression steps of OSCAR generation previously performed by oscar-tools.

Todo

  • Add Python bindings
  • Add tests
  • Add option to control the maximum number of rows per parquet file

Usage

oscar2parquet -h
Converts OSCAR's jsonl files into parquet.

Usage: oscar2parquet [OPTIONS] <INPUT FOLDER> <DESTINATION FOLDER>

Arguments:
  <INPUT FOLDER>        Folder containing the indices
  <DESTINATION FOLDER>  Parquet file to write

Options:
  -t, --threads <NUMBER OF THREADS>  Number of threads to use [default: 10]
  -h, --help                         Print help
  -V, --version                      Print version
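
For scripted pipelines, the invocation above can be wrapped in a small helper. The following is a minimal sketch, not part of this project: the function names `build_command` and `convert` and the folder names are hypothetical, and it assumes the `oscar2parquet` binary is already installed and on `PATH`. It uses only the arguments and the `--threads` option listed in the help text.

```python
import shutil
import subprocess


def build_command(input_dir, output_dir, threads=10):
    """Assemble the argv list for oscar2parquet (hypothetical helper).

    Uses only what the help text documents: the --threads option and the
    two positional arguments (input folder, destination folder).
    """
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return ["oscar2parquet", "--threads", str(threads), str(input_dir), str(output_dir)]


def convert(input_dir, output_dir, threads=10):
    """Run the conversion; assumes oscar2parquet is installed and on PATH."""
    if shutil.which("oscar2parquet") is None:
        raise FileNotFoundError("oscar2parquet not found on PATH")
    # check=True raises CalledProcessError if the tool exits with a non-zero status.
    subprocess.run(build_command(input_dir, output_dir, threads), check=True)
```

Calling `convert("ungoliant-output", "parquet-out", threads=4)` would then run the tool with four threads on those (illustrative) folders.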