RecordIO is a file format created for PaddlePaddle Elastic Deep Learning. It is generally useful for distributed computing.
In distributed computing, we often need to dispatch tasks to worker processes. A task is usually defined as a partition of the input data, as in MapReduce and distributed machine learning.
Most distributed filesystems, including HDFS, Google FS, and CephFS, prefer a small number of big files. Therefore, it is impractical to create each task as a small file; instead, we need a format for big files that is
- appendable, so that applications can append records to the file without updating the metadata, and thus fault-tolerant,
- partitionable, so that applications can quickly scan over the file to count the total number of records and create tasks, each corresponding to a sequence of records (see the sketch after the usage example below).
RecordIO is such a file format.
```python
import recordio

# Write records.
with recordio.File('demo.recordio', 'w') as rdio_w:
    rdio_w.write('abc')
    rdio_w.write('def')

# Read records.
with recordio.File('demo.recordio', 'r') as rdio_r:
    # Random access
    for i in range(rdio_r.count()):
        print(rdio_r.get(i))

    # Range reading
    for record in rdio_r.get_reader(2, 10):
        print(record)

    # Direct iteration
    for record in rdio_r:
        print(record)
```
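These APIs make the partitioning described above straightforward: a master process can count the records once and hand out index ranges as tasks, and each worker then reads only its assigned range. Below is a minimal sketch; the task size and the master/worker split are illustrative, and it assumes `get_reader(start, end)` takes begin and end record indices, as the range-reading example above suggests.

```python
import recordio

TASK_SIZE = 1000  # records per task; an illustrative choice

# Master: scan the file once to count records, then create tasks as
# (start, end) index ranges.
with recordio.File('demo.recordio', 'r') as f:
    total = f.count()
tasks = [(start, min(start + TASK_SIZE, total))
         for start in range(0, total, TASK_SIZE)]

# Worker: process only the records in one assigned range.
def process_task(filename, start, end):
    with recordio.File(filename, 'r') as f:
        # Assumes get_reader(start, end) yields records in [start, end).
        for record in f.get_reader(start, end):
            print(record)
```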
To run the unit tests, in this directory:

```bash
python -m unittest recordio/recordio/*_test.py
```
The packaging process largely follows the example of the TensorFlow custom op repository.
First, build the RecordIO development Docker image:

```bash
docker build -t recordio:dev -f Dockerfile .
```
Start the Docker container, mapping the `git` and bazel `.cache` directories:

```bash
docker run --rm -it \
    -v $HOME/git:/git \
    -v $HOME/.cache:/.cache \
    -w /git/pyrecordio \
    recordio:dev
```
Inside the container, build the pip package:

```bash
bazel build build_pip_pkg
bazel-bin/build_pip_pkg artifacts
```
After building the package, force-install it to replace any previously installed version:

```bash
pip install -I artifacts/recordio-<version>.whl
```
To test the installed TensorFlow RecordIO Dataset op:

```bash
python recordio/tensorflow_op/python/tf_recordio_dataset_test.py
```
There is also a prepackaged version checked into the repo for convenience.
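For a rough sense of how the installed dataset op might be consumed, here is a hypothetical sketch in the style of `tf.data.TFRecordDataset`; the module path and the `RecordIODataset` name are assumptions inferred from the directory layout above, not a documented API, so check the test script for the actual usage.

```python
# Hypothetical import, inferred from the repo layout; the real module
# path and class name may differ -- see the test script above.
from recordio.tensorflow_op.python.tf_recordio_dataset import RecordIODataset

# Build a tf.data-style pipeline over a RecordIO file, analogous to
# how tf.data.TFRecordDataset reads TFRecord files.
dataset = RecordIODataset('demo.recordio')
for record in dataset:  # each element would be a scalar string tensor
    print(record.numpy())
```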