Read and write parquet files from Stata (Linux/Unix only).
This package uses the Apache Arrow C++ library to read and write parquet files from Stata using plugins. Currently this package is only available in Stata for Unix (Linux).
version 0.6.5 22Oct2023
Installation | Usage | Examples
You need to first install:

- The Apache Arrow C++ library.
- The GNU Compiler Collection.
- The Boost C++ libraries.
- Google's logging library (google-glog).
First, install Google's logging library: `libgoogle-glog-dev` on Ubuntu, `google-glog` on Arch (you may have to link `libglog.so` to `libglog.so.0`), and so on. Then the only tested way to install this software is via `conda` (see here for installation instructions; the most recent plugin installation and tests were conducted using Miniconda3 for Python 3.8, version 23.3.1):
```shell
git clone https://github.com/mcaceresb/stata-parquet
cd stata-parquet
conda env create -f environment.yml
conda activate stata-parquet
make SPI=3.0 GCC=${CONDA_PREFIX}/bin/x86_64-conda_cos6-linux-gnu-g++ UFLAGS=-std=c++11 INCLUDE=${CONDA_PREFIX}/include LIBS=${CONDA_PREFIX}/lib all
stata -b "net install parquet, from(${PWD}/build) replace"
rm -f stata.log
```
Note: If you have Stata 14.0 or earlier, you will want to use `SPI=2.0` instead.
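For reference, on Stata 14.0 or earlier the build step above stays the same except for the `SPI` flag:

```shell
# Same make invocation as above, with SPI=2.0 for Stata 14.0 and earlier
make SPI=2.0 GCC=${CONDA_PREFIX}/bin/x86_64-conda_cos6-linux-gnu-g++ UFLAGS=-std=c++11 INCLUDE=${CONDA_PREFIX}/include LIBS=${CONDA_PREFIX}/lib all
```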
Warning: The plugin uses a possibly dated version of Parquet (specifically `parquet-cpp` version 1.5.1 and `arrow-cpp` version 0.14.1).
Activate the Conda environment with

```shell
conda activate stata-parquet
```

Then be sure to start Stata via

```shell
LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH xstata
```
Alternatively, you could add the following line to your `~/.bashrc` so that you do not have to set `LD_LIBRARY_PATH` every time (make sure to replace `${CONDA_PREFIX}` with the absolute path it represents):

```shell
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH
```
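As a concrete sketch, assuming a Miniconda install at `~/miniconda3` (the location is an assumption; substitute the value printed by `echo ${CONDA_PREFIX}` while the environment is active), the expanded line would look like:

```shell
# Example expanded ~/.bashrc line; the Miniconda location below is an
# assumption -- replace it with your own ${CONDA_PREFIX} for the
# stata-parquet environment.
export LD_LIBRARY_PATH="$HOME/miniconda3/envs/stata-parquet/lib:$LD_LIBRARY_PATH"
```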
Then just start Stata with `xstata`.
`parquet save` and `parquet use` will save and load datasets in Parquet format, respectively. `parquet desc` will describe the contents of a Parquet dataset. For example:
```stata
sysuse auto, clear
parquet save auto.parquet, replace
parquet desc auto.parquet
parquet use auto.parquet, clear
desc
parquet use price make gear_ratio using auto.parquet, clear in(10/20)
parquet save gear_ratio make using auto.parquet in 5/6 if price > 5000, replace
```
Note that the `if` clause is not supported by `parquet use`. To test that the plugin works as expected, run `do build/parquet_tests.do` from Stata. To also test that the plugin correctly reads hive-format datasets, run

```shell
conda install -n stata-parquet pandas numpy fastparquet
conda activate stata-parquet
```

Then, from Stata, run `do build/parquet_tests.do python`.
- Writing `strL` variables is not yet supported.
- Reading binary ByteArray data is not supported, only strings.
- `Int96` variables are not supported, as the type has no direct Stata counterpart.
- Maximum string widths are not generally stored in `.parquet` files (as far as I can tell). The default behavior is to scan string columns to find the largest string, but this can be time-intensive. Adjust this behavior via `strscan()` and `strbuffer()`.
Some features that ought to be implemented:

- Option `skip` for columns that are in non-readable formats?
- Write regular missing values (high-level only).
Some features that might not be implementable, but the user should be warned about them:

- Extended missing values (user gets a warning).
- `strL` variables.
- Variable formats.
- Variable labels.
- Value labels.
- Dataset notes.
- Variable characteristics.
- ByteArray or FixedLenByteArray with binary data.
Improve:
- Boolean format to/from Stata.
- Best way to transpose from column order to row order.
`stata-parquet` is MIT-licensed.