7.0.0 - 2025-01-06
In the past users struggled to find the `--column-length-limit` option. Therefore `odbc2parquet` now sets it to `4096` by default. To prevent silent data loss due to truncation as a consequence of this change, reporting truncation errors is now always active. In addition, the error message for truncation errors has been improved: it mentions the affected column and hints that increasing the `--column-length-limit` option might be a good idea.
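For example, a result set with long text columns can be exported with a higher limit. This is a minimal sketch; the connection string, output path, table and column names are placeholders:

```shell
# Raise the per-column length limit from the default of 4096.
# Connection string, output path and query are placeholders.
odbc2parquet query \
    --connection-string "DSN=MyDataSource;" \
    --column-length-limit 65536 \
    out.parquet \
    "SELECT long_text_column FROM my_table"
```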
- [breaking] `--column-length-limit` now defaults to `4096`
- Report truncations for sequential fetches
- Mention column name in truncation error.
- Error message for truncation now hints at the `--column-length-limit` option.
- [breaking] The `--concurrent-fetching` flag has been removed, since concurrent fetching is now the default behavior. The `--sequential-fetching` flag has been introduced to opt into the old behaviour.
- Utilize upstream dependency `odbc-api 10.1`, which autodetects the Homebrew library path. This allows for easier builds on macOS ARM platforms.
- Fix: A panic when inserting from a parquet file where the last row group has fewer rows than the other row groups.
- Introduced flag `--concurrent-fetching`. Setting it uses separate system threads for writing to parquet and fetching from the database. This can be a significant speedup, but also increases memory consumption.
Failed release
- DEBUG log messages now show column names as text, rather than UTF-8 bytes
- The "not enough memory" error message now mentions the option to limit column length using `--column-length-limit`
- Utilize ODBC API version 3.5 instead of 3.8 to increase compatibility with older drivers.
- Binary release for Ubuntu ARM architectures. Thanks @sindilevich
- Fix release 6.0.1
- Binary release names for artifacts now end in the architecture rather than the native bit size
- File extensions are now retained when splitting files. E.g. if `--output` is 'my_results.parquet' and the output is split into two files, they will be named 'my_results_01.parquet' and 'my_results_02.parquet'. Previously the ending '.par' had always been attached.
- Fix: 5.1.0 introduced a regression which caused output file enumeration to happen even if file splitting was not activated, if `--no-empty-file` had been set.
- Fix: 5.1.0 introduced a regression which caused output file enumeration to start with a suffix of `2` instead of `1` if, in addition to file splitting, the `--no-empty-file` flag had also been set.
- Additional log message at info level emitting the total number of rows written, the file size and the path for each file.
- Removed flag `--driver-returns-memory-garbage-for-indicators`. Turns out the issue with IBM DB2 drivers which triggered this can better be solved by using a version of their ODBC driver which ends in `o` and is compiled with a 64 Bit size for `SQLLEN`.
- Release for macOS M1 (thanks to the free tier of fly.io)
- Updated dependencies, including a bug fix in decimal parsing. Negative decimals smaller than 1 in absolute value would have been misjudged as positive.
- Updated dependencies, including an update to `parquet-rs 50`
- Decimal parsing is now more robust against different radix characters and missing trailing zeroes.
- Introduced flag `--driver-returns-memory-garbage-for-indicators`. This is a reaction to witnessing IBM DB2/Linux drivers filling the indicator arrays with memory garbage. Activating this flag enables a workaround which uses terminating zeroes to determine string length. With this active, `odbc2parquet` will no longer be able to distinguish between empty strings and NULL and will map everything to NULL. Currently the workaround is only active for UTF-8 encoded payloads.
- Default compression is now `zstd` with level `3`.
- Fix: If ODBC drivers report `-4` (`NO_TOTAL`) as display size, the size can now be controlled with `--column-length-limit`. The issue occurred for JSON columns with MySQL.
- Fix: Invalid UTF-16 encoding emitted from the data source will now cause an error instead of a panic.
- Additional log message emitting the total number of rows fetched so far.
- Fix: `--no-empty-file` now works correctly with options causing files to be split, like `--file-size-threshold` or `--row-groups-per-file`.
- Zero sized columns are now treated as an error. Before, `odbc2parquet` issued a warning and ignored them.
- Fix: `--file-size-threshold` had an issue with not resetting the current file size after starting a new file. This caused only the first file to have the desired size. All subsequent files would contain only one row group each.
- Fix: `--column-length-limit` not only caused large variadic columns to have a sensible upper bound, but also caused columns with a known smaller bound to allocate just as much memory, effectively wasting a lot of memory in some scenarios. In this version the limit is only applied if the column length actually exceeds the specified length limit.
- Fix: Some typos in the `--help` text have been fixed.
- Fix: When fetching the relational type `TINYINT`, the driver is queried for the signedness of the column. The result is now reflected in the logical type written into parquet. In the past `TINYINT` had always been assumed to be signed, even if the ODBC driver described the column as unsigned.
- Fix: The `--help` text for the `query` subcommand wrongly listed `bit-packed` as supported.
- Explicitly check for lower and upper bounds when writing timestamps with nanoseconds precision into a parquet file. Timestamps have to be between `1677-09-21 00:12:44` and `2262-04-11 23:47:16.854775807`. Emit an error and abort if the bounds check fails.
- Write timestamps with precision greater than seven as nanoseconds
- Write Parquet Version 2.0
- Establish semantic versioning
- Update dependencies
- Update dependencies
- Improves error message in case creating an output file fails.
- Accidental release from branch.
- Adds option `--column-compression-level-default` to specify the compression level explicitly. See the sketch below.
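A minimal sketch of combining the compression options; the codec and level values are illustrative, and connection string, output path and query are placeholders:

```shell
# Use zstd with an explicit compression level for all columns.
odbc2parquet query \
    --connection-string "DSN=MyDataSource;" \
    --column-compression-default zstd \
    --column-compression-level-default 10 \
    out.parquet \
    "SELECT * FROM my_table"
```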
- Introduced new option for the `query` subcommand, `--column-length-limit`, to limit the memory allocated for an individual variadic column of the result set.
- Updated dependencies
- Time(p: 7..) is mapped to Timestamp Nanoseconds for Microsoft SQL Server
- Time(p: 4..=6) is mapped to Timestamp Microseconds for Microsoft SQL Server
- Time(p: 0..=3) is mapped to Timestamp Milliseconds for Microsoft SQL Server
- Introduced new flag for the `query` subcommand, `--no-empty-file`, which prevents creation of an output file in case the query comes back with `0` rows.
- Updated dependencies
- Time(p) is mapped to Timestamp Nanoseconds for Microsoft SQL Server
- Fix: Fixed an issue where setting `--column-compression-default` to `snappy` resulted in the column compression default actually being set to `zstd`.
- Introduce flag `--avoid-decimal` to produce output without the logical type `DECIMAL`. This allows artifacts without decimal support to process the output of `odbc2parquet`.
- The level of verbosity had been one level too high: `--quiet` now suppresses warning messages as intended, `-v` maps to `Info` and `-vv` to `Debug`.
- `DATETIMEOFFSET` on Microsoft SQL Server is now mapped to `TIMESTAMP` with instant semantics, i.e. it is mapped to UTC and understood to reference a specific point in time, rather than a wall clock time.
- Allow specifying the ODBC connection string via the environment variable `ODBC_CONNECTION_STRING` instead of the `--connection-string` option, as sketched below.
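A minimal sketch of relying on the environment variable; the connection string, output path and query are placeholders:

```shell
# The connection string is picked up from the environment,
# so it does not have to be passed on the command line.
export ODBC_CONNECTION_STRING="DSN=MyDataSource;"
odbc2parquet query out.parquet "SELECT * FROM my_table"
```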
- Pad suffixes (e.g. `_01`) with leading zeroes to make the file names more friendly for lexical sorting if splitting fetch output. The number is padded to two digits by default.
- Updated dependencies
- Use narrow text on non-Windows platforms by default. Connection strings, queries and error messages are assumed to be UTF-8 and not transcoded to and from UTF-16.
- Physical type of `DECIMAL` is now `INT64` instead of `FIXED_LEN_BYTE_ARRAY` if precision does not exceed 18.
- Physical type of `DECIMAL` is now `INT32` instead of `FIXED_LEN_BYTE_ARRAY` if precision does not exceed 9.
- Dropped support for Decimals and Numerics with precision higher than `38`. Please open an issue if required. Microsoft SQL Server supports this type up to this precision, so currently there is no easy way to test for `DECIMAL`s which can not be represented as `i128`.
- Fetching decimal columns with scale `0` and `--driver-does-not-support-64bit-integers` now specifies the logical type as `DECIMAL`. The physical type remains a 64 Bit Integer.
- Updated dependencies
- Updated dependencies
- Pass `-` as the query string to read the query statement text from standard input instead, as sketched below.
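For example, the query can be piped in. This is a sketch; the argument order of output path followed by query string is an assumption, and the connection string is a placeholder:

```shell
# Read the SQL statement from standard input by passing `-` as the query string.
echo "SELECT * FROM my_table" | odbc2parquet query \
    --connection-string "DSN=MyDataSource;" \
    out.parquet \
    -
```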
- Updated dependencies
- Release binary artifact for `x86_64-ubuntu`.
- Introduced flag `--no-color` which allows suppressing colors in the log output.
- `query` now allows for specifying `-` as a positional output argument in order to stream to standard out instead of writing to a file, as sketched below.
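A sketch of streaming the result; the connection string and query are placeholders, and the output can of course also be piped into another process instead of being redirected to a file:

```shell
# Pass `-` as the positional output argument to write parquet to standard out.
odbc2parquet query \
    --connection-string "DSN=MyDataSource;" \
    - \
    "SELECT * FROM my_table" > my_table.parquet
```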
- `--batches-per-file` is now named `--row-groups-per-file`.
- New `query` option `--file-size-threshold`.
- Fixed a bug causing `--batch-size-memory` to be interpreted as many times the specified limit.
- Updated dependencies. Including `parquet 15.0.0`
- The `query` option `--batch-size-mib` is now `--batch-size-memory` and allows specifying inputs with SI units, e.g. `2GiB`, as sketched below.
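A sketch of limiting the fetch buffer memory; the limit value, connection string, output path and query are illustrative placeholders:

```shell
# Cap the fetch buffer at 2 GiB instead of relying on the default batch size.
odbc2parquet query \
    --connection-string "DSN=MyDataSource;" \
    --batch-size-memory 2GiB \
    out.parquet \
    "SELECT * FROM my_table"
```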
- Updated dependencies. Improvements in upstream `odbc-api` may lead to faster insertion if using many batches.
- Updated dependencies. Including `parquet 14.0.0`
- Updated dependencies.
- Undo: Recover from failed memory allocations of binary and text buffers, because of unclear performance implications.
- Recover from failed memory allocations of binary and text buffers and terminate the tool gracefully.
- Update dependencies. This includes an upstream improvement in `odbc-api 0.36.1` which emits a better error if the `unixODBC` version does not support `ODBC 3.80`.
- Updated dependencies
- Updated dependencies
Peace for the citizens of Ukraine who fight for their freedom and stand up to oppression. Peace for the Russian soldier, who does not know why he is shooting at his brothers and sisters, may he be reunited with his family soon.
Peace to 🇺🇦, 🇷🇺 and the world. May sanity prevail.
- Updating dependencies.
- Added a message for Oracle users telling them about `--driver-does-not-support-64bit-integers`, if SQLFetch fails with `HY004`.
- Update dependencies. Including `parquet 9.0.2`.
- Introduce flag `--driver-does-not-support-64bit-integers` in order to compensate for missing 64 Bit integer support in the Oracle driver.
- Updated dependencies
- Including update to `parquet 8.0.0`
- Updated dependencies
- Including update to `parquet 7.0.0`
- New `Completion` subcommand to generate shell completions.
- Fix: An issue with not reserving enough memory for the largest possible string if the octet length reported by the driver is too small. Now the calculation is based on column size.
- Update dependencies.
- Includes upstream fix: Passwords containing a `+` character are now escaped if passed via the `--password` command line option.
- Update dependencies.
- Update dependencies.
- Use less memory for Text columns.
- Update dependencies
- Fix: Version number
- Fix: An issue with the mapping of ODBC data type FLOAT has been resolved. Before it had always been mapped to 32 Bit floating point precision. Now the precision of that column is also taken into account to map it to a 64 Bit floating point in case the precision exceeds 24.
- Optimization: Required columns which do not require conversion to parquet types during fetch are now no longer copied into the intermediate buffer. This results in a little less memory usage and faster processing of required (i.e. NOT NULL) columns with the following types:
- Double
- Real
- Float
- TinyInteger
- SmallInteger
- Integer
- Big Int
- and Decimals with Scale 0 and precision <= 18.
- Fix: An issue with the ODBC buffer allocated for `NUMERIC` and `DECIMAL` types being two bytes too short, which could lead to wrong values being written into parquet without emitting an error message.
- Both `--batch-size-row` and `--batch-size-mib` can now be specified together.
- If no `--batch-size-*` limit is specified, a row limit of 65535 is now also applied by default, next to the size limit.
- Fixed an issue where large batch sizes could cause failures writing Boolean columns.
- Updated dependencies
- Updated dependencies
- Updated dependencies
- Allow specifying fallback encodings for output parquet files when using the `query` subcommand.
- Updated dependencies.
- Better error message in case unixODBC version is outdated.
- Default log level is now warning. A `--quiet` flag has been introduced to suppress warnings, if desired.
- Introduce the `--prefer-varbinary` flag to the `query` subcommand. It allows for mapping `BINARY` SQL columns to `BYTE_ARRAY` instead of `FIXED_LEN_BYTE_ARRAY`. This flag has been introduced in an effort to increase the compatibility of the output with Spark. A usage sketch follows below.
- Fix: Columns for which the driver reported size zero were not ignored when UTF-16 encoding had been enabled, which is the default setting on Windows. These columns are now completely missing from the output file, instead of the column being present with all values being NULL or empty strings.
- Introduced support for connecting via GUI on Windows platforms via the `--prompt` flag.
- Introduced `--column-compression-default` in order to allow users to specify column compression.
- Default column compression is now `gzip`.
- Requires at least Rust 1.51.0 to build.
- Command line parameters `user` and `password` will no longer be ignored when passed together with a connection string. Instead their values will be appended as `UID` and `PWD` attributes at the end.
- The `--batch-size` command line flag has been renamed to `--batch-size-row`.
- Introduced `--batch-size-mib` command line flag to limit batch size based on memory usage.
- Default batch size is adapted so buffer allocation requires 2 GiB on 64 Bit platforms and 1 GiB on 32 Bit platforms.
- Fix: There is now an error message produced if the resulting parquet file would not contain any columns.
- Add new subcommand `insert`.
- Fix: Right truncation of values in fixed sized `NCHAR` columns had occurred if a character in the value used more than one byte in UTF-8 encoding (or more than two bytes for UTF-16).
- Fix: On Windows platforms the tool now uses UTF-16 encoding by default to exchange character data with the data source. The behaviour has been changed since on most Windows platforms the system locale is not configured to use UTF-8. The behaviour can be configured manually on any platform using the newly introduced `--encoding` option.
- Fix: Interior nuls within `VARCHAR` values caused the tool to panic. Now these values are written into parquet as is.
- Fix: Replace non UTF-8 characters with the UTF-8 replacement character (`�`). ODBC encodes strings according to the current locale, so this issue could cause non UTF-8 characters to be written into parquet text columns on Windows systems. If a non UTF-8 character is encountered, a warning is generated hinting at the user to change to a UTF-8 locale.
- `VARBINARY` and `BINARY` SQL columns are now mapped to the `BYTE_ARRAY` and `FIXED_LEN_BYTE_ARRAY` parquet physical types.
- Update dependencies
- Builds with stable Rust
- Update to `parquet 3.0.0`.
- Maps ODBC `Timestamp`s with precision <= 3 to parquet `TIMESTAMP_MILLISECONDS`.
- Updated dependencies
- Introduces option `--batches-per-file` in order to define an upper limit for batches in a single output file and split output across multiple files.
- Fix: Microsoft SQL Server user defined types with unbounded lengths have been mapped to Text columns with length zero. This caused at least one warning per row. These columns are now ignored at the beginning, causing exactly one warning. They also do no longer appear in the output schema.
- Fix: Allocate extra space in text column for multi byte UTF-8 characters.
- SQL Numeric and Decimal are now always mapped to the parquet Decimal type, independent of the precision or scale. The 32 Bit or 64 Bit "physical" integer representation is chosen for SQL types with scale zero and a precision smaller than 10 or 19 respectively; otherwise the "physical" type is a fixed size byte array.
- Fix: Tool could panic if too many warnings were generated at once.
- Introduces subcommand `list-data-sources`.
- Introduces subcommands. `query` is now required to query the database and store contents into parquet.
- Introduces `drivers` subcommand.
- Adds support for parameterized queries.
- Fix: A major issue caused columns containing NULL values to either cause a panic or even worse, produce a parquet file with wrong data in the affected column without showing any error at all.
- Binary release of 32 Bit Window executable on GitHub
- Binary release for OS-X
- Connection string is no longer a positional argument.
- Allow connecting to an ODBC datasource using dsn.
- Binary release of 64 Bit Window executable on GitHub
- Maps ODBC `Bit` to Parquet `Boolean`.
- Maps ODBC `Tinyint` to Parquet `INT 8`.
- Maps ODBC `Real` to Parquet `Float`.
- Maps ODBC `Numeric` the same as it would `DECIMAL`.
- Default row group size is now `100000`.
- Adds support for Decimal types with precision 0..18 and scale = 0 (i.e. everything that has a straightforward `i32` or `i64` representation).
- Fix: Fixed an issue where some column types were not bound to the cursor, which led to some columns only containing `NULL` or `0`.
- Retrieve column names more reliably with a greater range of drivers.
- Log batch number and numbers of rows at info level.
- Log bound and detected ODBC type.
- Auto generate names for unnamed columns.
Initial release