Commit

Merge pull request #391 from glotzerlab/new-character-type
Add new character type
joaander authored Oct 18, 2024
2 parents 416a4f7 + 59e4213 commit c75aa35
Showing 16 changed files with 261 additions and 104 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/build_wheels.yaml
@@ -38,11 +38,11 @@ jobs:

python:
- version: 'cp310'
oldest_numpy: '1.21.6'
oldest_numpy: '2.0.0'
- version: 'cp311'
oldest_numpy: '1.23.2'
oldest_numpy: '2.0.0'
- version: 'cp312'
oldest_numpy: '1.26.2'
oldest_numpy: '2.0.0'
- version: 'cp313'
oldest_numpy: '2.1.1'

@@ -53,7 +53,7 @@ jobs:
uses: pypa/cibuildwheel@d4a2945fcc8d13f20a1b99d461b8e844d5fc6e23 # v2.21.1
env:
CIBW_BUILD: "${{ matrix.python.version }}-*"
CIBW_TEST_REQUIRES: pytest==8.2.1 numpy==${{ matrix.python.oldest_numpy }}
CIBW_TEST_REQUIRES: pytest==8.3.3 numpy==${{ matrix.python.oldest_numpy }}

- uses: actions/upload-artifact@50769540e7f4bd5e21e526ee35c689e35e0d6874 # v4.4.0
with:
6 changes: 3 additions & 3 deletions .github/workflows/unit_test.yaml
@@ -30,15 +30,15 @@ jobs:
- os: windows-2019
python: '3.12'
- os: windows-2022
python: '3.13.0-rc.1'
python: '3.13'
##############
# Mac
# macos-x86_64
- os: macos-13
python: '3.12'
# macos-arm64
- os: macos-14
python: '3.13.0-rc.1'
python: '3.13'
##############
# Ubuntu 24.04
- os: ubuntu-24.04
@@ -58,7 +58,7 @@ jobs:
c_compiler: clang-18
cxx_compiler: clang++-18
- os: ubuntu-24.04
python: '3.13.0-rc.1'
python: '3.13.0'
c_compiler: clang-18
cxx_compiler: clang++-18
##############
13 changes: 13 additions & 0 deletions CHANGELOG.rst
@@ -10,6 +10,19 @@ Change Log
3.x
---

3.4.0 (not yet released)
^^^^^^^^^^^^^^^^^^^^^^^^

*Added:*

* New chunk type for string data - valid in file layer versions 2.1 and later
(`#391 <https://github.com/glotzerlab/gsd/pull/391>`__).

*Changed:*

* Require NumPy >= 2.0
(`#391 <https://github.com/glotzerlab/gsd/pull/391>`__).

3.3.2 (2024-09-06)
^^^^^^^^^^^^^^^^^^

2 changes: 1 addition & 1 deletion INSTALLING.rst
@@ -109,7 +109,7 @@ Install prerequisites

* **C compiler** (tested with gcc 10-14, clang 10-18, Visual Studio 2019-2022)
* **Python** >= 3.10
* **numpy** >= 1.19.0
* **numpy** >= 2.0.0
* **Cython** >= 0.22

**To execute unit tests:**
1 change: 1 addition & 0 deletions doc/credits.rst
@@ -18,3 +18,4 @@ The following people contributed to GSD.
* Alexander Stukowski, OVITO GmbH
* Charlotte Shiqi Zhao, University of Michigan
* Tim Moore, University of Michigan
* Joseph Burkhart, University of Michigan
14 changes: 10 additions & 4 deletions doc/file-layer.rst
@@ -6,7 +6,7 @@ File layer

.. highlight:: c

**Version: 2.0**
**Version: 2.x**

General simulation data (GSD) **file layer** design and rationale. These use
cases and design specifications define the low level GSD file format.
@@ -128,7 +128,7 @@ There are four types of data blocks in a GSD file.

* List of string names used by index entries.
* v1.0 files: Each name is a 64-byte character string.
* v2.0 files: Names may have any length and are separated by 0 terminators.
* v2.x files: Names may have any length and are separated by 0 terminators.
* The first name that starts with the 0 byte marks the end of the list
* The header stores the total size of the name list block.

@@ -215,13 +215,13 @@ non-standard packing attributes or pragmas to enforce this.
In v1.0 files, the frame index must monotonically increase from one index entry
to the next. The GSD API ensures this.

In v2.0 files, the entire index block is stored sorted first by frame, then
In v2.x files, the entire index block is stored sorted first by frame, then
by *id*.

Namelist block
^^^^^^^^^^^^^^

In v2.0 files, the namelist block stores a list of strings separated by 0
In v2.x files, the namelist block stores a list of strings separated by 0
terminators.

In v1.0 files, the namelist block stores a list of 0-terminated strings in
@@ -235,3 +235,9 @@ Data block
A data block stores raw data bytes on the disk. For a given index entry
``entry``, the data starts at location ``entry.location`` and is the next
``entry.N * entry.M * gsd_sizeof_type(entry.type)`` bytes.
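
As a rough illustration of the size formula above, the following Python sketch computes how many bytes a chunk occupies. The `TYPE_SIZES` table and the `chunk_nbytes` helper are hypothetical names used only for this example and are not part of the GSD API; the sizes are assumed to match what `gsd_sizeof_type` reports for the corresponding C types on common platforms.

```python
# Hypothetical lookup table mirroring gsd_sizeof_type(); the keys are
# illustrative labels, not identifiers from the GSD library.
TYPE_SIZES = {
    "uint8": 1, "uint16": 2, "uint32": 4, "uint64": 8,
    "int8": 1, "int16": 2, "int32": 4, "int64": 8,
    "float": 4, "double": 8,
    "character": 1,  # sizeof(char), the type added with file layer 2.1
}


def chunk_nbytes(n_rows, n_cols, type_name):
    """Return N * M * sizeof(type) for one data chunk."""
    return n_rows * n_cols * TYPE_SIZES[type_name]


# A 100 x 3 array of 32-bit floats.
print(chunk_nbytes(100, 3, "float"))
# A 16-byte string stored as a character chunk with M = 1.
print(chunk_nbytes(16, 1, "character"))
```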

Added in version 2.1
--------------------

* The ``GSD_CHARACTER`` chunk type represents a UTF-8 string (null termination is allowed, but not
required).
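
To make the byte layout concrete, here is a minimal sketch of how a Python string could map onto a character chunk. The helper names are hypothetical and this is not the library's implementation. It assumes that ``N`` counts encoded UTF-8 bytes (since the data block size is ``N * M * sizeof(char)``), that ``M`` is 1, and that a reader may discard trailing zero bytes because the terminator is optional.

```python
def encode_character_chunk(text):
    """Return (payload, N, M) for a string stored as UTF-8 bytes."""
    payload = text.encode("utf-8")
    # N is the number of encoded bytes, which can exceed the character count.
    return payload, len(payload), 1


def decode_character_chunk(payload):
    """Decode a chunk payload, tolerating an optional trailing 0 terminator."""
    # Assumption: trailing zero bytes carry no meaning and can be dropped.
    return payload.rstrip(b"\x00").decode("utf-8")


payload, n, m = encode_character_chunk("naïve")
print(n, m)  # 6 1: five characters, but "ï" occupies two bytes
print(decode_character_chunk(payload + b"\x00"))
```

The non-ASCII example shows why the byte count, not the character count, determines the chunk size.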
14 changes: 4 additions & 10 deletions doc/fl-examples.rst
@@ -198,21 +198,15 @@ Store string chunks
application="My application",
schema="My Schema",
schema_version=[1,0])
f.mode
s = "This is a string"
b = numpy.array([s], dtype=numpy.dtype((bytes, len(s)+1)))
b = b.view(dtype=numpy.int8)
b
f.write_chunk(name='string', data=b)
f.write_chunk(name='string', data="This is a string")
f.end_frame()
r = f.read_chunk(frame=0, name='string')
r
r = r.view(dtype=numpy.dtype((bytes, r.shape[0])));
r[0].decode('UTF-8')
f.close()
To store a string in a gsd file, convert it to a numpy array of bytes and store
that data in the file. Decode the byte sequence to get back a string.
Starting with GSD 3.4.0, the file layer can store strings natively.
In previous versions, you must convert a string to a numpy array of bytes and store
that data in the file.
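
For files written with versions before 3.4.0, the conversion can be sketched with numpy alone, following the approach of the earlier example. The helper names `string_to_int8` and `int8_to_string` are hypothetical and are not part of the GSD API. The resulting array is what would be passed as ``data`` to ``write_chunk``, and the reverse helper would be applied to the array returned by ``read_chunk``.

```python
import numpy


def string_to_int8(text):
    """Pack a string into a 1-D int8 array with one trailing 0 byte."""
    raw = text.encode("utf-8")
    # A fixed-width bytes dtype one byte longer than the data pads with a 0.
    packed = numpy.array([raw], dtype=numpy.dtype((bytes, len(raw) + 1)))
    return packed.view(dtype=numpy.int8)


def int8_to_string(array):
    """Reinterpret a 1-D int8 array as bytes and decode it as UTF-8."""
    as_bytes = array.view(dtype=numpy.dtype((bytes, array.shape[0])))
    # Indexing a fixed-width bytes array drops trailing 0 bytes.
    return as_bytes[0].decode("UTF-8")


b = string_to_int8("This is a string")
print(b.shape)            # 16 bytes of text plus the terminator
print(int8_to_string(b))
```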

Truncate
^^^^^^^^
119 changes: 74 additions & 45 deletions gsd/fl.pyx
@@ -557,59 +557,78 @@ cdef class GSDFile:
if not self.__is_open:
raise ValueError("File is not open")

data_array = numpy.ascontiguousarray(data)
if data_array is not data:
logger.warning('implicit data copy when writing chunk: ' + name)
data_array = data_array.view()

cdef uint64_t N
cdef uint32_t M

if len(data_array.shape) > 2:
raise ValueError("GSD can only write 1 or 2 dimensional arrays: "
+ name)
cdef libgsd.gsd_type gsd_type
cdef void *data_ptr

if len(data_array.shape) == 1:
data_array = data_array.reshape([data_array.shape[0], 1])
# Special behavior for handling strings
if type(data) is str:
bytes_array = numpy.array([data], dtype=numpy.dtype((bytes, len(data))))
bytes_view = bytes_array.view(dtype=numpy.int8).reshape((len(data),1))

N = data_array.shape[0]
M = data_array.shape[1]
N = len(data)
M = 1

cdef libgsd.gsd_type gsd_type
cdef void *data_ptr
if data_array.dtype == numpy.uint8:
gsd_type = libgsd.GSD_TYPE_UINT8
data_ptr = __get_ptr_uint8(data_array)
elif data_array.dtype == numpy.uint16:
gsd_type = libgsd.GSD_TYPE_UINT16
data_ptr = __get_ptr_uint16(data_array)
elif data_array.dtype == numpy.uint32:
gsd_type = libgsd.GSD_TYPE_UINT32
data_ptr = __get_ptr_uint32(data_array)
elif data_array.dtype == numpy.uint64:
gsd_type = libgsd.GSD_TYPE_UINT64
data_ptr = __get_ptr_uint64(data_array)
elif data_array.dtype == numpy.int8:
gsd_type = libgsd.GSD_TYPE_INT8
data_ptr = __get_ptr_int8(data_array)
elif data_array.dtype == numpy.int16:
gsd_type = libgsd.GSD_TYPE_INT16
data_ptr = __get_ptr_int16(data_array)
elif data_array.dtype == numpy.int32:
gsd_type = libgsd.GSD_TYPE_INT32
data_ptr = __get_ptr_int32(data_array)
elif data_array.dtype == numpy.int64:
gsd_type = libgsd.GSD_TYPE_INT64
data_ptr = __get_ptr_int64(data_array)
elif data_array.dtype == numpy.float32:
gsd_type = libgsd.GSD_TYPE_FLOAT
data_ptr = __get_ptr_float32(data_array)
elif data_array.dtype == numpy.float64:
gsd_type = libgsd.GSD_TYPE_DOUBLE
data_ptr = __get_ptr_float64(data_array)
gsd_type = libgsd.GSD_TYPE_CHARACTER
data_ptr = __get_ptr_int8(bytes_view)

# Non-string behavior
else:
raise ValueError("invalid type for chunk: " + name)
data_array = numpy.ascontiguousarray(data)

if data_array is not data:
logger.warning('implicit data copy when writing chunk: ' + name)
data_array = data_array.view()



if len(data_array.shape) > 2:
raise ValueError("GSD can only write 1 or 2 dimensional arrays: "
+ name)

if len(data_array.shape) == 1:
data_array = data_array.reshape([data_array.shape[0], 1])

N = data_array.shape[0]
M = data_array.shape[1]

if data_array.dtype == numpy.uint8:
gsd_type = libgsd.GSD_TYPE_UINT8
data_ptr = __get_ptr_uint8(data_array)
elif data_array.dtype == numpy.uint16:
gsd_type = libgsd.GSD_TYPE_UINT16
data_ptr = __get_ptr_uint16(data_array)
elif data_array.dtype == numpy.uint32:
gsd_type = libgsd.GSD_TYPE_UINT32
data_ptr = __get_ptr_uint32(data_array)
elif data_array.dtype == numpy.uint64:
gsd_type = libgsd.GSD_TYPE_UINT64
data_ptr = __get_ptr_uint64(data_array)
elif data_array.dtype == numpy.int8:
gsd_type = libgsd.GSD_TYPE_INT8
data_ptr = __get_ptr_int8(data_array)
elif data_array.dtype == numpy.int16:
gsd_type = libgsd.GSD_TYPE_INT16
data_ptr = __get_ptr_int16(data_array)
elif data_array.dtype == numpy.int32:
gsd_type = libgsd.GSD_TYPE_INT32
data_ptr = __get_ptr_int32(data_array)
elif data_array.dtype == numpy.int64:
gsd_type = libgsd.GSD_TYPE_INT64
data_ptr = __get_ptr_int64(data_array)
elif data_array.dtype == numpy.float32:
gsd_type = libgsd.GSD_TYPE_FLOAT
data_ptr = __get_ptr_float32(data_array)
elif data_array.dtype == numpy.float64:
gsd_type = libgsd.GSD_TYPE_DOUBLE
data_ptr = __get_ptr_float64(data_array)
else:
raise ValueError("invalid type for chunk: " + name)

# Once we have the data pointer, the behavior should be identical
# for all data types
logger.debug('write chunk: ' + self.name + ' - ' + name)

cdef char * c_name
@@ -787,6 +806,9 @@ cdef class GSDFile:
elif gsd_type == libgsd.GSD_TYPE_DOUBLE:
data_array = numpy.empty(dtype=numpy.float64,
shape=[index_entry.N, index_entry.M])
elif gsd_type == libgsd.GSD_TYPE_CHARACTER:
data_array = numpy.empty(dtype=numpy.int8,
shape=[index_entry.M, index_entry.N])
else:
raise ValueError("invalid type for chunk: " + name)

@@ -815,6 +837,8 @@ cdef class GSDFile:
data_ptr = __get_ptr_float32(data_array)
elif gsd_type == libgsd.GSD_TYPE_DOUBLE:
data_ptr = __get_ptr_float64(data_array)
elif gsd_type == libgsd.GSD_TYPE_CHARACTER:
data_ptr = __get_ptr_int8(data_array)
else:
raise ValueError("invalid type for chunk: " + name)

@@ -826,6 +850,11 @@ cdef class GSDFile:
__raise_on_error(retval, self.name)

if index_entry.M == 1:
if gsd_type == libgsd.GSD_TYPE_CHARACTER:
data_array = data_array.flatten()
bytes_array = data_array.view(dtype=numpy.dtype((bytes, data_array.shape[0])))
return bytes_array[0].decode("UTF-8")

return data_array.reshape([index_entry.N])
else:
return data_array
37 changes: 33 additions & 4 deletions gsd/gsd.c
@@ -86,7 +86,12 @@ enum
/// Current GSD file specification
enum
{
GSD_CURRENT_FILE_VERSION = 2
GSD_CURRENT_FILE_VERSION_MAJOR = 2
};

enum
{
GSD_CURRENT_FILE_VERSION_MINOR = 1
};

// define windows wrapper functions
@@ -1384,7 +1389,8 @@ gsd_initialize_file(int fd, const char* application, const char* schema, uint32_
gsd_util_zero_memory(&header, sizeof(header));

header.magic = GSD_MAGIC_ID;
header.gsd_version = gsd_make_version(GSD_CURRENT_FILE_VERSION, 0);
header.gsd_version
= gsd_make_version(GSD_CURRENT_FILE_VERSION_MAJOR, GSD_CURRENT_FILE_VERSION_MINOR);
strncpy(header.application, application, sizeof(header.application) - 1);
header.application[sizeof(header.application) - 1] = 0;
strncpy(header.schema, schema, sizeof(header.schema) - 1);
@@ -1607,6 +1613,24 @@ inline static int gsd_initialize_handle(struct gsd_handle* handle)
handle->maximum_write_buffer_size = GSD_DEFAULT_MAXIMUM_WRITE_BUFFER_SIZE;
handle->index_entries_to_buffer = GSD_DEFAULT_INDEX_ENTRIES_TO_BUFFER;

// Silently upgrade writable files from a previous matching major version to the latest
// minor version.
if ((handle->open_flags == GSD_OPEN_READWRITE || handle->open_flags == GSD_OPEN_APPEND)
&& (handle->header.gsd_version
!= gsd_make_version(GSD_CURRENT_FILE_VERSION_MAJOR, GSD_CURRENT_FILE_VERSION_MINOR))
&& (handle->header.gsd_version >> (sizeof(uint32_t) * 4) == GSD_CURRENT_FILE_VERSION_MAJOR))
{
handle->header.gsd_version
= gsd_make_version(GSD_CURRENT_FILE_VERSION_MAJOR, GSD_CURRENT_FILE_VERSION_MINOR);
size_t bytes_written
= gsd_io_pwrite_retry(handle->fd, &(handle->header), sizeof(struct gsd_header), 0);

if (bytes_written != sizeof(struct gsd_header))
{
return GSD_ERROR_IO;
}
}

return GSD_SUCCESS;
}
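
The version check in the hunk above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the library's code: it assumes `gsd_make_version` packs the major version into the upper 16 bits and the minor version into the lower 16 bits of a 32-bit value, which is consistent with the `>> (sizeof(uint32_t) * 4)` shift used to extract the major version. The function names are hypothetical.

```python
CURRENT_MAJOR = 2
CURRENT_MINOR = 1


def make_version(major, minor):
    """Pack major/minor into one integer (assumed layout: major << 16 | minor)."""
    return (major << 16) | minor


def needs_minor_upgrade(file_version, writable):
    """Mirror the three conditions guarding the silent header rewrite."""
    current = make_version(CURRENT_MAJOR, CURRENT_MINOR)
    same_major = (file_version >> 16) == CURRENT_MAJOR
    return writable and file_version != current and same_major


print(hex(make_version(2, 1)))
print(needs_minor_upgrade(make_version(2, 0), writable=True))
print(needs_minor_upgrade(make_version(1, 0), writable=True))
```

Under these assumptions, a writable v2.0 file is relabeled as v2.1 when opened, while v1.0 files are left alone and still need the explicit upgrade path.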

@@ -2342,6 +2366,10 @@ size_t gsd_sizeof_type(enum gsd_type type)
{
val = sizeof(double);
}
else if (type == GSD_TYPE_CHARACTER)
{
val = sizeof(char);
}
else
{
return 0;
@@ -2554,8 +2582,9 @@ int gsd_upgrade(struct gsd_handle* handle)
}
}

// label the file as a v2.0 file
handle->header.gsd_version = gsd_make_version(GSD_CURRENT_FILE_VERSION, 0);
// GSD always writes files matching the current major and minor version.
handle->header.gsd_version
= gsd_make_version(GSD_CURRENT_FILE_VERSION_MAJOR, GSD_CURRENT_FILE_VERSION_MINOR);

// write the new header out
ssize_t bytes_written
5 changes: 4 additions & 1 deletion gsd/gsd.h
@@ -48,7 +48,10 @@ extern "C"
GSD_TYPE_FLOAT,

/// 64-bit floating point number.
GSD_TYPE_DOUBLE
GSD_TYPE_DOUBLE,

/// 8-bit character.
GSD_TYPE_CHARACTER
};

/// Flag for GSD file open options