Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zserio::Array is cumbersome to use #67

Open
MikeLankamp-TomTom opened this issue Jan 14, 2025 · 6 comments
Open

zserio::Array is cumbersome to use #67

MikeLankamp-TomTom opened this issue Jan 14, 2025 · 6 comments

Comments

@MikeLankamp-TomTom
Copy link

MikeLankamp-TomTom commented Jan 14, 2025

I'd like to start a discussion about zserio::Array's ease of use. It takes several template parameters:

  • The underlying storage to the raw data (e.g. std::vector<T>).
  • The type of array (normal, implicit, aligned, auto, auto aligned).
  • Whether it's (im)mutable.
  • The array's traits, which e.g. returns views around raw data.

This is quite cumbersome to use. For example, if I have a compound type with two arrays, and I want to write a function that returns either of the arrays for a given object as a reference (because I don't want to pay for a copy), I have to specify quite a full type, e.g.:

const zserio::Array<std::vector<zserio::View<SomeType>>, zserio::ArrayType::NORMAL>& GetRelevantField(const zserio::View<SomeCompound>& compound);

This gets even more complex if the storage isn't immutable.

Instead, I'd prefer to be able to write something simple like this:

const zserio::Array<zserio::View<SomeType>>& GetRelevantField(const zserio::View<SomeCompound>& compound);

I believe this should be possible in principle, because ArrayType is only relevant for serialization. Perhaps this can be removed from the Array type, and passed as harcoded argument when writing. It's really something that users of the API shouldn't have to care about.

Similarly, at first glance, ArrayStorage seems weird to have. Why is having a const Array<...>& not good enough? Why must the Array itself know whether it's mutable or not?

And since ARRAY_TRAITS is taken from a specialization of zserio::ArrayTraits<T> anyway, why bother having it as a template argument at all?

Edit: about ARRAY_TRAITS, I see now that it's not always from zserio::ArrayTraits. It's possible to have two arrays with the same value type, but have different traits because they come from e.g. different members of a compound struct. This means I cannot use the same code to handle them both, unless I template it. I'd still like it if this array traits were somehow not a part of the type.

Lastly, is there a use case for having something other than std::vector as storage? Perhaps if we wish to keep supporting this, we can change the definition to the following, to simplify the common case:

template <typename VALUE_TYPE, typename STORAGE_TYPE = std::vector<VALUE_TYPE>>
class Array;

Edit: I also see that Array doesn't own its storage, so perhaps ArrayView or Span is a better name, for consistency and clarity? i.e. ArrayView<SomeType> (also known as a Span, in which case a begin/end iterator is enough, and the underlying storage isn't needed at all?)

@mikir
Copy link
Collaborator

mikir commented Jan 14, 2025

Thanks for a comprehensive description and for sharing the ideas.

First of all, your last note is correct. Array doesn't own its storage, it is actually View to Zserio array. This View holds reference to raw array and creates Views to elements on demand (e.g. imagine the big array where access to only one element is needed). We could rename it to ArrayView if it helps. These array views are created by Zserio and are not intended to be created by users. Then, users could use auto keyword instead of the whole template specification.

Similarly, at first glance, ArrayStorage seems weird to have. Why is having a const Array<...>& not good enough? Why must the Array itself know whether it's mutable or not?

This is necessary for the indexed offsets. Offsets are evil. We need to have possibility to set elements in the array used as an indexed offset array. Therefore, we need to support mutable arrays as well which must hold non-const reference to the underlying array. Currently, indexed offsets arrays are implemented as mutable std::vector<> in Data.

This is quite cumbersome to use. For example, if I have a compound type with two arrays, and I want to write a function that returns either of the arrays for a given object as a reference (because I don't want to pay for a copy), I have to specify quite a full type, e.g.

If you need to work with all Views to elements, you can copy the ArrayView to the separate std::vector<View> to initialize forcibly all Views to elements. This should be now possible because issue #66 has been already resolved.

And since ARRAY_TRAITS is taken from a specialization of zserio::ArrayTraits anyway, why bother having it as a template argument at all?

We agree that template arguments are annoying. We were not able to find out better solution for that. We could probably move them to the runtime as the constructor arguments. But this will bring some performance drop together with bigger memory allocations.

Lastly, is there a use case for having something other than std::vector as storage? Perhaps if we wish to keep supporting this, we can change the definition to the following, to simplify the common case:

If we change template argument RAW_ARRAY to the VALUE_TYPE, we will need to specify the ALLOCATOR for used underlying std::vector.

@MikeLankamp-TomTom
Copy link
Author

If you need to work with all Views to elements, you can copy the ArrayView to the separate std::vector to initialize forcibly all Views to elements.

I don't want to copy anything. I want to write a function that works on an Array (as a light-weight view), e.g.:

void process(Array<...> data);

Now I have to potentially spell out multiple verbose template arguments, plus the underlying storage. Whereas I just want to work over a sequence of elements of type T, so I just want to type Array<T>. I don't want to have to copy or allocate anything.

Then, users could use auto keyword instead of the whole template specification.

See my previous point: for methods, that's an issue. auto arguments are templates in disguise. And if I have a header/source separation, I can't even do that properly. I have to spell it out.

If we change template argument RAW_ARRAY to the VALUE_TYPE, we will need to specify the ALLOCATOR for used underlying std::vector.

But Array is a view, it shouldn't allocate anything, it should just have a reference to the storage. Or even better, a raw pointer + size, like std::span. Right?

For the sake of argument, let's re-imagine Array<...> into Span<T>, where iterator dereferencing and indexing need to create a View<T> on-the-fly. Open points:

  • indexed offsets: are these offsets not modified during serialization? Do they have to use the Array/Span API? Can't this offset logic bypass the whole "view" aspect and just modify offsets in the raw data? In other words: can't we restrict Array/Span to be truly read-only? Views are read-only too, and they might contain offsets in their data?
  • Specifying an allocator: as I mentioned, why? Store a 'const T*andstd::size_t, that's all you need. It just requires that all backing storages are contiguous (like std::span`).
  • Traits: so the big hurdle is how to have access to the parameters to create Views on the fly during dereferencing/indexing. Since normal Views store their parameters, can't an Array/Span do the same? So a Span<T>, besides storing const *T and std::size_t, will also store the parameters for T, by value? For indexed parameters, I guess it would have to own a std::vector or something?

@Mi-La
Copy link
Collaborator

Mi-La commented Jan 14, 2025

Array can probably hold Span<T> instead of std::vector<T>, it's a good idea to get rid of the allocators completely.

For setting of indexed offsets we need to know parameters since the offset expression can contain parameters - e.g. offset field lies in compound provided as a parameter. So we need some kind of View.

Views can store parameters because Views are generated, while Array is only a single template in runtime. Since elements can use even the parameters from the compound which owns the Array, we store the View to the owner within the Array.

I understand that it's not nice to specify the full type manually, but since you will work with Array<T>, I would expect to have such a method implemented as a template. Then it would be easy to have it like:

template <ARRAY>
void process(ARRAY& array)
{...}

It seems weird to me to have implementation for one specific type in a source file (cpp).

In all cases we were not able to find out better solution for template arguments as Miki wrote above.
However:

  • we could be able to get rid of ARRAY_TYPE by implementing multiple method for readAuto, readNormal, writeAuto etc.
  • we could be able to get rid of STORAGE_TYPE if we implement dedicated solution for offsets (similar abstraction to View just for offsets), but it would be significant amount of work and since our capacity is limited we have chosen current simple approach with mutable fields.
  • ARRAY_TRAITS will be necessary to be able to implement e.g. at() method but the default argument can make it easier for common cases.

Note that even now you should be able to use just Array<SomeCompound, ArrayType::AUTO> for all types which don't contain offsets and don't need special array traits.

@Mi-La
Copy link
Collaborator

Mi-La commented Jan 14, 2025

Another idea is that you could use your own typedef for the array type, something like:

using ArrayType = decltype(std::declval<View<SomeObject>>().arrayField());
void process(const ArrayType& array);

@MikeLankamp-TomTom
Copy link
Author

Views can store parameters because Views are generated, while Array is only a single template in runtime.

And what if we generate Arrays? Could keep ArrayBase (with the span) generic.

Since elements can use even the parameters from the compound which owns the Array, we store the View to the owner within the Array.

Right, and this last bit creates the problem: two arrays with the same value_type are not necessarily the same type. This is counter-intuitive and doesn't help usability. Thinking about it, copying the parameters into the Array isn't great (because you'd need a vector with parameters for each element, since that is the worst case). Let me think about this.

Then it would be easy to have it like:

As I mentioned, using templates whenever ANY Array is needed isn't nice for compilation time, readability, and doesn't work well if the code is split over header/source.

It seems weird to me to have implementation for one specific type in a source file (cpp).

Not really. Want to convert an Array of Position2Ds to some properietary Line2D type? That could be a utility method. Or some member of a class that needs to work on an Array, but also on member functions, then it needs to be a private method declared in the header, while the definition can be in the source file.

we could be able to get rid of ARRAY_TYPE by implementing multiple method for readAuto, readNormal, writeAuto etc.

Sure, or do read(ArrayType type) and write(ArrayType type). Either's fine.

we could be able to get rid of STORAGE_TYPE if we implement dedicated solution for offsets (similar abstraction to View just for offsets)

I understand capacity is an issue, and we could leave this as-is for now since there might be more important things to work on, but I would prefer this is changed before actual release.

Another idea is that you could use your own typedef for the array type, something like:

Yes, but it's all overhead. It feels like it shouldn't be necessary. I'd just like to explore the alternatives. The point of this new generator is also to make the using code easier to write and maintain.

ARRAY_TRAITS will be necessary to be able to implement e.g. at()

Why? It's just to get access to parameters to construct a View, right? If you generate Arrays like Views, that's not necessary anymore?

@mikir
Copy link
Collaborator

mikir commented Jan 15, 2025

The generation of the arrays is, of course, possible. However, this would present additional problems:

  • The generated source code could become very large in the case of numerous arrays.
  • The generated source code would be less readable.
  • Testing would become significantly more difficult. This matters because of our capacity. For example, the old C++11 generator currently has over 2000 integration tests implemented across more than 50,000 lines of code.

In all cases, such discussions are highly valuable because we are able to find out together better technical solution.

Because we cannot do everything immediately, I suggest the following approach:

  • We will continue with the current implementation to be able to finish this emitter as soon as possible.
  • We will discuss and create corresponding issue for each minor improvement.
  • Once the implementation is complete, we can prioritize and resolve the most critical issues from the backlog to improve the generator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants