You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The ColumnGeneratorCachedByIndex is recommended for new cached column generators, but it can be significantly slower than the not-recommended way of first creating a ColumnGenerator and then adding cache by wrapping with IndexCachedColumnGenerator.
The reason is that IndexCachedColumnGenerator will find all non-cached values and then process them at once (i.e., batch-wise), whereas the ColumnGeneratorCachedByIndex will always loop through all values. Thus, for an initial filling of the cache this can be much slower.
Not sure what to do here - one would need to redesign the ColumnGeneratorCachedByIndex to not use _generate_value, but that's a breaking change. Another way would be to write a new class a la VectorizedColumnGeneratorCachedByIndex, but I honestly feel like batch-wise processing of missing values should be the default behavior
The text was updated successfully, but these errors were encountered:
You are assuming that a vectorised computation mechanism for the value generation exists. In many applications, this won't be the case, and the reason for "ColumnGeneratorCachedByIndex is encouraged" is that it provides a more friendly interface for the case where the computation mechanism operates on a per item/row basis, i.e. creating one value given one row of input.
I think what we need to change is primarily the recommendation, i.e. it should be qualified.
Ok, I guess that's reasonable, since using the other way works well with vectorized columngens. Do you want to adjust the docstring? I feel like you usually don't like my style of writing too much ^^
The
ColumnGeneratorCachedByIndex
is recommended for new cached column generators, but it can be significantly slower than the not-recommended way of first creating a ColumnGenerator and then adding cache by wrapping withIndexCachedColumnGenerator
.The reason is that
IndexCachedColumnGenerator
will find all non-cached values and then process them at once (i.e., batch-wise), whereas theColumnGeneratorCachedByIndex
will always loop through all values. Thus, for an initial filling of the cache this can be much slower.Not sure what to do here - one would need to redesign the
ColumnGeneratorCachedByIndex
to not use_generate_value
, but that's a breaking change. Another way would be to write a new class a laVectorizedColumnGeneratorCachedByIndex
, but I honestly feel like batch-wise processing of missing values should be the default behaviorThe text was updated successfully, but these errors were encountered: