-
-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ranef
function for accumulating random effects without getting temporaries
#2237
Comments
How would we generalize this? One issue I see here is that we are going to need either a lot of explicit functions or some clever tuples and parameter packs scheme. I think an alternative approach here would be to allow sparse data into the language. Then all this sorting would just be done in the sparse matrix and the actual function itself would stay the same.
|
I always thought this is why we want an efficient sparse matrix stuff. The looping is just a hack and sparse matrix is the much better alternative here. |
Well there are a lot of options, that's part of the question. We could do something like the Maybe we could get around new user-facing functions or syntax by trying to automatically recognize terms with the compiler (and just forward compatible things to a special function) or maybe we could do it all with expressions (which is what I was asking @t4c1 about).
I don't see these things as competing and definitely don't think of either thing as a hack. I like having my variables broken out explicitly and not packed into one big vector. I have also heard from someone who prefers doing their random effects with the sparse matrix stuff that is currently in Stan. As far as Stan's use as an intermediate language, those use cases can do whatever is faster. But as far as Stan models used directly, both methods have different merits. As far as performance, they're doing very similar things, and will benefit from similar optimizations. I vaguely remember that the current sparse implementation is regarded as inefficient because it is full of checks and stuff (implementation here). To get a handle on this, I wrote and benchmarked the above model with a sparse matrix multiply that doesn't do any checks. Here is the code I used. The signature is meant to match the compressed row storage multiply in Stan currently: template <typename E, typename T,
require_eigen_t<E>* = nullptr,
require_var_matrix_t<T>* = nullptr>
inline T csr_matrix_times_vector_varmat(int N,
int L,
const E& w,
const std::vector<int>& v,
const std::vector<int>& u,
const T& b,
std::ostream*) {
typename T::value_type res_val(N);
arena_t<E> aw(w.size());
arena_t<std::vector<int>> av(v.size());
arena_t<std::vector<int>> au(u.size());
size_t idx = 0;
for(size_t n = 0; n < N; ++n) {
double val = 0.0;
size_t L = u[n + 1] - u[n];
for(size_t l = 0; l < L; ++l) {
val += w.coeff(idx) * b.val().coeff(v[idx] - 1);
aw.coeffRef(idx) = w.coeff(idx);
av[idx] = v[idx];
idx++;
}
au[n] = u[n];
res_val.coeffRef(n) = val;
}
au[au.size() - 1] = u[u.size() - 1];
T res = res_val;
reverse_pass_callback([N, aw, av, au, b, res]() mutable {
size_t idx = 0;
for(size_t n = 0; n < N; ++n) {
double adj = res.adj().coeffRef(n);
size_t L = au[n + 1] - au[n];
for(size_t l = 0; l < L; ++l) {
b.adj().coeffRef(av[idx] - 1) += aw.coeff(idx) * adj;
idx++;
}
}
});
return res;
} I ran 100 warmup and 100 draws of versions of the mrp model over here. There is a version of the model Using the The sparse matrix is presumably slower here because it's shuffling an extra variable through memory (the w vector -- since these are random effects with no intercepts the Both codes would get faster if data did not need saved into the arena for the reverse mode pass. Presumably this could be handled in both cases by storing variables from data and transformed data in arena variables so the code knows they'll be available on the reverse pass without copying. With a sparse matrix type we could probably avoid ever doing bounds checks after the data is read in. With In both cases, for additional performance we might consider lumping these things into functions like the Anyway that's a lot of stuff to say that I think these methods are complementary so I don't think sparse matrices outmodes this or anything. Here is code for building the sparse versions of the data from the non-sparse versions: make_sparse_data = function(data) {
L = data$N_age + data$N_educ + data$N_educ_age + data$N_educ_eth + data$N_eth + data$N_male_eth + data$N_state
NNZ = 7 * data$N
w = rep(1.0, NNZ)
v = c()
u = rep(0, data$N + 1)
for(n in 1:data$N) {
v = c(v, c(data$J_age[n],
data$N_age + data$J_educ[n],
data$N_age + data$N_educ + data$J_educ_age[n],
data$N_age + data$N_educ + data$N_educ_age + data$J_educ_eth[n],
data$N_age + data$N_educ + data$N_educ_age + data$N_educ_eth + data$J_eth[n],
data$N_age + data$N_educ + data$N_educ_age + data$N_educ_eth + data$N_eth + data$J_male_eth[n],
data$N_age + data$N_educ + data$N_educ_age + data$N_educ_eth + data$N_eth + data$N_male_eth + data$J_state[n]))
u[n] = 7 * (n - 1) + 1
}
u[data$N + 1] = NNZ + 1
list(N = data$N,
y = data$y,
K = data$K,
X = data$X,
L = L,
NNZ = NNZ,
w = w,
v = v,
u = u,
N_age = data$N_age,
N_educ = data$N_educ,
N_educ_age = data$N_educ_age,
N_educ_eth = data$N_educ_eth,
N_eth = data$N_eth,
N_male_eth = data$N_male_eth,
N_state = data$N_state,
prior_scale_sd = data$prior_scale_sd,
prior_scale_Intercept = data$prior_scale_Intercept,
prior_scale_b = data$prior_scale_b)
} Edit: Oh yeah here's the Eigen sparse matrix vector multiply for comparison: https://github.com/libigl/eigen/blob/master/Eigen/src/SparseCore/SparseDenseProduct.h#L95 There's a lot less fancy stuff. The random memory access makes it hard to do anything. |
I like this idea, so I put together a rough implementation for a generalised approach: https://godbolt.org/z/foP8xY It's not a reverse-mode specialisation like Ben's above, but it's more flexible in both the number of inputs and the function that can be applied. I've instead made an So let's say you have four vectors, and each has an std::vector of indexes. These indexes just need to be packed into an
Using this approach, the
EDIT: here's an example showing a ranef implementation in use: https://godbolt.org/z/KzenTv |
@bbbales2 I think your implementation in the top post is the best we can do performance-wise. We just need to add variadic templates to accept arbitrary many of (values, indices) pairs. Personally I would prefer values to be before indices in each pair. Also I like the name |
We can not use expressions here, as Eigen version we are using does not support indexing with vectors yet. Even if we used a version that does support that I don't think it would be any faster, as indexing makes memory accesses non-contiguous, preventing vectorization. |
@bbbales2 how did you get the per gradient timings in the model? |
@andrjohns that is neat that @t4c1 bummer on expressions. I guess we'd have to do something in the compiler to recognize the pattern, extra syntax, or just have a plain-ol' giant function. @SteveBronder some variation of:
You have to run it a while to avoid timing the overhead. In this case I was timing 100 post-warmup draws of the mrp_all model. |
Also to be clear, this isn't something I want to try to work out for 2.26 or anything. Just something I wanted to benchmark and revisit since the varmat stuff is coming alive. |
Description
For a linear model that looks like:
we can get a speed boost by doing all the random effects terms together and avoiding temporaries from the multi-indexes (
r1[i1]
), the elementwise multiplications, and the additions.With this model: https://github.com/bbbales2/computations_in_glms/blob/master/benchmark_model/mrp_varmat.stan
Replacing the code:
with:
Makes the per-gradient varmat timings go from about 9ms on my computer to about 7ms when using the
mrp_all.json
dataset in that repo. There's a discussion over here of a fancierranef
function.The very rough c++ implementation of
ranef
is:@t4c1 should we implement a
ranef
function like this? Or can we achieve the same thing transparently with expressions? Is the answer will be different format<var>
andvar<mat>
?Current Version:
v3.4.0
The text was updated successfully, but these errors were encountered: