[Do not merge] Spike upgrading comps algorithm with taichi #236
Conversation
```dockerfile
ENV RENV_CONFIG_SANDBOX_ENABLED FALSE
ENV RENV_PATHS_LIBRARY renv/library
ENV RENV_PATHS_CACHE /setup/cache
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
```
Installing CUDA appears to be a pain, so for the purposes of benchmarking this PR I just switched to a base image that comes with CUDA installed and refactored the Dockerfile to only install the dependencies necessary for running `python/comps.py`.
```python
num_observations = observation_matrix.shape[0]
num_price_bins = price_bin_matrix.shape[0]
```
`shape` and `len` attributes are not available inside taichi kernels, so the recommended approach for these kinds of array metadata is to compute them outside the kernel and then pass them into the kernel as arguments.
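As a rough illustration of that pattern, here is a minimal sketch (hypothetical names, not the code from this PR) that computes the row count on the host and hands it to the kernel as a scalar argument:

```python
import numpy as np
import taichi as ti

ti.init(arch=ti.cpu)  # ti.gpu when CUDA is available

@ti.kernel
def count_positive(obs: ti.types.ndarray(), num_observations: ti.i32) -> ti.i32:
    # obs.shape / len(obs) are off-limits in kernel scope, so the row
    # count arrives as a plain scalar argument instead.
    total = 0
    for i in range(num_observations):
        if obs[i, 1] > 0:
            total += 1  # accumulation in a parallel loop is atomic in taichi
    return total

observation_matrix = np.array([[1, 100], [2, 0], [3, 250]])
num_observations = observation_matrix.shape[0]  # computed outside the kernel
print(count_positive(observation_matrix, num_observations))  # 2
```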
```python
observation_matrix = observation_df[["id", "predicted_value"]].values
taichi_obs_ndarray = ti.ndarray(dtype=int, shape=observation_matrix.shape)
taichi_obs_ndarray.from_numpy(observation_matrix)
```
```python
# Output vector
binned_vector = ti.ndarray(dtype=int, shape=(num_observations, 1))
```
Taichi also cannot return fields from kernels, so we need to define the output data structure ahead of time and then mutate it in the kernel function.
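For instance, a sketch of that write-into-a-preallocated-buffer pattern might look like this (simplified, hypothetical names):

```python
import numpy as np
import taichi as ti

ti.init(arch=ti.cpu)

@ti.kernel
def flag_expensive(prices: ti.types.ndarray(), out: ti.types.ndarray(),
                   num_observations: ti.i32):
    # The kernel cannot return an array, so it mutates `out` in place.
    for i in range(num_observations):
        out[i, 0] = 1 if prices[i] > 100.0 else 0

n = 4
prices = ti.ndarray(dtype=ti.f32, shape=(n,))
prices.from_numpy(np.array([50.0, 150.0, 90.0, 300.0], dtype=np.float32))

# Output vector defined ahead of time, outside the kernel
binned_vector = ti.ndarray(dtype=ti.i32, shape=(n, 1))
flag_expensive(prices, binned_vector, n)
print(binned_vector.to_numpy())  # [[0], [1], [0], [1]]
```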
```python
for bin in price_bin_matrix:
    for obs_idx in range(num_observations):
        observation_price = observation_matrix[obs_idx, observation_price_idx]
        bin_found = False
```
`for`/`else` constructs are not supported in taichi, so we have to use a flag variable instead.
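To sketch the difference (hypothetical simplified binning logic, not the PR's actual code):

```python
import numpy as np
import taichi as ti

ti.init(arch=ti.cpu)

# Plain Python could use for/else to detect "no bin matched":
#
#     for b in range(num_price_bins):
#         if lower[b] <= price and price < upper[b]:
#             out[i] = b
#             break
#     else:
#         out[i] = -1
#
# In a taichi kernel the same logic needs an explicit flag:

@ti.kernel
def bin_prices(prices: ti.types.ndarray(), lower: ti.types.ndarray(),
               upper: ti.types.ndarray(), out: ti.types.ndarray(),
               num_observations: ti.i32, num_price_bins: ti.i32):
    for i in range(num_observations):
        bin_found = False
        for b in range(num_price_bins):
            if not bin_found and lower[b] <= prices[i] and prices[i] < upper[b]:
                out[i] = b
                bin_found = True
        if not bin_found:
            out[i] = -1

prices = np.array([5.0, 15.0, 99.0], dtype=np.float32)
lower = np.array([0.0, 10.0], dtype=np.float32)
upper = np.array([10.0, 20.0], dtype=np.float32)
out = np.empty(3, dtype=np.int32)
bin_prices(prices, lower, upper, out, 3, 2)
print(out)  # [0, 1, -1]
```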
```python
top_n_idxs = np.full(num_comps, -1, dtype=idx_dtype)
top_n_scores = np.zeros(num_comps, dtype=score_dtype)
```
Arrays cannot be defined inside a taichi kernel; they must instead be defined outside its scope and then passed in as arguments. Hence, we needed to refactor this function to remove the dynamic allocation of arrays inside the function body.
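A minimal sketch of what that refactor looks like, with the top-N buffers allocated by the caller (hypothetical simplified top-N selection, not the PR's scoring code):

```python
import numpy as np
import taichi as ti

ti.init(arch=ti.cpu)

@ti.kernel
def top_n_kernel(scores: ti.types.ndarray(), top_idxs: ti.types.ndarray(),
                 top_scores: ti.types.ndarray(),
                 num_candidates: ti.i32, num_comps: ti.i32):
    # No np.full / np.zeros in here: the buffers already exist and the
    # kernel only fills them in.
    ti.loop_config(serialize=True)  # the running top-N is shared state
    for c in range(num_candidates):
        # Replace the weakest current entry if this score beats it
        min_k = 0
        for k in range(num_comps):
            if top_scores[k] < top_scores[min_k]:
                min_k = k
        if scores[c] > top_scores[min_k]:
            top_scores[min_k] = scores[c]
            top_idxs[min_k] = c

scores = np.array([0.3, 0.9, 0.1, 0.7, 0.5], dtype=np.float32)
num_comps = 2

# Allocated outside the kernel, then passed in and mutated
top_n_idxs = np.full(num_comps, -1, dtype=np.int32)
top_n_scores = np.zeros(num_comps, dtype=np.float32)
top_n_kernel(scores, top_n_idxs, top_n_scores, scores.shape[0], num_comps)
print(top_n_idxs, top_n_scores)  # [3 1] [0.7 0.9] — the top two, unsorted
```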
Nice catch on the instance bump! Thanks for sharing your findings. I wonder if it would be helpful to mention the CUDA interface to the ex-intern who might take a look at this, although I suspect their class is already teaching CUDA interface programming.
I would also be happy to try the CUDA interface; I worked with it a bit in graduate school. I think trying CuPy could also be worthwhile.
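For what it's worth, a CuPy version of just the binning step might look roughly like this (a sketch assuming a CUDA-capable machine with cupy installed; array names are hypothetical):

```python
import numpy as np
import cupy as cp  # assumes a CUDA GPU and a matching cupy build

observation_prices = np.random.uniform(0, 1000, size=100_000)
bin_edges = np.linspace(0, 1000, 11)

# cupy mirrors the numpy API, so the code ports almost verbatim
prices_gpu = cp.asarray(observation_prices)   # host -> device copy
binned_gpu = cp.digitize(prices_gpu, cp.asarray(bin_edges))
binned = cp.asnumpy(binned_gpu)               # device -> host copy
```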
Super cool work. Sad that it didn't speed things up, and sorry for sending you down the rabbit hole. Let's leave this up, as I'd like to test it with a local GPU as well, just to make sure we're using full GPU capacity.
This PR tests out the use of taichi for the comps algorithm rather than numba.
Connects #229.
Findings

See Benchmarking below for detailed stats comparing this approach to our existing code on different architectures. Some high-level takeaways:

Benchmarking

20k observations, 10k comparisons

100k observations, 50k comparisons