# [Do not merge] Spike upgrading comps algorithm with taichi #236

Open · wants to merge 13 commits into base: master
## Conversation

@jeancochrane commented Apr 29, 2024

⚠️ I don't plan to merge this PR, but am requesting review for the sake of knowledge transfer. ⚠️

This PR tests out the use of taichi for the comps algorithm rather than numba.

Connects #229.

### Findings

See Benchmarking below for detailed stats comparing this approach to our existing code on different architectures. Some high-level takeaways:

- CUDA doesn't seem to make much of a difference, and is counterproductive if anything. This makes me wonder whether the algorithm needs to be redesigned to make better use of the GPU (note that numba has an entirely separate interface for CUDA programming), but I'm considering that question out of scope for now.
- There are big performance gains to be had by simply bumping the instance type with the existing numba code. If the numbers below hold, we could speed up the comps code by 2x by switching to c5.24xlarge instances, which are about twice as expensive as the m4.10xlarge instances we use now (meaning we should expect to break even on the change).
- At small scales (20k observations/10k comparisons), taichi appears to outperform numba, but the improvement disappears as the data scales up: at a large scale (100k observations/50k comparisons), the two perform about the same.
- As evidenced by the code in this PR, the taichi interface is harder to work with than numba's. Taichi is stricter about types and doesn't support some basic operations that Python does (most notably, getting the shape of an array and returning an array from a function), making the code more confusing and un-Pythonic.

### Benchmarking

#### 20k observations, 10k comparisons

| framework | instance type | arch | time | logs |
| --- | --- | --- | --- | --- |
| taichi | g5.12xlarge | x86 | 2.36s | link |
| taichi | g5.12xlarge | CUDA | 4.33s | link |
| taichi | m4.10xlarge | x86 | 4.44s | link |
| numba | g5.12xlarge | x86 | 6.07s | link |
| numba | m4.10xlarge | x86 | 10.52s | link |

#### 100k observations, 50k comparisons

| framework | instance type | arch | time | logs |
| --- | --- | --- | --- | --- |
| numba | g5.12xlarge | x86 | 31.87s | link |
| taichi | c5.24xlarge | x86 | 31.93s | link |
| taichi | m4.10xlarge | x86 | 34.09s | link |
| numba | c5.24xlarge | x86 | 37.31s | link |
| taichi | g5.12xlarge | x86 | 37.75s | link |
| taichi | g5.12xlarge | CUDA | 43.58s | link |
| numba | m4.10xlarge | x86 | 64.19s | link |

```dockerfile
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

ENV RENV_CONFIG_SANDBOX_ENABLED FALSE
ENV RENV_PATHS_LIBRARY renv/library
ENV RENV_PATHS_CACHE /setup/cache
```

Installing CUDA appears to be a pain, so for the purposes of benchmarking this PR I just switched to a base image that comes with CUDA installed and refactored the Dockerfile to only install dependencies necessary for running python/comps.py.

Comment on lines +108 to +109
```python
num_observations = observation_matrix.shape[0]
num_price_bins = price_bin_matrix.shape[0]
```

The `shape` and `len` attributes are not available inside taichi kernels, so the recommended approach for this kind of array metadata is to compute it outside the kernel and then pass it into the kernel as an argument.
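The pattern looks roughly like the following. This is a plain-numpy sketch with hypothetical names, not the PR's actual kernel; a real taichi version would decorate `extract_ids` with `@ti.kernel` and take `ti.ndarray` arguments.

```python
import numpy as np

def extract_ids(observation_matrix, num_observations, out):
    # num_observations is computed by the CALLER and passed in, because
    # .shape and len() are unavailable inside a taichi kernel.
    for i in range(num_observations):
        out[i] = observation_matrix[i, 0]

obs = np.array([[1, 100], [2, 200], [3, 300]])
ids = np.empty(obs.shape[0], dtype=int)
# Shape metadata is read outside the kernel, then passed as an argument
extract_ids(obs, obs.shape[0], ids)
```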

Comment on lines +100 to +102
```python
observation_matrix = observation_df[["id", "predicted_value"]].values
taichi_obs_ndarray = ti.ndarray(dtype=int, shape=observation_matrix.shape)
taichi_obs_ndarray.from_numpy(observation_matrix)
```

Taichi cannot automatically convert numpy arrays to the data structures that it works with (called fields), so we have to do the conversion manually before passing them into the kernel.

Comment on lines +111 to +112
```python
# Output vector
binned_vector = ti.ndarray(dtype=int, shape=(num_observations, 1))
```

Taichi also cannot return fields from kernels, so we need to define the output data structure ahead of time and then mutate it in the kernel function.
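In plain-Python terms the pattern is roughly the following (hypothetical names and toy binning logic; the real version allocates with `ti.ndarray` and mutates it inside an `@ti.kernel` function):

```python
import numpy as np

prices = np.array([100, 200, 120, 300])

# Output buffer allocated OUTSIDE the kernel, since kernels cannot
# return arrays
binned_vector = np.zeros((len(prices), 1), dtype=int)

def assign_bins(prices, binned_vector):
    # Mutates binned_vector in place instead of returning a new array
    for i in range(len(prices)):
        binned_vector[i, 0] = 0 if prices[i] < 150 else 1

assign_bins(prices, binned_vector)
```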

```python
for bin in price_bin_matrix:
    for obs_idx in range(num_observations):
        observation_price = observation_matrix[obs_idx, observation_price_idx]
        bin_found = False
```

`for`/`else` constructs are not supported in taichi, so we have to use a flag variable instead.
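For reference, here are the two equivalent patterns side by side in plain Python (toy data, not the PR's actual bins; taichi kernels only support the second form):

```python
price_bins = [(0, 100), (100, 200), (200, 300)]
price = 150

# Python for/else: the else branch runs only if the loop never breaks.
# Not supported inside taichi kernels.
for lo, hi in price_bins:
    if lo <= price < hi:
        matched = (lo, hi)
        break
else:
    matched = None

# Flag-variable equivalent, which taichi does support
bin_found = False
matched_flag = None
for lo, hi in price_bins:
    if not bin_found and lo <= price < hi:
        matched_flag = (lo, hi)
        bin_found = True
```

Both versions select the same bin; the flag version simply trades the implicit `break`/`else` control flow for an explicit boolean.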

Comment on lines -249 to -250
```python
top_n_idxs = np.full(num_comps, -1, dtype=idx_dtype)
top_n_scores = np.zeros(num_comps, dtype=score_dtype)
```

Arrays cannot be allocated inside a taichi kernel; they must be defined outside its scope and passed in as arguments. We therefore refactored this function to remove the dynamic allocation of arrays inside the function body.
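A plain-numpy sketch of that refactor shape (hypothetical signature and toy top-n logic; the real kernel is more involved): the caller allocates the buffers, and the kernel-style function only fills them.

```python
import numpy as np

def find_top_n(scores, top_n_idxs, top_n_scores):
    # Fills caller-allocated buffers instead of allocating arrays here,
    # which taichi kernels do not allow.
    num_comps = top_n_idxs.shape[0]
    order = np.argsort(scores)[::-1][:num_comps]
    for rank, idx in enumerate(order):
        top_n_idxs[rank] = idx
        top_n_scores[rank] = scores[idx]

num_comps = 2
scores = np.array([0.1, 0.9, 0.5])
# Buffers allocated OUTSIDE the kernel and passed in as arguments
top_n_idxs = np.full(num_comps, -1, dtype=np.int64)
top_n_scores = np.zeros(num_comps, dtype=np.float64)
find_top_n(scores, top_n_idxs, top_n_scores)
```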

@jeancochrane changed the title Spike upgrading comps algorithm with taichi [Do not merge] Spike upgrading comps algorithm with taichi May 1, 2024
@jeancochrane marked this pull request as ready for review May 1, 2024 16:20
@jeancochrane requested a review from dfsnow as a code owner May 1, 2024 16:20
@wagnerlmichael (Member) left a comment:


Nice catch on the instance bump! Thanks for sharing your findings. I wonder if it would be helpful to mention the CUDA interface to the ex-intern who might take a look at this, although I suspect their class is teaching CUDA interface programming.

I would also be happy to try the CUDA interface; I worked with it a bit in graduate school. I think trying CuPy could also be worthwhile.

@dfsnow (Member) left a comment:


Super cool work. Sad that it didn't speed things up; sorry for sending you down the rabbit hole. Let's leave this up, as I'd like to test it with a local GPU as well, just to make sure we're using full GPU capacity.
