Background
In recommender system training, the user/item/history features can be extremely large in production. HPS, acting as a multi-level cache, can store such large sparse parameters well, at the cost of cache misses. If the number of keys in a single query grows too large for the HPS, the latency may increase significantly.
Question
Is there an N-step pipeline mechanism in HugeCTR that makes the device cache always hold the lookup results for the next M steps?
Example
Assume that at the current step N = 100, the HPS has the following state:
Device: [0,2,4,6], Host: [0,1,2,3,4], Disk: [0,1,2,3,4,5,6,7,8,9,10,100]
Now a lookup request [0,1,2,3,100,500] arrives. The device will miss [1,3,100,500], the host will miss [100,500], and the disk will miss [500].
The latency grows as the cache-miss rate rises.
Instead, if there were a mechanism that lets the device predict and hold [0,1,2,3,100,500] before the lookup request arrives, whether actively or passively, then no cache misses would occur, which would clearly reduce the latency.
For example, if M = 10, the HPS would be told to prefetch [0,1,2,3,100,500] at step N - M = 90.
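To make the idea concrete, here is a minimal Python sketch of an M-step-ahead prefetch pipeline over a toy device/host/disk hierarchy. This is not an existing HugeCTR/HPS API; MultiLevelCache, lookup, and prefetch are hypothetical names used only to illustrate the mechanism being asked about.

```python
# Toy model of the proposed M-step-ahead prefetch (not an existing HugeCTR/HPS
# API; MultiLevelCache, lookup() and prefetch() are hypothetical names).
from collections import deque

class MultiLevelCache:
    """Toy device/host/disk hierarchy; the real HPS levels are GPU cache, CPU memory, SSD."""
    def __init__(self, device, host, disk):
        self.device, self.host, self.disk = set(device), set(host), set(disk)

    def lookup(self, keys):
        """Resolve keys level by level and report the misses seen at each level."""
        keys = set(keys)
        device_miss = keys - self.device
        host_miss = device_miss - self.host
        disk_miss = host_miss - self.disk
        return device_miss, host_miss, disk_miss

    def prefetch(self, keys):
        """Pull keys into the device cache ahead of time (eviction ignored here)."""
        self.device |= set(keys)

cache = MultiLevelCache(device=[0, 2, 4, 6],
                        host=[0, 1, 2, 3, 4],
                        disk=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

# Without prefetching, the step-100 request misses at every level:
# device misses {1, 3, 100, 500}, host misses {100, 500}, disk misses {500}.
print(cache.lookup([0, 1, 2, 3, 100, 500]))

M = 10
future_batches = deque()                         # batches known/predicted M steps ahead
future_batches.append([0, 1, 2, 3, 100, 500])    # enqueued around step N - M = 90

# At step 90 the pipeline prefetches the keys that step 100 will need ...
cache.prefetch(future_batches.popleft())
# ... so at step 100 the lookup is served entirely from the device cache.
print(cache.lookup([0, 1, 2, 3, 100, 500]))      # (set(), set(), set())
```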
Lifann changed the title from "[Question] Is there pipeline mechanism to help the lookup requests always be blocked on device cache in HugeCTR?" to "[Question] Is there pipeline mechanism to help the lookup requests always be handled on device cache in HugeCTR?" on Dec 21, 2023.
@Lifann Thanks for your feedback. To answer your question accurately, let me first clarify that the current GPU cache in HPS is only used for recommender-system inference, so the prefetching mechanism you describe is difficult to implement in high-concurrency inference scenarios. However, we have implemented a high-performance, lock-free GPU cache for inference that supports concurrent lookup and insertion, and it will be released in the near future.
For training, we have already implemented the prefetching mechanism you suggested in the ETC (Embedding Training Cache). Regarding the difference between ETC and HPS, you can refer to #424.
However, the ETC has been deprecated and will be replaced by HierarchicalKV for training with hierarchical memory. In addition, we have integrated HKV into SOK, so training can run seamlessly on the TensorFlow platform. For how to use SOK and HKV, you can also refer to the examples provided in #424.
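For reference, below is a rough sketch of what training an HKV-backed embedding through SOK on TensorFlow might look like, pieced together from the SOK examples. The calls shown (sok.init, sok.DynamicVariable with var_type="hybrid", sok.lookup_sparse, sok.OptimizerWrapper) and their arguments are assumptions here and should be verified against the examples referenced in #424.

```python
# Rough sketch, not verified code: training a HierarchicalKV-backed embedding
# with SOK on TensorFlow. API names and arguments are assumed from SOK examples.
import tensorflow as tf
import sparse_operation_kit as sok

sok.init()

# A dynamic embedding variable; var_type="hybrid" is assumed to select the
# HierarchicalKV backend so embeddings can spill from GPU memory to host memory.
emb_var = sok.DynamicVariable(dimension=16, var_type="hybrid")

optimizer = sok.OptimizerWrapper(tf.keras.optimizers.SGD(learning_rate=1.0))

def train_step(keys, labels):
    with tf.GradientTape() as tape:
        # Sparse lookup on the dynamic variable; keys can be arbitrary int64 ids.
        emb = sok.lookup_sparse([emb_var], [keys], combiners=["sum"])[0]
        logits = tf.reduce_sum(emb, axis=1, keepdims=True)
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
    grads = tape.gradient(loss, [emb_var])
    optimizer.apply_gradients(zip(grads, [emb_var]))
    return loss

keys = tf.ragged.constant([[0, 1, 2], [3, 100, 500]], dtype=tf.int64)
labels = tf.constant([[1.0], [0.0]])
print(train_step(keys, labels))
```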