Background
In recommender system training, the user/item/history features can be extremely large in production. HPS, acting as a multi-level cache, can store such large sparse parameters well, at the cost of cache misses. If the number of keys in a single query grows too large for the HPS, the latency may increase significantly.
Question
Is there an N-step pipeline mechanism in HugeCTR that makes the device cache always hold the lookup results for the next M steps?
Example
Assume that at the current step N = 100, the HPS has the following state:
Device: [0,2,4,6], Host: [0,1,2,3,4], Disk: [0,1,2,3,4,5,6,7,8,9,10,100]
Now a lookup request [0,1,2,3,100,500] arrives. The device will miss [1,3,100,500], the host will miss [100,500], and the disk will miss [500].
The latency grows as the cache-miss rate rises.
Instead, if there were a mechanism that lets the device predict and hold [0,1,2,3,100,500] before the lookup request arrives, whether actively or passively, then no cache misses would occur, which would clearly reduce the latency.
For example, if M = 10, the HPS would be told to prefetch [0,1,2,3,100,500] at step N - M = 90.
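To make the idea concrete, here is a minimal Python sketch of an M-step-ahead prefetch pipeline over a toy device/host/disk hierarchy. This is not an existing HugeCTR/HPS API; MultiLevelCache, lookup, and prefetch are hypothetical names used only to illustrate the mechanism being asked about.

```python
# Toy model of the proposed M-step-ahead prefetch (not an existing HugeCTR/HPS
# API; MultiLevelCache, lookup() and prefetch() are hypothetical names).
from collections import deque

class MultiLevelCache:
    """Toy device/host/disk hierarchy; the real HPS levels are GPU cache, CPU memory, SSD."""
    def __init__(self, device, host, disk):
        self.device, self.host, self.disk = set(device), set(host), set(disk)

    def lookup(self, keys):
        """Resolve keys level by level and report the misses seen at each level."""
        keys = set(keys)
        device_miss = keys - self.device
        host_miss = device_miss - self.host
        disk_miss = host_miss - self.disk
        return device_miss, host_miss, disk_miss

    def prefetch(self, keys):
        """Pull keys into the device cache ahead of time (eviction ignored here)."""
        self.device |= set(keys)

cache = MultiLevelCache(device=[0, 2, 4, 6],
                        host=[0, 1, 2, 3, 4],
                        disk=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

# Without prefetching, the step-100 request misses at every level:
# device misses {1, 3, 100, 500}, host misses {100, 500}, disk misses {500}.
print(cache.lookup([0, 1, 2, 3, 100, 500]))

M = 10
future_batches = deque()                         # batches known/predicted M steps ahead
future_batches.append([0, 1, 2, 3, 100, 500])    # enqueued around step N - M = 90

# At step 90 the pipeline prefetches the keys that step 100 will need ...
cache.prefetch(future_batches.popleft())
# ... so at step 100 the lookup is served entirely from the device cache.
print(cache.lookup([0, 1, 2, 3, 100, 500]))      # (set(), set(), set())
```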
Lifann changed the title from "[Question] Is there pipeline mechanism to help the lookup requests always be blocked on device cache in HugeCTR?" to "[Question] Is there pipeline mechanism to help the lookup requests always be handled on device cache in HugeCTR?" on Dec 21, 2023.
@Lifann Thanks for your feedback. To answer your question accurately, let me first clarify that the current GPU cache in HPS is only used for recommender-system inference, so the prefetching mechanism you describe is difficult to implement in high-concurrency inference scenarios. However, we have implemented a high-performance, lock-free GPU cache for inference that supports concurrent lookup and insertion, and it will be released in the near future.
For training, we have already implemented the prefetching mechanism you suggested in the ETC (Embedding Training Cache). Regarding the difference between ETC and HPS, you can refer to #424.
However, the ETC has been deprecated and will be replaced by HierarchicalKV for training with hierarchical memory. In addition, we have integrated HKV into SOK, so training can run seamlessly on the TensorFlow platform. For how to use SOK and HKV, you can also refer to the examples provided in #424.
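For reference, below is a rough sketch of what training an HKV-backed embedding through SOK on TensorFlow might look like, pieced together from the SOK examples. The calls shown (sok.init, sok.DynamicVariable with var_type="hybrid", sok.lookup_sparse, sok.OptimizerWrapper) and their arguments are assumptions here and should be verified against the examples referenced in #424.

```python
# Rough sketch, not verified code: training a HierarchicalKV-backed embedding
# with SOK on TensorFlow. API names and arguments are assumed from SOK examples.
import tensorflow as tf
import sparse_operation_kit as sok

sok.init()

# A dynamic embedding variable; var_type="hybrid" is assumed to select the
# HierarchicalKV backend so embeddings can spill from GPU memory to host memory.
emb_var = sok.DynamicVariable(dimension=16, var_type="hybrid")

optimizer = sok.OptimizerWrapper(tf.keras.optimizers.SGD(learning_rate=1.0))

def train_step(keys, labels):
    with tf.GradientTape() as tape:
        # Sparse lookup on the dynamic variable; keys can be arbitrary int64 ids.
        emb = sok.lookup_sparse([emb_var], [keys], combiners=["sum"])[0]
        logits = tf.reduce_sum(emb, axis=1, keepdims=True)
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
    grads = tape.gradient(loss, [emb_var])
    optimizer.apply_gradients(zip(grads, [emb_var]))
    return loss

keys = tf.ragged.constant([[0, 1, 2], [3, 100, 500]], dtype=tf.int64)
labels = tf.constant([[1.0], [0.0]])
print(train_step(keys, labels))
```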