HugeCTR is a GPU-accelerated framework designed to distribute training across multiple GPUs and nodes and estimate click-through rates (CTRs). HugeCTR supports model-parallel embedding tables and data-parallel neural networks and their variants such as Wide and Deep Learning (WDL), Deep Cross Network (DCN), DeepFM, and Deep Learning Recommendation Model (DLRM). HugeCTR is a component of NVIDIA Merlin Open Beta. NVIDIA Merlin is used for building large-scale recommender systems, which require massive datasets to train, particularly for deep learning based solutions.
To prevent data loading from becoming a major bottleneck during training, HugeCTR contains a dedicated data reader that is inherently asynchronous and multi-threaded. It reads a batched set of data records in which each record consists of high-dimensional, extremely sparse (or categorical) features. Each record can also include dense numerical features, which can be fed directly to the fully connected layers. An embedding layer is used to compress the sparse input features to lower-dimensional, dense embedding vectors. There are three GPU-accelerated embedding stages (illustrated by the sketch after this list):
- Table lookup
- Weight reduction within each slot
- Weight concatenation across the slots
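The three stages can be pictured with a minimal NumPy sketch (an illustration only, not HugeCTR code); the table size, slot contents, and sum combiner below are arbitrary assumptions:

```python
# Toy illustration of the three GPU-accelerated embedding stages.
import numpy as np

vocab_size, embed_dim = 1000, 8
table = np.random.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))  # embedding table

# One sample with two slots (feature fields), each holding a few category indices.
slots = [np.array([12, 87, 400]), np.array([3])]

# Stage 1: table lookup -- gather the embedding vector for every key.
looked_up = [table[keys] for keys in slots]

# Stage 2: weight reduction within each slot (a "sum" combiner here).
reduced = [vectors.sum(axis=0) for vectors in looked_up]

# Stage 3: weight concatenation across the slots -> dense input for the MLP.
dense_embedding = np.concatenate(reduced)
print(dense_embedding.shape)  # (16,) == num_slots * embed_dim
```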
To enable large embedding training, the embedding table in HugeCTR is model parallel and distributed across all GPUs in a homogeneous cluster, which consists of multiple nodes. Each GPU has its own:
- feed-forward neural network (data parallelism) to estimate CTRs.
- hash table to make the data preprocessing easier and enable dynamic insertion.
Embedding initialization is not required before training takes place since the input training data are hash values (64-bit signed integer type) instead of original indices. A <key,value> pair, where the value is a small random weight, is inserted at runtime only when a new key appears in the training data and cannot be found in the hash table.
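As a rough mental model (a simplified Python sketch, not the actual GPU hash table), the dynamic insertion behaves like this:

```python
# Simplified sketch of dynamic insertion: a weight is created only the first
# time a hashed key is encountered in the training data.
import numpy as np

embed_dim = 16
hash_table = {}  # maps a 64-bit signed hash key to an embedding vector

def lookup_or_insert(key: int) -> np.ndarray:
    if key not in hash_table:
        # New key: insert a <key, value> pair with a small random weight.
        hash_table[key] = np.random.uniform(-0.05, 0.05, size=embed_dim)
    return hash_table[key]

vector = lookup_or_insert(-6917529027641081856)  # any 64-bit signed hash value
```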
You can either install HugeCTR easily using the Merlin Docker image in NGC, or build HugeCTR from scratch using various build options if you're an advanced user.
We support the following compute capabilities:
Compute Capability | GPU | SM |
---|---|---|
6.0 | NVIDIA P100 (Pascal) | 60 |
7.0 | NVIDIA V100 (Volta) | 70 |
7.5 | NVIDIA T4 (Turing) | 75 |
8.0 | NVIDIA A100 (Ampere) | 80 |
For more information about our support matrix, refer to Support Matrix.
All NVIDIA Merlin components are available as open source projects. However, a more convenient way to utilize these components is by using our Merlin NGC containers. These containers allow you to package your software application, libraries, dependencies, and runtime compilers in a self-contained environment. When installing HugeCTR using NGC containers, the application environment remains portable, consistent, reproducible, and agnostic to the underlying host system's software configuration.
HugeCTR is included in the Merlin Docker containers that are available from the NVIDIA container repository. You can query the collection for containers that match the HugeCTR label. The following table also identifies the containers:
Container Name | Container Location | Functionality |
---|---|---|
merlin-inference | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-inference | NVTabular, HugeCTR, and Triton Inference |
merlin-training | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-training | NVTabular and HugeCTR |
merlin-tensorflow-training | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow-training | NVTabular, TensorFlow, and HugeCTR Tensorflow Embedding plugin |
To use these Docker containers, you'll first need to install the NVIDIA Container Toolkit to provide GPU support for Docker. You can use the NGC links referenced in the table above to obtain more information about how to launch and run these containers.
The following sample command pulls and starts the Merlin Training container:
# Run the container in interactive mode
$ docker run --gpus=all --rm -it --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-training:22.03
To build HugeCTR from scratch, refer to Build HugeCTR from source code.
As of HugeCTR version 3.1, training can no longer be performed using the command line and a JSON configuration file. Instead, you construct the model graph and train the model with the HugeCTR Python interface. It is worth mentioning that branch topologies are inherently supported by the HugeCTR model graph, and extra layers are abstracted away by the Python interface. To learn how to construct a model graph with branches, refer to Getting Started. For more information about how to use the HugeCTR Python API and understand its signatures, refer to Python Interface.
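A condensed sketch of such a Python training script is shown below. It is loosely modeled on the public DCN sample; the paths, sizes, and layer settings are placeholders, and exact argument names can differ between HugeCTR releases, so treat it as an outline rather than a copy-paste recipe.

```python
import hugectr

solver = hugectr.CreateSolver(batchsize=1024, lr=0.001, vvgpu=[[0]],
                              repeat_dataset=True)
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Norm,
    source=["./dcn_data/file_list.txt"],
    eval_source="./dcn_data/file_list_test.txt",
    check_type=hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.SGD)

model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(
    label_dim=1, label_name="label", dense_dim=13, dense_name="dense",
    data_reader_sparse_param_array=[
        hugectr.DataReaderSparseParam("data1", 2, False, 26)]))
model.add(hugectr.SparseEmbedding(
    embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
    workspace_size_per_gpu_in_mb=75, embedding_vec_size=16, combiner="sum",
    sparse_embedding_name="sparse_embedding1", bottom_name="data1",
    optimizer=optimizer))
# ... add Reshape / InnerProduct / MultiCross layers and the loss here ...
model.compile()
model.summary()
model.fit(max_iter=2000, display=200, eval_interval=1000)
```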
In addition to single-node and full precision training, HugeCTR supports a variety of features including the following:
- Model Parallel Training
- Multi-Node Training
- Mixed Precision Training
- SGD Optimizer and Learning Rate Scheduling
- Embedding Training Cache
- HugeCTR to ONNX Converter
- Hierarchical Parameter Server
NOTE: Multi-node training and mixed precision training can be used simultaneously.
HugeCTR natively supports both model parallel and data parallel training, making it possible to train very large models on GPUs. Features and categories of embeddings can be distributed across multiple GPUs and nodes. For example, if you have two nodes with 8xA100 80GB GPUs, you can train models that are as large as 1TB fully on GPU. By using the embedding training cache, you can train even larger models on the same nodes.
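For instance, the GPU layout of such a two-node, eight-GPU-per-node job is expressed through the vvgpu (vector of vectors of device IDs) argument of CreateSolver; the other values below are placeholder assumptions:

```python
import hugectr

# Two nodes, eight GPUs each: the embedding table is sharded across all 16
# GPUs (model parallel), while each GPU keeps a full copy of the dense model.
solver = hugectr.CreateSolver(
    batchsize=16384,
    vvgpu=[[0, 1, 2, 3, 4, 5, 6, 7],   # GPU IDs on node 0
           [0, 1, 2, 3, 4, 5, 6, 7]],  # GPU IDs on node 1
    repeat_dataset=True)
```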
To achieve the best performance with different embeddings, use the embedding layer implementation that matches your case. Each implementation targets a different practical training scenario, as described below (a selection sketch follows this list):
- LocalizedSlotEmbeddingHash: The features in the same slot (feature field) are stored on one GPU, which is why it's referred to as a "localized slot", and different slots may be stored on different GPUs according to the index number of the slot. LocalizedSlotEmbedding is optimized for instances where each embedding is smaller than the memory size of the GPU. Because LocalizedSlotEmbedding performs a local reduction for each slot with no global reduction between GPUs, its overall data transaction is much less than that of DistributedSlotEmbedding.

  Note: Make sure that there aren't any duplicated keys in the input dataset.

- DistributedSlotEmbeddingHash: All the features, which are located in different feature fields (slots), are distributed to different GPUs according to the index number of the feature, regardless of the slot index number. That means the features in the same slot may be stored on different GPUs, which is why it's referred to as a "distributed slot". Since a global reduction is required, DistributedSlotEmbedding was developed for cases where the embeddings are larger than the memory size of the GPU, and it incurs many more memory transactions between GPUs.

  Note: Make sure that there aren't any duplicated keys in the input dataset.

- LocalizedSlotEmbeddingOneHot: A specialized LocalizedSlotEmbedding that requires one-hot data input. Each feature field must also be indexed from zero. For example, gender: 0,1; 1,2 wouldn't be considered correctly indexed.
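In the Python interface, the implementation is selected when the embedding layer is added to the model. The sketch below assumes the enum spellings used in recent releases (which include "Sparse" in the names) and uses placeholder sizes:

```python
import hugectr

# Pick one of the implementations described above.
embedding_type = hugectr.Embedding_t.LocalizedSlotSparseEmbeddingHash
# Alternatives: hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
#               hugectr.Embedding_t.LocalizedSlotSparseEmbeddingOneHot

embedding = hugectr.SparseEmbedding(
    embedding_type=embedding_type,
    workspace_size_per_gpu_in_mb=75,  # placeholder workspace size
    embedding_vec_size=16,
    combiner="sum",
    sparse_embedding_name="sparse_embedding1",
    bottom_name="data1",
    optimizer=hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.SGD))
```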
Multi-node training makes it easy to train an embedding table of arbitrary size. In a multi-node solution, the sparse model, which is referred to as the embedding layer, is distributed across the nodes. Meanwhile, the dense model, such as a DNN, is data parallel, with a copy in each GPU (see Fig. 2). With our implementation, HugeCTR leverages NCCL for high-speed and scalable inter-node and intra-node communication.
To run with multiple nodes, HugeCTR should be built with OpenMPI. GPUDirect RDMA support is recommended for high performance. For more information, refer to our DCN multi-node training sample.
Mixed precision training is supported to improve throughput and reduce the memory footprint. In this mode, TensorCores are used to boost performance for matrix multiplication-based layers, such as FullyConnectedLayer and InteractionLayer, on Volta, Turing, and Ampere architectures. For the other layers, including embeddings, the data type is changed to FP16 so that both memory bandwidth and capacity are saved. When mixed_precision is set, the full FP16 pipeline is triggered and loss scaling is applied to avoid arithmetic underflow (see Fig. 5). Mixed precision training is enabled by specifying the mixed_precision option in the training configuration.
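With the Python interface, this is typically done when creating the solver; a minimal sketch, assuming the use_mixed_precision and scaler argument names from recent releases:

```python
import hugectr

solver = hugectr.CreateSolver(
    batchsize=1024,
    vvgpu=[[0]],
    use_mixed_precision=True,  # turns on the full FP16 pipeline
    scaler=1024,               # loss-scaling factor to avoid FP16 underflow
    repeat_dataset=True)
```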
Learning rate scheduling allows users to configure its hyperparameters, which include the following:
- learning_rate: Base learning rate.
- warmup_steps: Number of initial steps used for warm-up.
- decay_start: Specifies when the learning rate decay starts.
- decay_steps: Decay period (in steps).
Fig. 6 illustrates how these hyperparameters interact with the actual learning rate.
For more information, refer to Python Interface.
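As a hedged sketch, these hyperparameters map onto CreateSolver arguments in the Python interface (the base learning rate is passed as lr; exact names may vary by release):

```python
import hugectr

solver = hugectr.CreateSolver(
    batchsize=1024,
    vvgpu=[[0]],
    lr=0.001,            # base learning rate
    warmup_steps=1000,   # warm-up over the first 1000 iterations
    decay_start=10000,   # iteration at which the decay begins
    decay_steps=40000,   # decay period (in steps)
    repeat_dataset=True)
```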
Embedding Training Cache (ETC) gives you the ability to train models up to terabyte scale. It's implemented by loading a subset of an embedding table, which exceeds the aggregated capacity of GPU memory, into the GPU in a coarse-grained, on-demand manner during the training stage.
This feature currently supports both single-node and multi-node training. It supports all embedding types and can be used with the Norm, Raw, and Parquet dataset formats. We revised our criteo2hugectr tool to support key set extraction for the Criteo dataset. For more information, refer to HugeCTR Embedding Training Cache and the Continuous Training Notebook to learn how to use this feature with the Criteo dataset.
NOTE: The Criteo dataset is a common use case, but embedding training cache is not limited to this dataset.
The HugeCTR to Open Neural Network Exchange (ONNX) converter (hugectr2onnx) is a Python package that can convert HugeCTR models to ONNX. It can improve the compatibility of HugeCTR with other deep learning frameworks since ONNX serves as an open-source format for AI models.
After training with the HugeCTR Python APIs, you can get the files for dense models, sparse models, and graph configurations, which are required as inputs for the hugectr2onnx.converter.convert API. Each HugeCTR layer corresponds to one or several ONNX operators, and the trained model weights are loaded as initializers in the ONNX graph. You can convert both dense and sparse models or only dense models. For more information, refer to ONNX Converter and hugectr2onnx_demo.ipynb.
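A minimal conversion sketch is shown below; the file names are placeholders and the keyword arguments follow the hugectr2onnx documentation, so verify them against your installed version:

```python
import hugectr2onnx

hugectr2onnx.converter.convert(
    onnx_model_path="dcn.onnx",
    graph_config="dcn.json",              # saved model graph configuration
    dense_model="dcn_dense_2000.model",   # trained dense weights
    convert_embedding=True,               # also convert the sparse model
    sparse_models=["dcn0_sparse_2000.model"])
```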
A hierarchical storage mechanism has been implemented between the local SSDs and CPU memory on the HugeCTR Hierarchical Parameter Server (POC). With this implementation, the embedding table no longer has to be stored within the local CPU memory. The distributed Redis cluster has been added as a CPU cache to store larger embedding tables and interact with the GPU embedding cache directly. To assist the Redis cluster with looking up missing embedding keys, the local RocksDB has been implemented to serve as a query engine to back up the complete embedding table on the local SSDs.
Try out our hugectr_wdl_prediction.ipynb Notebook. For more information, refer to Distributed Deployment.
We currently support the following tools:
- Data Generator: A configurable data generator, which is available from the Python interface, can be used to generate a synthetic dataset for benchmarking and research purposes.
- Preprocessing Script: We provide a set of scripts that form a template implementation to demonstrate how complex datasets, such as the original Criteo dataset, can be converted into HugeCTR-supported dataset formats such as Norm and Raw. They are used in all of our samples to prepare the data and train various recommender models.
The Norm (with Header) and Raw (without Header) datasets can be generated with hugectr.tools.DataGenerator. For categorical features, you can configure the probability distribution to be uniform or power-law within hugectr.tools.DataGeneratorParams. The default distribution is power law with alpha = 1.2.
- Generate the Norm dataset for DCN and start training the HugeCTR model:

  python3 ../tools/data_generator/dcn_norm_generate_train.py

- Generate the Norm dataset for WDL and start training the HugeCTR model:

  python3 ../tools/data_generator/wdl_norm_generate_train.py

- Generate the Raw dataset for DLRM and start training the HugeCTR model:

  python3 ../tools/data_generator/dlrm_raw_generate_train.py

- Generate the Parquet dataset for DCN and start training the HugeCTR model:

  python3 ../tools/data_generator/dcn_parquet_generate_train.py
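The same datasets can also be generated directly from Python instead of through the wrapper scripts above. The sketch below is an assumption-laden outline: the class and argument names (DataGeneratorParams, dist_type, and so on) follow recent releases and the values are placeholders.

```python
import hugectr
from hugectr.tools import DataGeneratorParams, DataGenerator

params = DataGeneratorParams(
    format=hugectr.DataReaderType_t.Norm,       # Norm (with header) dataset
    label_dim=1,
    dense_dim=13,
    num_slot=26,
    i64_input_key=False,
    source="./dcn_norm/file_list.txt",
    eval_source="./dcn_norm/file_list_test.txt",
    slot_size_array=[10000] * 26,               # placeholder cardinalities
    check_type=hugectr.Check_t.Sum,
    dist_type=hugectr.Distribution_t.PowerLaw)  # default power law, alpha = 1.2

DataGenerator(params).generate()
```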
Download the Criteo 1TB Click Logs dataset using HugeCTR/tools/preprocess.sh and preprocess it to train the DCN. The file_list.txt, file_list_test.txt, and preprocessed data files are available within the criteo_data directory. For more information, refer to our samples.
For example:
$ cd tools # assume that the downloaded dataset is here
$ bash preprocess.sh 1 criteo_data pandas 1 0