The purpose of this sample is to demonstrate how to build and train a Deep & Cross Network with HugeCTR.
You can set up the HugeCTR Docker environment by either pulling the pre-built NGC Docker image or building the HugeCTR Docker container on your own.
HugeCTR is available as buildable source code, but the easiest way to install and run HugeCTR is to pull the pre-built Docker image, which is available on the NVIDIA GPU Cloud (NGC). This method provides a self-contained, isolated, and reproducible environment for repetitive experiments.
- Pull the HugeCTR NGC Docker by running the following command:
$ docker pull nvcr.io/nvidia/merlin/merlin-training:22.03
- Launch the container in interactive mode with the HugeCTR root directory mounted into the container by running the following command:
$ docker run --gpus=all --rm -it --cap-add SYS_NICE -u $(id -u):$(id -g) -v $(pwd):/hugectr -w /hugectr nvcr.io/nvidia/merlin/merlin-training:22.03
If you want to build the HugeCTR Docker container on your own, refer to How to Start Your Development.
You should make sure that HugeCTR is built and installed in /usr/local/hugectr within the Docker container. Remember to set the ENABLE_MULTINODES option to ON when building HugeCTR if you want to try the multi-node training sample. You can launch the container in interactive mode in the same manner as shown above, and then set the PYTHONPATH environment variable inside the Docker container using the following command:
$ export PYTHONPATH=/usr/local/hugectr/lib:$PYTHONPATH
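For reference, a source build that enables multi-node support usually follows the standard CMake workflow described in How to Start Your Development. The commands below are only a sketch; every flag other than ENABLE_MULTINODES, as well as the install prefix, depends on your environment:
$ git clone --recursive https://github.com/NVIDIA-Merlin/HugeCTR.git hugectr
$ cd hugectr && mkdir -p build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_MULTINODES=ON ..
$ make -j && make install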
Go to the Criteo 1TB Click Logs dataset download page and download one of the dataset files into the "${project_root}/tools" directory.
As an alternative, you can run the following command:
$ cd ${project_root}/tools
$ wget http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_1.gz
NOTE: Replace 1 with a value from [0, 23] to use a different day.
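For example, to download the file for day 2 instead, run the following command:
$ wget http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_2.gz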
When running this sample, the Criteo 1TB Click Logs dataset is used. The dataset contains 24 files, each corresponding to one day of data. To reduce preprocessing time, only one file is used. Each sample consists of a label (0 if the ad wasn't clicked and 1 if it was) and 39 features (13 integer features and 26 categorical features). The dataset also has numerous missing values across the feature columns, which must be handled during preprocessing. During preprocessing, the amount of data is further reduced to speed up preprocessing, missing values are filled, and feature values that are considered rare are removed.
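If you want to take a quick look at the raw data before preprocessing, the following Python snippet is an illustrative sketch. It assumes the downloaded file has been decompressed to day_1 in the current directory, and the column names it assigns are hypothetical, used only for display:
# Peek at the raw Criteo day file: tab-separated, no header,
# one label plus 13 integer and 26 categorical features per line.
import pandas as pd

cols = ["label"] + ["I%d" % i for i in range(1, 14)] + ["C%d" % i for i in range(1, 27)]
df = pd.read_csv("day_1", sep="\t", names=cols, nrows=100000)

print(df["label"].value_counts())      # clicked (1) vs. not clicked (0)
print(df.isna().mean().sort_values())  # fraction of missing values per column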
After you've downloaded the dataset, you can use one of the following methods to prepare the dataset for HugeCTR training:
To preprocess the dataset through Pandas, run the following command:
$ bash preprocess.sh 1 criteo_data pandas 1 0
IMPORTANT NOTES:
- The first argument represents the dataset postfix. For example, if day_1 is used, its postfix is 1.
- The second argument, criteo_data, is where the preprocessed data is stored. You may want to change it in cases where multiple datasets are generated concurrently for various purposes. If you change it, the source and eval_source settings in your training configuration must be changed as well (see the sketch after this list).
- The fourth argument (the one after pandas) indicates whether normalization is applied to the dense features (1=ON, 0=OFF).
- The last argument determines whether feature crossing should be applied (1=ON, 0=OFF). It must remain set to 0.
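If you do change the output directory, the data reader in the training script has to point at the new location. The following is a minimal sketch of what that typically looks like with the HugeCTR Python interface; the directory and file-list names are assumptions, so check the files that preprocess.sh actually produces:
import hugectr

# Point these paths at the file lists generated by preprocess.sh.
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Norm,
    source=["./my_criteo_data/file_list.txt"],          # training data ("source")
    eval_source="./my_criteo_data/file_list_test.txt",  # evaluation data ("eval_source")
    check_type=hugectr.Check_t.Sum,
)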
HugeCTR supports data processing through NVTabular. Make sure that the NVTabular Docker environment has been set up successfully. For more information, refer to the NVTabular GitHub repository. Ensure that you're using the latest version of NVTabular and that the HugeCTR ${project_root} volume is mounted into the NVTabular Docker container.
Execute the following preprocessing command:
$ bash preprocess.sh 1 criteo_data nvt 1 0 0
IMPORTANT NOTES:
- The first and second arguments are the same as in the Pandas mode shown above.
- If you want to generate binary data using the Norm data format instead of the Parquet data format, set the fourth argument (the one after nvt) to 0, as shown in the example after this list. Generating binary data with the Norm data format can take much longer than with the Parquet data format because of the additional conversion process. Use the NVTabular binary mode if you encounter an issue with Pandas mode.
- The last argument determines whether feature crossing should be applied. It must remain set to 0 (OFF).
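For example, to produce the binary Norm output instead of the Parquet output, set that argument to 0:
$ bash preprocess.sh 1 criteo_data nvt 0 0 0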
Run the following command after preprocessing the dataset with Pandas:
$ python3 ../samples/dcn/dcn.py
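The dcn.py script is written with the HugeCTR Python interface. The outline below sketches the general structure of such a script rather than the exact contents of dcn.py; the hyperparameters, paths, and omitted layer definitions are assumptions:
import hugectr

# Illustrative values only; dcn.py defines its own hyperparameters and layers.
solver = hugectr.CreateSolver(
    max_eval_batches=1280,
    batchsize_eval=1024,
    batchsize=1024,
    lr=0.001,
    vvgpu=[[0]],            # single node, GPU 0
    repeat_dataset=True,
)
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Norm,
    source=["./criteo_data/file_list.txt"],
    eval_source="./criteo_data/file_list_test.txt",
    check_type=hugectr.Check_t.Sum,
)
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.Adam)

model = hugectr.Model(solver, reader, optimizer)
# model.add(...) calls (omitted here) define the input, the sparse embedding,
# the cross network, the MLP, and the binary cross-entropy loss.
model.compile()
model.summary()
model.fit(max_iter=2300, display=200, eval_interval=1000)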
Run one of the following commands after preprocessing the dataset with NVTabular using either the Parquet or Binary output:
Parquet Output
$ python3 ../samples/dcn/dcn_parquet.py
Binary Output
$ python3 ../samples/dcn/dcn_bin.py
NOTE: If you want to generate binary data using the Norm data format instead of the Parquet data format, set the fourth argument (the one after nvt) to 0 during preprocessing. Generating binary data with the Norm data format can take much longer than with the Parquet data format because of the additional conversion process. Use the NVTabular binary mode if you encounter an issue with Pandas mode.
To train with localized slot embedding, run the following command after preprocessing the dataset with Pandas:
$ python3 ../samples/dcn/dcn_localized_embedding.py
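Compared with dcn.py, the localized variant presumably differs mainly in the embedding type: HugeCTR's localized-slot embedding keeps each slot's embedding table on a single GPU instead of distributing every table across all GPUs. The fragment below is a hedged sketch of how that choice is expressed in the Python interface, continuing the model outline shown earlier; all parameter values are assumptions:
# Localized-slot embedding: each slot is stored on one GPU.
model.add(hugectr.SparseEmbedding(
    embedding_type=hugectr.Embedding_t.LocalizedSlotSparseEmbeddingHash,
    workspace_size_per_gpu_in_mb=300,   # illustrative value
    embedding_vec_size=16,              # illustrative value
    combiner="sum",
    sparse_embedding_name="sparse_embedding1",
    bottom_name="data1",
    optimizer=optimizer,
))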
To train with multiple nodes, run the following command after preprocessing the dataset with Pandas:
$ mpirun -np 2 python3 ../samples/dcn/dcn_2node_8gpu.py
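With mpirun, each of the two ranks drives one node, and the GPU mapping in the solver lists the GPUs of every node: the vvgpu argument is a list of per-node GPU lists, and its length must match the number of ranks. The snippet below illustrates that mapping with two nodes of four GPUs each; the values are assumptions, not the exact configuration in dcn_2node_8gpu.py:
import hugectr

# Two ranks (-np 2), so vvgpu has two entries: one GPU list per node.
solver = hugectr.CreateSolver(
    batchsize=16384,
    lr=0.001,
    vvgpu=[[0, 1, 2, 3], [0, 1, 2, 3]],
    repeat_dataset=True,
)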