
Commit 0ed9e7b

Migrated torchrec_dlrm inference models_v2 format (#2341)
Signed-off-by: Minh1 Le <[email protected]>
Signed-off-by: Mahathi Vatsal <[email protected]>
1 parent 1468ea3 commit 0ed9e7b

30 files changed, +36290 −3073 lines changed

README.md

+1-1
@@ -109,7 +109,7 @@ For best performance on Intel® Data Center GPU Flex and Max Series, please chec
 | [Wide & Deep](https://arxiv.org/pdf/1606.07792.pdf) | TensorFlow | Inference | [FP32](/benchmarks/recommendation/tensorflow/wide_deep/inference/README.md) | [Census Income dataset](https://github.com/IntelAI/models/tree/master/benchmarks/recommendation/tensorflow/wide_deep/inference/fp32#dataset) |
 | [DLRM](https://arxiv.org/pdf/1906.00091.pdf) | PyTorch | Inference | [FP32 Int8 BFloat16 BFloat32](/models_v2/pytorch/dlrm/inference/cpu/README.md) | [Criteo Terabyte](/models_v2/pytorch/dlrm/inference/cpu/README.md#datasets) |
 | [DLRM](https://arxiv.org/pdf/1906.00091.pdf) | PyTorch | Training | [FP32 BFloat16 BFloat32](/models_v2/pytorch/dlrm/training/cpu/README.md) | [Criteo Terabyte](/models_v2/pytorch/dlrm/training/cpu/README.md#datasets) |
-| [DLRM v2](https://arxiv.org/pdf/1906.00091.pdf) | PyTorch | Inference | [FP32 FP16 BFloat16 BFloat32 Int8](/quickstart/recommendation/pytorch/torchrec_dlrm/inference/cpu/README.md) | [Criteo 1TB Click Logs dataset](/quickstart/recommendation/pytorch/torchrec_dlrm/inference/cpu#datasets) |
+| [DLRM v2](https://arxiv.org/pdf/1906.00091.pdf) | PyTorch | Inference | [FP32 FP16 BFloat16 BFloat32 Int8](/models_v2/pytorch/torchrec_dlrm/inference/cpu/README.md) | [Criteo 1TB Click Logs dataset](/quickstart/recommendation/pytorch/torchrec_dlrm/inference/cpu#datasets) |

 ### Diffusion

models_v2/pytorch/torchrec_dlrm/inference/cpu/README.md

@@ -0,0 +1,110 @@
# DLRM v2 Inference

DLRM v2 inference best-known configurations with Intel® Extension for PyTorch.

## Model Information

| **Use Case** | **Framework** | **Model Repo** | **Branch/Commit/Tag** | **Optional Patch** |
|:---:| :---: |:--------------:|:---------------------:|:------------------:|
| Inference | PyTorch | https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm | - | - |

# Pre-Requisite
## Bare Metal
### General setup

Follow [this link](https://github.com/IntelAI/models/blob/master/docs/general/pytorch/BareMetalSetup.md) to build PyTorch, IPEX, TorchVision, and TCMalloc.

### Model Specific Setup

* Installation of [PyTorch + IPEX + TorchVision, Jemalloc and TCMalloc](https://github.com/IntelAI/models/blob/master/docs/general/pytorch/BareMetalSetup.md)
* Installation of [oneccl-bind-pt](https://pytorch-extension.intel.com/release-whl/stable/cpu/us/oneccl-bind-pt/) (if running distributed)
* Set the Jemalloc and TCMalloc preload for better performance.

Jemalloc and TCMalloc should be built as described in the [General setup](#general-setup) section.
```
export LD_PRELOAD="<path to the jemalloc directory>/lib/libjemalloc.so":"path_to/tcmalloc/lib/libtcmalloc.so":$LD_PRELOAD
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000"
```
* Set the IOMP preload for better performance:
```
pip install packaging intel-openmp
export LD_PRELOAD=path/lib/libiomp5.so:$LD_PRELOAD
```

* Set the following environment variable to use AMX if you are running on SPR (Sapphire Rapids):
```bash
export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX
```
* Set the following environment variable to use FP16 AMX if you are running on a supported platform:
```
export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16
```
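If you want to confirm that the AMX path is actually picked up, one option (a general oneDNN debugging aid, not something this commit configures) is to enable oneDNN verbose logging for a short run and look for AMX in the reported ISA and kernel names:
```bash
# Optional sanity check; assumes the workload dispatches through oneDNN kernels.
# Unset it again afterwards, since verbose logging adds overhead.
export ONEDNN_VERBOSE=1
```
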
## Datasets
The dataset can be downloaded and preprocessed by following https://github.com/mlcommons/training/tree/master/recommendation_v2/torchrec_dlrm#create-the-synthetic-multi-hot-dataset.
We also provide a preprocessing script, `preprocess_raw_dataset.sh`, based on the instructions above.
After you download the raw dataset files `day_*.gz` and unzip them into RAW_DIR, run:
```bash
cd <AI Reference Models>/models_v2/pytorch/torchrec_dlrm/inference/cpu
export MODEL_DIR=$(pwd)
export RAW_DIR=<the unzipped raw dataset>
export TEMP_DIR=<where you choose to put the temp files during preprocessing>
export PREPROCESSED_DIR=<where you choose to put the one-hot dataset>
export MULTI_HOT_DIR=<where you choose to put the multi-hot dataset>
bash preprocess_raw_dataset.sh
```
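As a quick check of the result, the multi-hot directory should end up holding per-day dense, sparse and label files. This layout is inferred from the filenames the bundled `dlrm_dataloader.py` looks for, not from the upstream instructions:
```bash
# Expected contents of $MULTI_HOT_DIR after preprocessing (illustrative):
#   day_0_dense.npy   day_0_sparse_multi_hot.npz   day_0_labels.npy
#   ...
#   day_23_dense.npy  day_23_sparse_multi_hot.npz  day_23_labels.npy
ls "$MULTI_HOT_DIR"
```
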
## Pre-Trained checkpoint
You can download and unzip the checkpoint by following
https://github.com/mlcommons/inference/tree/master/recommendation/dlrm_v2/pytorch#downloading-model-weights

## Inference
1. `git clone https://github.com/IntelAI/models.git`
2. `cd models/models_v2/pytorch/torchrec_dlrm/inference/cpu`
3. Create the virtual environment `venv` and activate it:
```
python3 -m venv venv
. ./venv/bin/activate
```
4. Install the general model requirements:
```
./setup.sh
```
5. Install the latest CPU versions of [torch, torchvision and intel_extension_for_pytorch](https://intel.github.io/intel-extension-for-pytorch/index.html#installation).

6. Set the required environment parameters:

| **Parameter** | **export command** |
|:---------------------------:|:------------------------------------------------------------------------------------:|
| **TEST_MODE** (THROUGHPUT, ACCURACY) | `export TEST_MODE=THROUGHPUT` |
| **DATASET_DIR** | `export DATASET_DIR=<multi-hot dataset dir>` |
| **WEIGHT_DIR** (ONLY FOR ACCURACY) | `export WEIGHT_DIR=<officially released checkpoint>` |
| **PRECISION** | `export PRECISION=int8 <specify the precision to run: int8, fp32, bf32 or bf16>` |
| **OUTPUT_DIR** | `export OUTPUT_DIR=$PWD` |
| **BATCH_SIZE** (optional) | `export BATCH_SIZE=10000` |
7. Run `run_model.sh`; a consolidated example follows below.
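For instance, a bf16 throughput run could be launched as follows. This is an illustrative sketch only: the dataset path is a placeholder, and every variable comes from the table above.
```bash
# Illustrative throughput run (adjust the placeholder paths to your environment).
export TEST_MODE=THROUGHPUT
export DATASET_DIR=/data/criteo_multi_hot   # your multi-hot dataset directory
export PRECISION=bf16
export OUTPUT_DIR=$PWD
export BATCH_SIZE=10000                     # optional
bash run_model.sh
```
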
## Output

Single-tile output will typically look like:

```
accuracy 76.215 %, best 76.215 %
dlrm_inf latency: 0.11193203926086426 s
dlrm_inf avg time: 0.007462135950724284 s, ant the time count is : 15
dlrm_inf throughput: 4391235.996821996 samples/s
```

Final results of the inference run can be found in the `results.yaml` file:
```
results:
 - key: throughput
   value: 4391236.0
   unit: inst/s
 - key: latency
   value: 0.007462135950724283
   unit: s
 - key: accuracy
   value: 76.215
   unit: accuracy
```

models_v2/pytorch/torchrec_dlrm/inference/cpu/__init__.py

Whitespace-only changes.

models_v2/pytorch/torchrec_dlrm/inference/cpu/data_process/__init__.py

Whitespace-only changes.

models_v2/pytorch/torchrec_dlrm/training/gpu/data/dlrm_dataloader.py renamed to models_v2/pytorch/torchrec_dlrm/inference/cpu/data_process/dlrm_dataloader.py

+15-43
@@ -1,3 +1,6 @@
+#
+# -*- coding: utf-8 -*-
+#
 # Copyright (c) 2023 Intel Corporation
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -12,6 +15,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+
 #!/usr/bin/env python3
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 #
@@ -31,8 +35,6 @@
     DEFAULT_INT_NAMES,
     InMemoryBinaryCriteoIterDataPipe,
 )
-# This is for crop dataset
-DAYS_MIN=1
 from torchrec.datasets.random import RandomRecDataset

 # OSS import
@@ -96,44 +98,18 @@ def _get_in_memory_dataloader(
         sparse_part = "sparse_multi_hot.npz"
         datapipe = MultiHotCriteoIterDataPipe

-    if args.dataset_name == "criteo_kaggle":
-        # criteo_kaggle has no validation set, so use 2nd half of training set for now.
-        # Setting stage to "test" will get the 2nd half of the dataset.
-        # Setting root_name to "train" reads from the training set file.
-        (root_name, stage) = ("train", "test") if stage == "val" else stage
+    if stage == "train":
         stage_files: List[List[str]] = [
-            [os.path.join(dir_path, f"{root_name}_dense.npy")],
-            [os.path.join(dir_path, f"{root_name}_{sparse_part}")],
-            [os.path.join(dir_path, f"{root_name}_labels.npy")],
+            [os.path.join(dir_path, f"day_{i}_dense.npy") for i in range(DAYS - 1)],
+            [os.path.join(dir_path, f"day_{i}_{sparse_part}") for i in range(DAYS - 1)],
+            [os.path.join(dir_path, f"day_{i}_labels.npy") for i in range(DAYS - 1)],
         ]
-    # criteo_1tb code path uses below two conditionals
-    elif stage == "train":
-        if args.converge:
-            stage_files: List[List[str]] = [
-                [os.path.join(dir_path, f"day_{i}_dense.npy") for i in range(DAYS - 1)],
-                [os.path.join(dir_path, "multihot", f"day_{i}_{sparse_part}") for i in range(DAYS - 1)],
-                [os.path.join(dir_path, f"day_{i}_labels.npy") for i in range(DAYS - 1)],
-            ]
-        else:
-            stage_files: List[List[str]] = [
-                # for crop dataset
-                [os.path.join(dir_path, f"day_{i}_dense.npy") for i in range(DAYS_MIN)],
-                [os.path.join(dir_path, f"day_{i}_{sparse_part}") for i in range(DAYS_MIN)],
-                [os.path.join(dir_path, f"day_{i}_labels.npy") for i in range(DAYS_MIN)],
-            ]
     elif stage in ["val", "test"]:
-        if args.converge:
-            stage_files: List[List[str]] = [
-                [os.path.join(dir_path, f"day_{DAYS-1}_dense.npy")],
-                [os.path.join(dir_path, "multihot", f"day_{DAYS-1}_{sparse_part}")],
-                [os.path.join(dir_path, f"day_{DAYS-1}_labels.npy")],
-            ]
-        else:
-            stage_files: List[List[str]] = [
-                [os.path.join(dir_path, f"day_{DAYS_MIN-1}_dense.npy")],
-                [os.path.join(dir_path, f"day_{DAYS_MIN-1}_{sparse_part}")],
-                [os.path.join(dir_path, f"day_{DAYS_MIN-1}_labels.npy")],
-            ]
+        stage_files: List[List[str]] = [
+            [os.path.join(dir_path, f"day_{DAYS-1}_dense.npy")],
+            [os.path.join(dir_path, f"day_{DAYS-1}_{sparse_part}")],
+            [os.path.join(dir_path, f"day_{DAYS-1}_labels.npy")],
+        ]
     if stage in ["val", "test"] and args.test_batch_size is not None:
         batch_size = args.test_batch_size
     else:
@@ -143,11 +119,8 @@ def _get_in_memory_dataloader(
             stage,
             *stage_files, # pyre-ignore[6]
             batch_size=batch_size,
-            #rank=dist.get_rank(),
-            #world_size=dist.get_world_size(),
-            # The rand and world_size set for custom dist-dlrm
-            rank=0,
-            world_size=1,
+            rank=0, # dist.get_rank(),
+            world_size=1, # dist.get_world_size(),
             drop_last=args.drop_last_training_batch if stage == "train" else False,
             shuffle_batches=args.shuffle_batches,
             shuffle_training_set=args.shuffle_training_set,
@@ -158,7 +131,6 @@ def _get_in_memory_dataloader(
             else ([args.num_embeddings] * CAT_FEATURE_COUNT),
         ),
         batch_size=None,
-        num_workers=0,
         pin_memory=args.pin_memory,
         collate_fn=lambda x: x,
     )

models_v2/pytorch/torchrec_dlrm/training/gpu/data/multi_hot_criteo.py renamed to models_v2/pytorch/torchrec_dlrm/inference/cpu/data_process/multi_hot_criteo.py

+27-1
@@ -1,3 +1,21 @@
+#
+# -*- coding: utf-8 -*-
+#
+# Copyright (c) 2023 Intel Corporation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
 #!/usr/bin/env python3
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 #
@@ -214,7 +232,7 @@ def _np_arrays_to_batch(
         offset_per_key = torch.cumsum(
             torch.concat((torch.tensor([0]), torch.tensor(length_per_key))), dim=0
         )
-        values = torch.concat([torch.from_numpy(feat).flatten() for feat in sparse])
+        values = torch.concat([torch.from_numpy(feat.copy()).flatten() for feat in sparse])
         return Batch(
             dense_features=torch.from_numpy(dense.copy()),
             sparse_features=KeyedJaggedTensor(
@@ -308,3 +326,11 @@ def append_to_buffer(

     def __len__(self) -> int:
         return self.num_full_batches // self.world_size + (self.last_batch_sizes[0] > 0)
+
+    def load_batch(self, sample_list=None) -> Batch:
+        if sample_list is None:
+            sample_list = list(range(self.batch_size))
+        dense = self.dense_arrs[0][sample_list, :]
+        sparse = [arr[sample_list, :] % self.hashes[i] for i, arr in enumerate(self.sparse_arrs[0])]
+        labels = self.labels_arrs[0][sample_list, :]
+        return self._np_arrays_to_batch(dense, sparse, labels)
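
A hypothetical usage sketch of the new `load_batch` helper (not part of the commit): it assumes `datapipe` is an already constructed `MultiHotCriteoIterDataPipe`, for example the one built by `dlrm_dataloader.py` above.
```python
# Hypothetical sketch: pull one fixed Batch without iterating the whole pipe,
# e.g. for warm-up or a quick smoke test. `datapipe` is assumed to be a
# MultiHotCriteoIterDataPipe instance created elsewhere.
warmup_batch = datapipe.load_batch()                # first `batch_size` samples of the first loaded day
small_batch = datapipe.load_batch(list(range(8)))   # or an explicit list of sample indices
print(small_batch.dense_features.shape, small_batch.labels.shape)
```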

0 commit comments
