Several steps are needed to run the benchmark.
- Select a docker image from Merlin TensorFlow:

  ```shell
  docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
  ```
- Launch the Merlin TensorFlow container with the following command:

  ```shell
  docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8786 --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
  ```
- Install SOK with the HKV backend from source:

  ```shell
  git clone ssh://[email protected]:12051/dl/hugectr.git
  cd hugectr
  git submodule init && git submodule update
  cd sparse_operation_kit
  mkdir build && cd build
  cmake -DSM={your SM version} ..
  make -j && make install
  rm -rf /usr/local/lib/python3.10/dist-packages/merlin_sok-1.x-py3.10-linux-x86_64.egg
  cp -r ../sparse_operation_kit /usr/local/lib/python3.10/dist-packages/
  ```
Please generate training data according to the DLRM DCNV2 documentation.
- Go to the working directory:

  ```shell
  cd documents/tutorials/DLRM_Benchmark
  ```
- Prepare the Criteo Terabyte dataset:

  ```shell
  # train_data.bin and test_data.bin are the binary datasets generated by HugeCTR
  # splited_dataset is the target directory in which to save the split dataset
  python3 ./preprocess/split_bin.py /path/to/train_data.bin splited_dataset/train --slot_size_array="[39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36]"
  python3 ./preprocess/split_bin.py /path/to/test_data.bin splited_dataset/test --slot_size_array="[39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36]"
  ```
- Run the benchmark:

  Typically, one GPU is allocated per process, so if a server has 4 GPUs, you will run 4 processes. In `horovodrun`, the number of processes is specified with the `-np` flag.

  ```shell
  # batch size = 65536
  horovodrun -np ${num_gpus} ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=65536 --epochs=100 --lr=24

  # batch size = 32768
  horovodrun -np ${num_gpus} ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=32768 --epochs=100 --lr=24
  ```
There are three key options when creating an HKV instance:

- `init_capacity`: the maximum number of KV pairs that HKV can hold when it is first created.
- `max_capacity`: the maximum number of KV pairs that HKV can hold once it has grown to its stable size (through the end of training). During training, if the load factor exceeds a threshold, HKV's capacity is doubled (see the sketch after this list), but it never exceeds `max_capacity`.
- `max_hbm_for_vectors`: the maximum amount of HBM that HKV may use to store values (vectors, embeddings). HKV does not allocate all of it up front; it acquires memory on demand. Make sure the system can actually satisfy this requirement, or the program will crash.
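Here is a minimal Python sketch of that growth rule, assuming a simple load-factor threshold of 0.5; the actual threshold and rehashing mechanics are internal to HKV and not shown here:

```python
# Illustrative only: capacity doubles whenever the load factor crosses a
# threshold, but is always capped at max_capacity.
def grown_capacity(num_pairs: int, capacity: int, max_capacity: int,
                   load_factor_threshold: float = 0.5) -> int:
    while capacity < max_capacity and num_pairs / capacity > load_factor_threshold:
        capacity = min(capacity * 2, max_capacity)
    return capacity

capacity = 1024 * 1024               # init_capacity
max_capacity = 64 * 1024 * 1024
for num_pairs in (600_000, 1_200_000, 40_000_000):
    capacity = grown_capacity(num_pairs, capacity, max_capacity)
    print(num_pairs, capacity)       # grows to 2 Mi, 4 Mi, then caps at 64 Mi
```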
The optimizer can also be changed with the `--optimizer_name` flag (for example, `--optimizer_name=adagrad`); the supported values are `sgd`, `adamax`, `adagrad`, `adadelta`, and `ftrl`.
When we choose HKV as the backend of SOK, `DynamicVariable` should be initialized in this way:

```python
self._sok_embedding = sok.DynamicVariable(
    var_type="hybrid",
    dimension=self._embedding_vec_size,  # 128 for the Criteo Terabyte dataset
)
```
By default, `init_capacity` and `max_capacity` of HKV are both set to `64 * 1024 * 1024`, and `max_hbm_for_vectors` is 16 GB.
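A quick back-of-the-envelope check of these defaults (assuming float32 values and the 128-dimensional embeddings used in this tutorial) shows that the values alone do not fit in the default HBM budget at full capacity, so HKV keeps the remainder in host memory:

```python
# Sizing the default HKV configuration (assumption: float32 values,
# dimension=128 as in the Criteo Terabyte setup above).
max_capacity = 64 * 1024 * 1024      # default number of KV pairs
dimension = 128                      # embedding vector size
element_size = 4                     # bytes per float32 element

total_value_bytes = max_capacity * dimension * element_size
print(total_value_bytes / 2**30)     # 32.0 GiB of values in total

# With the default max_hbm_for_vectors of 16 GB, only about half of the
# values fit in HBM at full capacity; HKV stores the rest in host memory.
```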
We can also customize the configuration of HKV:

```python
self._sok_embedding = sok.DynamicVariable(
    var_type="hybrid",
    dimension=self._embedding_vec_size,  # 128 for the Criteo Terabyte dataset
    init_capacity=1024 * 1024,
    max_capacity=1024 * 1024,
    max_hbm_for_vectors=30,  # unit: GB
)
```
Be careful when setting `max_hbm_for_vectors`; three factors affect the right value (a rough estimator is sketched after this list):

- Total HBM size.
- Type of optimizer.
- Batch size.

These factors limit the HBM that is actually available to HKV. If the value is not chosen appropriately, the program is at risk of running out of memory.
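As a rough way to reason about the budget, the sketch below assumes float32 values and that each optimizer keeps the listed number of extra state slots per embedding row; these slot counts are standard for the optimizers named above, but the formula is illustrative, not an official SOK/HKV sizing rule:

```python
# Illustrative HBM estimate for embedding values plus optimizer state.
# Assumptions: float32 everywhere, and one extra full copy of every embedding
# row per optimizer slot (sgd keeps none, adagrad one, the others two).
OPTIMIZER_SLOTS = {"sgd": 0, "adamax": 2, "adagrad": 1, "adadelta": 2, "ftrl": 2}

def values_and_state_gib(max_capacity: int, dimension: int,
                         optimizer_name: str, element_size: int = 4) -> float:
    slots = OPTIMIZER_SLOTS[optimizer_name]
    return max_capacity * dimension * element_size * (1 + slots) / 2**30

print(values_and_state_gib(64 * 1024 * 1024, 128, "sgd"))      # 32.0 GiB
print(values_and_state_gib(64 * 1024 * 1024, 128, "adagrad"))  # 64.0 GiB

# Activations and gradients also grow with the batch size and compete for the
# same HBM, which is why larger batches can force a smaller max_hbm_for_vectors.
```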
Note that HKV will not consume more resources than it needs: it uses only `max_capacity * dimension * elementSize` bytes to store embeddings when that amount is less than `max_hbm_for_vectors`.
The `max_hbm_for_vectors` values used for each combination of batch size and optimizer are listed below:

| batch size \ optimizer | SGD | Adamax | Adagrad | Adadelta | Ftrl |
| --- | --- | --- | --- | --- | --- |
| 32768 | 60G | 20G | 35G | 20G | 20G |
| 65536 | 60G | 20G | 20G | 20G | 20G |
| 131072 | 60G | 20G | 20G | 20G | 10G |
| 262144 | 60G | 20G | 20G | 20G | 10G |
The benchmark results:

| batch size | exit criteria | frequency of evaluation | xla | amp | training time (minutes) | evaluating time (minutes) | total time (minutes) | average time of iteration (ms) | throughput (samples/second) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 65536 | 1 epoch | at end | yes | yes | no | yes | 8.79 | 0.10 | 4.16M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 6.72 | 0.09 | 3.45M |