Several steps are needed to run the benchmark.
- Select a docker image from Merlin TensorFlow:

  ```shell
  docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
  ```
- Launch the Merlin TensorFlow container with the following command:

  ```shell
  docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8786 --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
  ```
- Install SOK with the HKV backend from source:

  ```shell
  git clone ssh://[email protected]:12051/dl/hugectr.git
  cd hugectr
  git submodule init && git submodule update
  cd sparse_operation_kit
  mkdir build && cd build
  cmake -DSM={your SM version} ..
  make -j && make install
  rm -rf /usr/local/lib/python3.10/dist-packages/merlin_sok-1.x-py3.10-linux-x86_64.egg
  cp -r ../sparse_operation_kit /usr/local/lib/python3.10/dist-packages/
  ```
Please generate training data according to the DLRM DCNV2 documentation.
- Go to the working directory:

  ```shell
  cd documents/tutorials/DLRM_Benchmark
  ```
- Prepare the Criteo Terabyte dataset:

  ```shell
  # train_data.bin and test_data.bin are the binary datasets generated by HugeCTR
  # splited_dataset is the target directory in which to save the split dataset
  python3 ./preprocess/split_bin.py /path/to/train_data.bin splited_dataset/train --slot_size_array="[39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36]"
  python3 ./preprocess/split_bin.py /path/to/test_data.bin splited_dataset/test --slot_size_array="[39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36]"
  ```
- Run the benchmark:

  Typically, one GPU is allocated per process, so if a server has 4 GPUs, you will run 4 processes. In `horovodrun`, the number of processes is specified with the `-np` flag.

  ```shell
  # batch size = 65536
  horovodrun -np ${num_gpus} ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=65536 --epochs=100 --lr=24

  # batch size = 32768
  horovodrun -np ${num_gpus} ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=32768 --epochs=100 --lr=24
  ```
There are three key options when creating an HKV instance:

- `init_capacity`: the maximum number of KV pairs that HKV can hold when it is first created.
- `max_capacity`: the maximum number of KV pairs that HKV can hold once it has grown to its stable size (through the end of training). During training, if the load factor exceeds a threshold, HKV's capacity is doubled (see the sketch after this list), but it never exceeds `max_capacity`.
- `max_hbm_for_vectors`: the maximum amount of HBM that HKV may use to store values (vectors, embeddings). HKV does not allocate all of it up front; it acquires memory on demand. Make sure the system can actually satisfy this requirement, or the program will crash.
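Here is a minimal Python sketch of that growth rule, assuming a simple load-factor threshold of 0.5; the actual threshold and rehashing mechanics are internal to HKV and not shown here:

```python
# Illustrative only: capacity doubles whenever the load factor crosses a
# threshold, but is always capped at max_capacity.
def grown_capacity(num_pairs: int, capacity: int, max_capacity: int,
                   load_factor_threshold: float = 0.5) -> int:
    while capacity < max_capacity and num_pairs / capacity > load_factor_threshold:
        capacity = min(capacity * 2, max_capacity)
    return capacity

capacity = 1024 * 1024               # init_capacity
max_capacity = 64 * 1024 * 1024
for num_pairs in (600_000, 1_200_000, 40_000_000):
    capacity = grown_capacity(num_pairs, capacity, max_capacity)
    print(num_pairs, capacity)       # grows to 2 Mi, 4 Mi, then caps at 64 Mi
```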
The optimizer can also be changed with the `--optimizer_name` flag (for example, `--optimizer_name=adagrad`); the supported values are `sgd`, `adamax`, `adagrad`, `adadelta`, and `ftrl`.
When we choose HKV as the backend of SOK, `DynamicVariable` should be initialized in this way:

```python
self._sok_embedding = sok.DynamicVariable(
    var_type="hybrid",
    dimension=self._embedding_vec_size,  # 128 for the Criteo Terabyte dataset
)
```
By default, `init_capacity` and `max_capacity` of HKV are both set to `64 * 1024 * 1024`, and `max_hbm_for_vectors` is 16 GB.
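A quick back-of-the-envelope check of these defaults (assuming float32 values and the 128-dimensional embeddings used in this tutorial) shows that the values alone do not fit in the default HBM budget at full capacity, so HKV keeps the remainder in host memory:

```python
# Sizing the default HKV configuration (assumption: float32 values,
# dimension=128 as in the Criteo Terabyte setup above).
max_capacity = 64 * 1024 * 1024      # default number of KV pairs
dimension = 128                      # embedding vector size
element_size = 4                     # bytes per float32 element

total_value_bytes = max_capacity * dimension * element_size
print(total_value_bytes / 2**30)     # 32.0 GiB of values in total

# With the default max_hbm_for_vectors of 16 GB, only about half of the
# values fit in HBM at full capacity; HKV stores the rest in host memory.
```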
We can also customize the configuration of HKV:

```python
self._sok_embedding = sok.DynamicVariable(
    var_type="hybrid",
    dimension=self._embedding_vec_size,  # 128 for the Criteo Terabyte dataset
    init_capacity=1024 * 1024,
    max_capacity=1024 * 1024,
    max_hbm_for_vectors=30,  # unit: GB
)
```
Be careful when setting `max_hbm_for_vectors`; three factors affect the right value (a rough estimator is sketched after this list):

- Total HBM size.
- Type of optimizer.
- Batch size.

These factors limit the HBM that is actually available to HKV. If the value is not chosen appropriately, the program is at risk of running out of memory.
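As a rough way to reason about the budget, the sketch below assumes float32 values and that each optimizer keeps the listed number of extra state slots per embedding row; these slot counts are standard for the optimizers named above, but the formula is illustrative, not an official SOK/HKV sizing rule:

```python
# Illustrative HBM estimate for embedding values plus optimizer state.
# Assumptions: float32 everywhere, and one extra full copy of every embedding
# row per optimizer slot (sgd keeps none, adagrad one, the others two).
OPTIMIZER_SLOTS = {"sgd": 0, "adamax": 2, "adagrad": 1, "adadelta": 2, "ftrl": 2}

def values_and_state_gib(max_capacity: int, dimension: int,
                         optimizer_name: str, element_size: int = 4) -> float:
    slots = OPTIMIZER_SLOTS[optimizer_name]
    return max_capacity * dimension * element_size * (1 + slots) / 2**30

print(values_and_state_gib(64 * 1024 * 1024, 128, "sgd"))      # 32.0 GiB
print(values_and_state_gib(64 * 1024 * 1024, 128, "adagrad"))  # 64.0 GiB

# Activations and gradients also grow with the batch size and compete for the
# same HBM, which is why larger batches can force a smaller max_hbm_for_vectors.
```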
Note that HKV will not consume more resources than it needs: it uses only `max_capacity * dimension * elementSize` bytes to store embeddings when that amount is less than `max_hbm_for_vectors`.
The `max_hbm_for_vectors` values used for each combination of batch size and optimizer are listed below:

| batch size \ optimizer | SGD | Adamax | Adagrad | Adadelta | Ftrl |
| --- | --- | --- | --- | --- | --- |
| 32768 | 60G | 20G | 35G | 20G | 20G |
| 65536 | 60G | 20G | 20G | 20G | 20G |
| 131072 | 60G | 20G | 20G | 20G | 10G |
| 262144 | 60G | 20G | 20G | 20G | 10G |
The benchmark results:

| batch size | exit criteria | frequency of evaluation | xla | amp | training time (minutes) | evaluating time (minutes) | total time (minutes) | average time of iteration (ms) | throughput (samples/second) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 65536 | 1 epoch | at end | yes | yes | no | yes | 8.79 | 0.10 | 4.16M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 6.72 | 0.09 | 3.45M |