Benchmark DLRM DCNV2 using TF + SOK + HKV

Follow the steps below to run the benchmark.

Environment

  1. Pull the Merlin TensorFlow docker image:

    docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
  2. Launch the Merlin TensorFlow container with the following command:

    docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8786 --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
  3. Install SOK+HKV from source:

    git clone https://github.com/NVIDIA-Merlin/HugeCTR.git hugectr
    cd hugectr
    git submodule init && git submodule update
    cd sparse_operation_kit
    mkdir build && cd build
    cmake -DSM={your SM version} ..
    make -j && make install
    rm -rf /usr/local/lib/python3.10/dist-packages/merlin_sok-1.x-py3.10-linux-x86_64.egg
    cp -r ../sparse_operation_kit /usr/local/lib/python3.10/dist-packages/
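
After installation, a quick import check inside the container confirms that the freshly built SOK is picked up. This is a minimal sanity check, assuming the package is importable as sparse_operation_kit and a GPU is visible:

    # Run inside the container after "make install".
    import sparse_operation_kit as sok

    sok.init()  # initializes SOK; fails if no GPU is visible
    print("SOK is installed and initialized")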

How to Prepare the Dataset

Please generate training data according to the DLRM DCNV2 documentation.

Benchmark

  1. Go to the work directory:

    cd documents/tutorials/SOK_DLRM_Benchmark
  2. Prepare the Criteo Terabyte dataset:

    # train_data.bin and test_data.bin are the binary datasets generated by HugeCTR
    # {splited_dataset} is the target directory to save the dataset
    python3 ./preprocess/split_bin.py /path/to/train_data.bin splited_dataset/train --slot_size_array="[39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36]"
    python3 ./preprocess/split_bin.py /path/to/test_data.bin splited_dataset/test --slot_size_array="[39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36]"
  3. Run the benchmark:

    Typically one GPU is allocated per process, so if a server has 4 GPUs, you will run 4 processes. In horovodrun, the number of processes is specified with the -np flag. The sketch after these commands shows how each process binds to its GPU.

    # batch size = 65536
    horovodrun -np ${num_gpus} ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=65536 --epochs=100 --lr=24
    
    
    # batch size = 32768
    horovodrun -np ${num_gpus} ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=32768 --epochs=100 --lr=24
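
For reference, this is the standard Horovod pattern that gives each process its own GPU; a sketch of what hvd_wrapper.sh and main.py rely on, not the benchmark's exact code:

    import horovod.tensorflow as hvd
    import tensorflow as tf

    hvd.init()  # horovodrun starts one copy of this script per process
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        # Bind this process to the GPU matching its Horovod local rank.
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")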
    

Details about Customized Tests

1. Initialize HKV

There are three key options when creating an HKV instance:

  • init_capacity: The maximum number of KV pairs HKV can hold when it is first created.
  • max_capacity: The maximum number of KV pairs HKV can hold at any point during training. If the load factor exceeds a threshold during training, HKV's capacity is doubled, but it never grows beyond max_capacity.
  • max_hbm_for_vectors: The maximum amount of HBM HKV may use to store values (vectors, embeddings). HKV does not occupy all of it at once; it allocates the memory on demand. Make sure the system can actually provide this much HBM, or the program will crash.
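
The growth rule can be summarized as the following sketch (illustrative only; the actual load-factor threshold is internal to HKV):

def next_capacity(num_keys, capacity, max_capacity, threshold=0.5):
    # Double the capacity once the load factor crosses the threshold,
    # but never grow beyond max_capacity. The 0.5 threshold here is an
    # assumption for illustration, not HKV's real value.
    if num_keys / capacity > threshold and capacity * 2 <= max_capacity:
        return capacity * 2
    return capacity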

2. Optimizer

We can also change the optimizer with --optimizer_name; sgd, adamax, adagrad, adadelta, and ftrl are currently supported, as the sketch below illustrates.
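
A flag-to-optimizer mapping for the supported choices might look like the following (a hypothetical sketch; the actual wiring in main.py may differ):

import tensorflow as tf

# Hypothetical mapping from --optimizer_name values to TF optimizer classes.
OPTIMIZERS = {
    "sgd": tf.keras.optimizers.SGD,
    "adamax": tf.keras.optimizers.Adamax,
    "adagrad": tf.keras.optimizers.Adagrad,
    "adadelta": tf.keras.optimizers.Adadelta,
    "ftrl": tf.keras.optimizers.Ftrl,
}

def make_optimizer(name, lr):
    return OPTIMIZERS[name.lower()](learning_rate=lr)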

3. DynamicVariable Configuration

3.1 Default behavior

When we choose HKV as the backend of SOK, DynamicVariable should be initialized in this way:

self._sok_embedding = sok.DynamicVariable(
    var_type="hybrid",
    dimension=self._embedding_vec_size, # 128 in Criteo Terabyte Dataset
)

By default, the init_capacity and max_capacity of HKV will both be set to 64 * 1024 * 1024, and the max_hbm_for_vectors is 16GB.
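
Once created, the variable is used like any other SOK embedding table. For example, a lookup might look like the following (a sketch assuming SOK's lookup_sparse API with ragged key tensors; the benchmark's actual code may differ):

import tensorflow as tf
import sparse_operation_kit as sok

table = sok.DynamicVariable(var_type="hybrid", dimension=128)

# One ragged row of keys per sample; "sum" pools the looked-up vectors.
keys = tf.ragged.constant([[17, 42], [7]], dtype=tf.int64)
vectors = sok.lookup_sparse(table, keys, combiners="sum")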

3.2 Customize

We can also customize the configuration of HKV:

self._sok_embedding = sok.DynamicVariable(
    var_type="hybrid",
    dimension=self._embedding_vec_size, # 128 in Criteo Terabyte Dataset
    init_capacity=1024 * 1024,
    max_capacity=1024 * 1024,
    max_hbm_for_vectors=30, # unit:GB
)

Be careful when setting max_hbm_for_vectors; three factors affect the appropriate value:

  • Total HBM size.
  • Type of optimizer.
  • Batch size.

These factors limit the HBM that is actually available to HKV; if they are not taken into account, the program risks running out of memory. Note that HKV never consumes more than it needs: when max_capacity * dimension * elementSize is less than max_hbm_for_vectors, it only uses max_capacity * dimension * elementSize bytes to store embeddings. The table below lists reference settings of max_hbm_for_vectors (in GB) for different batch sizes and optimizers:

| batch size \ optimizer | SGD | Adamax | Adagrad | Adadelta | Ftrl |
| --- | --- | --- | --- | --- | --- |
| 32768 | 60G | 20G | 35G | 20G | 20G |
| 65536 | 60G | 20G | 20G | 20G | 20G |
| 131072 | 60G | 20G | 20G | 20G | 10G |
| 262144 | 60G | 20G | 20G | 20G | 10G |
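
As a worked example of the sizing formula above, using the defaults from section 3.1 (illustrative arithmetic only):

max_capacity = 64 * 1024 * 1024   # default from section 3.1
dimension = 128                   # embedding width for Criteo Terabyte
elem_size = 4                     # bytes per float32 element

value_bytes = max_capacity * dimension * elem_size
print(value_bytes / 2**30)        # 32.0 GiB of values at full capacity

With the default max_hbm_for_vectors of 16GB, only part of these values can reside in HBM; HKV keeps the remainder in host memory.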

Performance on 8 x H100

| batch size | exit criteria | frequency of evaluation | xla | amp | training time (minutes) | evaluating time (minutes) | throughput (samples/second) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 65536 | 1 epoch | at end | no | yes | 8.79 | 0.10 | 4.16M |
| 65536 | 1 epoch | at end | yes | no | 6.72 | 0.09 | 3.45M |