Describe the bug

When I start the engine, it loads everything, but it crashes after the first prompt.

Reproduction

I use this command:

./start_linux.sh --model llama-2-7b-chat.Q8_0.gguf --share

The model was downloaded from here: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

Screenshot

No response

Logs
$ ./start_linux.sh --model llama-2-7b-chat.Q8_0.gguf --share
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
10:26:32-717305 INFO     Starting Text generation web UI
10:26:32-721474 WARNING  The gradio "share link" feature uses a proprietary executable to create a reverse tunnel. Use it with care.
10:26:32-722758 WARNING  You are potentially exposing the web UI to the entire internet without any access password. You can create one with the "--gradio-auth" flag like this: --gradio-auth username:password Make sure to replace username:password with your own.
10:26:33-523404 INFO     Loading "llama-2-7b-chat.Q8_0.gguf"
10:26:34-444027 INFO     llama.cpp weights detected: "models/llama-2-7b-chat.Q8_0.gguf"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA H200, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H200, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H200, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H200, compute capability 9.0, VMM: yes
  Device 4: NVIDIA H200, compute capability 9.0, VMM: yes
  Device 5: NVIDIA H200, compute capability 9.0, VMM: yes
  Device 6: NVIDIA H200, compute capability 9.0, VMM: yes
  Device 7: NVIDIA H200, compute capability 9.0, VMM: yes
llama_model_load_from_file: using device CUDA0 (NVIDIA H200) - 141931 MiB free
llama_model_load_from_file: using device CUDA1 (NVIDIA H200) - 141913 MiB free
llama_model_load_from_file: using device CUDA2 (NVIDIA H200) - 141913 MiB free
llama_model_load_from_file: using device CUDA3 (NVIDIA H200) - 141913 MiB free
llama_model_load_from_file: using device CUDA4 (NVIDIA H200) - 141913 MiB free
llama_model_load_from_file: using device CUDA5 (NVIDIA H200) - 141913 MiB free
llama_model_load_from_file: using device CUDA6 (NVIDIA H200) - 141913 MiB free
llama_model_load_from_file: using device CUDA7 (NVIDIA H200) - 141913 MiB free
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b-chat.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = llama
llama_model_loader: - kv  1: general.name str = LLaMA v2
llama_model_loader: - kv  2: llama.context_length u32 = 4096
llama_model_loader: - kv  3: llama.embedding_length u32 = 4096
llama_model_loader: - kv  4: llama.block_count u32 = 32
llama_model_loader: - kv  5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv  6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv  7: llama.attention.head_count u32 = 32
llama_model_loader: - kv  8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv  9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: control token: 2 '</s>' is not marked as EOG
llm_load_vocab: control token: 1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 6.67 GiB (8.50 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CUDA0 model buffer size = 1025.47 MiB
llm_load_tensors: CUDA1 model buffer size = 820.38 MiB
llm_load_tensors: CUDA2 model buffer size = 820.38 MiB
llm_load_tensors: CUDA3 model buffer size = 820.38 MiB
llm_load_tensors: CUDA4 model buffer size = 820.38 MiB
llm_load_tensors: CUDA5 model buffer size = 820.38 MiB
llm_load_tensors: CUDA6 model buffer size = 820.38 MiB
llm_load_tensors: CUDA7 model buffer size = 748.11 MiB
llm_load_tensors: CPU_Mapped model buffer size = 132.81 MiB
..................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 1: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 2: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 3: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 4: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 5: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 6: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 7: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 8: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 9: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 10: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 11: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 12: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 13: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 14: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 15: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 16: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 17: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 18: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 19: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 20: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 21: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 22: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 23: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 24: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 25: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 26: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 27: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 28: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 29: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 30: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: layer 31: n_embd_k_gqa = 4096, n_embd_v_gqa = 4096
llama_kv_cache_init: CUDA0 KV buffer size = 320.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 256.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 256.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 256.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 256.00 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 256.00 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 256.00 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 192.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 352.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 352.01 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 352.01 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 352.01 MiB
llama_new_context_with_model: CUDA4 compute buffer size = 352.01 MiB
llama_new_context_with_model: CUDA5 compute buffer size = 352.01 MiB
llama_new_context_with_model: CUDA6 compute buffer size = 352.01 MiB
llama_new_context_with_model: CUDA7 compute buffer size = 352.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 40.02 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 9
CUDA : ARCHS = 500,520,530,600,610,620,700,720,750,800,860,870,890,900 | FORCE_MMQ = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 |
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '11008', 'llama.attention.layer_norm_rms_epsilon': '0.000001', 'llama.rope.dimension_count': '128', 'llama.attention.head_count': '32', 'tokenizer.ggml.bos_token_id': '1', 'llama.block_count': '32', 'llama.attention.head_count_kv': '32', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '7'}
Using fallback chat format: llama-2
10:27:28-354925 INFO     Loaded "llama-2-7b-chat.Q8_0.gguf" in 54.83 seconds.
10:27:28-357109 INFO     LOADER: "llama.cpp"
10:27:28-358048 INFO     TRUNCATION LENGTH: 4096
10:27:28-358900 INFO     INSTRUCTION TEMPLATE: "Alpaca"
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://27a58bdff74870c9ec.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)

CUDA error: operation not supported
  current device: 0, in function ggml_backend_cuda_cpy_tensor_async at /home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2258
  cudaMemcpyPeerAsync(dst->data, cuda_ctx_dst->device, src->data, cuda_ctx_src->device, ggml_nbytes(dst), cuda_ctx_src->stream())
/home/runner/work/llama-cpp-python-cuBLAS-wheels/llama-cpp-python-cuBLAS-wheels/vendor/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:70: CUDA error
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/lib/libggml-base.so(+0x1684b)[0x7d5da623b84b]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/lib/libggml-base.so(ggml_abort+0x158)[0x7d5da623bbf8]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/lib/libggml-cuda.so(+0x5faf6)[0x7d5b27e5faf6]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/lib/libggml-cuda.so(+0x637df)[0x7d5b27e637df]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/lib/libggml-base.so(ggml_backend_sched_graph_compute_async+0x3cc)[0x7d5da625175c]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/lib/libllama.so(+0x50d50)[0x7d5d67ad7d50]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/lib/libllama.so(+0x57f4a)[0x7d5d67adef4a]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda/lib/libllama.so(llama_decode+0x2b)[0x7d5d67adfa8b]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/lib-dynload/../../libffi.so.8(+0xa052)[0x7d5e8422d052]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/lib-dynload/../../libffi.so.8(+0x8925)[0x7d5e8422b925]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/lib-dynload/../../libffi.so.8(ffi_call+0xde)[0x7d5e8422c06e]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x92e5)[0x7d5e8423d2e5]
/home/sp/text-generation-webui/installer_files/env/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x1267e)[0x7d5e8424667e]
python(_PyObject_MakeTpCall+0x27c)[0x50452c]
python(_PyEval_EvalFrameDefault+0x6a6)[0x511a76]
python[0x555ce1]
python(_PyEval_EvalFrameDefault+0x538)[0x511908]
python[0x555ce1]
python(_PyEval_EvalFrameDefault+0x538)[0x511908]
python[0x555ce1]
python(_PyEval_EvalFrameDefault+0x538)[0x511908]
python[0x5581df]
python[0x5579ce]
python(PyObject_Call+0x12c)[0x5430ac]
python(_PyEval_EvalFrameDefault+0x47c0)[0x515b90]
python(_PyFunction_Vectorcall+0x173)[0x539153]
python(_PyEval_EvalFrameDefault+0x47c0)[0x515b90]
python[0x5581df]
python[0x557a20]
python[0x62a8a3]
python[0x5fa3c4]
/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94)[0x7d5e8549ca94]
/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)[0x7d5e85529c3c]
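The abort happens inside ggml_backend_cuda_cpy_tensor_async when cudaMemcpyPeerAsync returns "operation not supported", i.e. while llama.cpp is copying a tensor directly from one GPU to another during the first decode. A quick way to confirm that the multi-GPU copy path is the trigger would be to rerun on a single GPU; this is a hedged sketch, not something from the log above, and it only assumes the same start command plus the standard CUDA_VISIBLE_DEVICES variable for masking devices:

# Hypothetical single-GPU run: with only one visible device, llama.cpp
# never issues a GPU-to-GPU peer copy, so a clean run would point at the
# cross-device copy path rather than the model or the loader.
CUDA_VISIBLE_DEVICES=0 ./start_linux.sh --model llama-2-7b-chat.Q8_0.gguf --share

The Q8_0 7B model is only 6.67 GiB, so it fits comfortably on a single H200 with about 141 GiB free.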
System Info

I run the engine on 8x H200:

$ nvidia-smi
Wed Jan 15 10:22:18 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    On  |   00000000:01:00.0 Off |                    0 |
| N/A   31C    P0             76W /  700W |     114MiB / 143771MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H200                    On  |   00000000:02:00.0 Off |                    0 |
| N/A   28C    P0             74W /  700W |     132MiB / 143771MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H200                    On  |   00000000:03:00.0 Off |                    0 |
| N/A   29C    P0             75W /  700W |     132MiB / 143771MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H200                    On  |   00000000:04:00.0 Off |                    0 |
| N/A   30C    P0             75W /  700W |     132MiB / 143771MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H200                    On  |   00000000:05:00.0 Off |                    0 |
| N/A   32C    P0             76W /  700W |     132MiB / 143771MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H200                    On  |   00000000:06:00.0 Off |                    0 |
| N/A   31C    P0             74W /  700W |     132MiB / 143771MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H200                    On  |   00000000:07:00.0 Off |                    0 |
| N/A   33C    P0             77W /  700W |     132MiB / 143771MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H200                    On  |   00000000:08:00.0 Off |                    0 |
| N/A   30C    P0             73W /  700W |     132MiB / 143771MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Everything runs inside a confidential virtual machine, and NVSwitch/NVLink is not available:

$ nvidia-smi topo -m
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0     X     PHB    PHB    PHB    PHB    PHB    PHB    PHB    0-31           0               N/A
GPU1    PHB     X     PHB    PHB    PHB    PHB    PHB    PHB    0-31           0               N/A
GPU2    PHB    PHB     X     PHB    PHB    PHB    PHB    PHB    0-31           0               N/A
GPU3    PHB    PHB    PHB     X     PHB    PHB    PHB    PHB    0-31           0               N/A
GPU4    PHB    PHB    PHB    PHB     X     PHB    PHB    PHB    0-31           0               N/A
GPU5    PHB    PHB    PHB    PHB    PHB     X     PHB    PHB    0-31           0               N/A
GPU6    PHB    PHB    PHB    PHB    PHB    PHB     X     PHB    0-31           0               N/A
GPU7    PHB    PHB    PHB    PHB    PHB    PHB    PHB     X     0-31           0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
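The topology matrix shows only PHB links between the GPUs (no NVLink), and confidential virtual machines commonly restrict direct GPU-to-GPU (peer-to-peer) transfers, so it is plausible that the peer copy ggml attempts here simply is not permitted in this environment. One hedged way to see what the driver reports is the peer-to-peer view of nvidia-smi; the -p2p option exists in recent nvidia-smi builds, but whether this driver's build accepts it is an assumption:

# Hypothetical check: print the pairwise peer-to-peer read-capability matrix.
# If the driver reports peer access as not supported between the devices,
# that matches the "operation not supported" error from cudaMemcpyPeerAsync.
nvidia-smi topo -p2p r

If peer access is indeed unavailable, that would also explain why loading succeeds (each GPU only receives its own layers from the host) while the first prompt crashes as soon as activations have to move between GPUs.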