Expected Behavior
Tensor split is expected to utilize multiple GPUs.
Current Behavior
When using multiple GPUs, a segmentation fault occurs right after the model finishes loading. With two Vega 56 cards installed, forcing single-GPU inference via HIP_VISIBLE_DEVICES makes inference work correctly (see the command sketch below).
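For reference, a minimal sketch of the single-GPU workaround, assuming a Makefile build of main; the model path and layer count are illustrative:
$ # Restrict ROCm to the first Vega 56 only; inference then runs without crashing
$ HIP_VISIBLE_DEVICES=0 ./main -m ./models/model-q2_k.gguf -ngl 99 -p "Hello"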
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is only reproducible under certain specific conditions.
- Physical (or virtual) hardware you are using, e.g. for Linux:
Ryzen 1700X, Vega 56 8 GB *2
- Operating System (Ubuntu LTS):
Linux jerryxu-Inspiron-5675 6.2.0-33-generic #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep  7 10:33:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
$ python3 --version
Python 3.10.13
$ make --version
GNU Make 4.3
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Failure Information (for bugs)
See the logs below.
Steps to Reproduce
- Compile llama.cpp with ROCm
- Run any model with tensor split (tried 2 quantizations of 7B and 13B); a minimal command sketch follows this list
- Segfault occurs
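A minimal sketch of the reproduction commands, assuming the Makefile-based ROCm build (LLAMA_HIPBLAS=1) available around build 1310; the model path and tensor-split ratios are illustrative:
$ make LLAMA_HIPBLAS=1 -j
$ # Split layers across both Vega 56 cards; segfaults right after the model loads
$ ./main -m ./models/model-q2_k.gguf -ngl 99 -ts 1,1 -p "Hello"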
llama.cpp log:
Log start
main: build = 1310 (1c84003)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1696299120
ggml_init_cublas: found 2 ROCm devices:
Device 0: Radeon RX Vega, compute capability 9.0
Device 1: Radeon RX Vega, compute capability 9.0
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 76.38 MB
llama_new_context_with_model: VRAM scratch buffer: 70.50 MB
llama_new_context_with_model: total VRAM used: 4801.43 MB (model: 4474.93 MB, context: 326.50 MB)
Segmentation fault (core dumped)
GDB stack trace at the segfault:
#0 0x00007ffff672582e in ?? () from /opt/rocm/lib/libamdhip64.so.5
#1 0x00007ffff672dba0 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#2 0x00007ffff672fc6d in ?? () from /opt/rocm/lib/libamdhip64.so.5
#3 0x00007ffff66f8a44 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#4 0x00007ffff65688e7 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#5 0x00007ffff65689e5 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#6 0x00007ffff6568ae0 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#7 0x00007ffff65ac7a2 in hipMemcpy2DAsync () from /opt/rocm/lib/libamdhip64.so.5
#8 0x00005555556917e6 in ggml_cuda_op_mul_mat (src0=0x7ffd240e06b0, src1=0x7f8ab9ea0860, dst=0x7f8ab9ea09b0,
op=0x55555569f330 <ggml_cuda_op_mul_mat_q(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, ihipStream_t* const&)>, convert_src1_to_q8_1=true)
at ggml-cuda.cu:6706
#9 0x000055555568cc45 in ggml_cuda_mul_mat (src0=0x7ffd240e06b0, src1=0x7f8ab9ea0860, dst=0x7f8ab9ea09b0) at ggml-cuda.cu:6895
#10 0x000055555568c754 in ggml_cuda_compute_forward (params=0x7ffffffebbb0, tensor=0x7f8ab9ea09b0) at ggml-cuda.cu:7388
#11 0x00005555555b4d1d in ggml_compute_forward (params=0x7ffffffebbb0, tensor=0x7f8ab9ea09b0) at ggml.c:16214
#12 0x00005555555b9a94 in ggml_graph_compute_thread (data=0x7ffffffebc00) at ggml.c:17911
#13 0x00005555555bb123 in ggml_graph_compute (cgraph=0x7f8ab9e00020, cplan=0x7ffffffebd00) at ggml.c:18440
#14 0x00005555555c72aa in ggml_graph_compute_helper (buf=std::vector of length 25112, capacity 25112 = {...}, graph=0x7f8ab9e00020, n_threads=1) at llama.cpp:478
#15 0x00005555555da79f in llama_decode_internal (lctx=..., batch=...) at llama.cpp:4144
#16 0x00005555555e6d41 in llama_decode (ctx=0x5555628ba020, batch=...) at llama.cpp:7454
#17 0x0000555555665dcf in llama_init_from_gpt_params (params=...) at common/common.cpp:845
#18 0x0000555555567b32 in main (argc=8, argv=0x7fffffffde08) at examples/main/main.cpp:181