On an RTX 4090 (24 GB) I tried to reproduce the comparison between llama.cpp and your work on Falcon 40B shown in the video. PowerInfer uses the PowerInfer/ReluFalcon-40B-PowerInfer-GGUF model and its inference output is excellent. For llama.cpp, I converted the fp16 model provided by SparseLLM/ReluFalcon-40B as follows:
python3 convert-hf-to-gguf.py ./ReluFalcon-40B --outtype f16
The converted model was then run with the following command:
./build/bin/main -m ./ReluFalcon-40B/ggml-model-f16.gguf -n 128 -t 12 -p "Once upon a time" -ngl 10 --batch-size 512 --ignore-eos
The inference output was:
system_info: n_threads = 12 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
<|endoftext|>Once upon a time the the the, the
llama_print_timings:        load time =   97188.82 ms
llama_print_timings:      sample time =       5.89 ms /     6 runs   (    0.98 ms per token,  1018.68 tokens per second)
llama_print_timings: prompt eval time =   33731.64 ms /     5 tokens (  6746.33 ms per token,     0.15 tokens per second)
llama_print_timings:        eval time =  169189.81 ms /     5 runs   (33837.96 ms per token,     0.03 tokens per second)
llama_print_timings:       total time =  211154.84 ms /    10 tokens
Repeated runs keep producing an endless stream of "the" or other strange output. The result is the same after INT4 quantization (the quantization step I used is sketched below). Is the model I am loading in llama.cpp somehow incompatible? How should I resolve this?
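For reference, the INT4 quantization mentioned above was done with llama.cpp's quantize tool, roughly as follows (a sketch of my invocation; the binary path assumes the same build as main above, and Q4_0 is the INT4 type I picked, so adjust to your setup):

# hypothetical quantization step: output filename and Q4_0 type are illustrative
./build/bin/quantize ./ReluFalcon-40B/ggml-model-f16.gguf ./ReluFalcon-40B/ggml-model-q4_0.gguf Q4_0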
Thanks!
Additional Context
$ uname -r
6.5.0-35-generic
$ nvidia-smi
NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4