Problem
At runtime the program fails with: CUDA error 1 at /root/PowerInfer/ggml-cuda.cu:8949: invalid argument. All dependencies are satisfied. Could you suggest how to resolve this? Thanks.
Configuration
CPU: Intel(R) Xeon(R) Platinum 8474C
GPU: NVIDIA GeForce RTX 4090 D
CUDA:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
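One thing that may be worth checking, though it is not confirmed as the cause of the error: the configure output further down reports "Using CUDA architectures: 52;61;70", while the RTX 4090 D is an Ada Lovelace GPU with compute capability 8.9, which is not in that list. Below is a minimal Python sketch for comparing the two. The helper names parse_cuda_archs and has_native_arch are hypothetical and not part of PowerInfer. The check only looks for an exact match in the list and does not decide whether PTX JIT compilation would cover the device.

```python
import re


def parse_cuda_archs(configure_log):
    """Return the architectures listed on a 'Using CUDA architectures: ...' line."""
    match = re.search(r"Using CUDA architectures:\s*([0-9;]+)", configure_log)
    if match is None:
        return []
    return [int(arch) for arch in match.group(1).split(";") if arch]


def has_native_arch(archs, major, minor):
    """True if the list contains the device's exact compute capability."""
    return (major * 10 + minor) in archs


# Line copied from the configure output in this post.
log_line = "-- Using CUDA architectures: 52;61;70"
archs = parse_cuda_archs(log_line)

# Compute capability 8.9 is assumed here for the RTX 4090 D.
print(archs, has_native_arch(archs, 8, 9))
```

If the device's architecture is missing from the list, reconfiguring with CMake's standard CMAKE_CUDA_ARCHITECTURES variable (for example -DCMAKE_CUDA_ARCHITECTURES=89) might be worth trying. That PowerInfer's CMakeLists.txt honors this variable is an assumption and should be verified against the repository.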
Build process
(Powerinfer2) root@autodl-container-02b744a905-865b3ab4:~/PowerInfer# cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.1.105")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 12.1.105
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
GNU ld (GNU Binutils for Ubuntu) 2.38
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done
-- Generating done
-- Build files have been written to: /root/PowerInfer/build
[ 1%] Building C object CMakeFiles/ggml.dir/ggml.c.o
/root/PowerInfer/ggml.c: In function ‘ggml_get_n_tasks’:
/root/PowerInfer/ggml.c:2006:24: warning: array subscript 71 is above array bounds of ‘const char *[70]’ [-Warray-bounds]
2006 | return GGML_OP_NAME[op];
| ~~~~~~~~~~~~^~~~
/root/PowerInfer/ggml.c:1586:21: note: while referencing ‘GGML_OP_NAME’
1586 | static const char * GGML_OP_NAME[GGML_OP_COUNT] = {
| ^~~~~~~~~~~~
In file included from /usr/include/stdio.h:894,
from /root/PowerInfer/ggml.c:21:
In function ‘printf’,
inlined from ‘ggml_graph_print’ at /root/PowerInfer/ggml.c:18011:9:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:112:10: warning: ‘%16s’ directive argument is null [-Wformat-overflow=]
112 | return __printf_chk (__USE_FORTIFY_LEVEL - 1, __fmt, __va_arg_pack ());
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 2%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 3%] Building C object CMakeFiles/ggml.dir/ggml-backend.c.o
[ 4%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
/root/PowerInfer/ggml-quants.c: In function ‘ggml_axpy_q4_0_q8_0’:
/root/PowerInfer/ggml-quants.c:2457:54: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
2457 | __m256 by = _mm256_loadu_ps((const __m256 *)((char *)vy+i*128));
| ^
/root/PowerInfer/ggml-quants.c:2457:37: warning: passing argument 1 of ‘_mm256_loadu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
2457 | __m256 by = _mm256_loadu_ps((const __m256 *)((char *)vy+i*128));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| const __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
from /root/PowerInfer/ggml-impl.h:74,
from /root/PowerInfer/ggml-quants.h:3,
from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:903:31: note: expected ‘const float *’ but argument is of type ‘const __m256 *’
903 | _mm256_loadu_ps (float const *__P)
| ~~~~~~~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2460:36: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
2460 | _mm256_storeu_ps((__m256*)((char*)vz + i*128), by);
| ^
/root/PowerInfer/ggml-quants.c:2460:26: warning: passing argument 1 of ‘_mm256_storeu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
2460 | _mm256_storeu_ps((__m256*)((char*)vz + i*128), by);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
from /root/PowerInfer/ggml-impl.h:74,
from /root/PowerInfer/ggml-quants.h:3,
from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:909:26: note: expected ‘float *’ but argument is of type ‘__m256 *’
909 | _mm256_storeu_ps (float *__P, __m256 __A)
| ~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2467:47: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
2467 | by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+32));
| ^
/root/PowerInfer/ggml-quants.c:2467:30: warning: passing argument 1 of ‘_mm256_loadu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
2467 | by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+32));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| const __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
from /root/PowerInfer/ggml-impl.h:74,
from /root/PowerInfer/ggml-quants.h:3,
from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:903:31: note: expected ‘const float *’ but argument is of type ‘const __m256 *’
903 | _mm256_loadu_ps (float const *__P)
| ~~~~~~~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2469:36: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
2469 | _mm256_storeu_ps((__m256*)((char*)vz + i*128+32), by);
| ^
/root/PowerInfer/ggml-quants.c:2469:26: warning: passing argument 1 of ‘_mm256_storeu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
2469 | _mm256_storeu_ps((__m256*)((char*)vz + i*128+32), by);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
from /root/PowerInfer/ggml-impl.h:74,
from /root/PowerInfer/ggml-quants.h:3,
from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:909:26: note: expected ‘float *’ but argument is of type ‘__m256 *’
909 | _mm256_storeu_ps (float *__P, __m256 __A)
| ~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2479:47: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
2479 | by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+64));
| ^
/root/PowerInfer/ggml-quants.c:2479:30: warning: passing argument 1 of ‘_mm256_loadu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
2479 | by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+64));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| const __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
from /root/PowerInfer/ggml-impl.h:74,
from /root/PowerInfer/ggml-quants.h:3,
from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:903:31: note: expected ‘const float *’ but argument is of type ‘const __m256 *’
903 | _mm256_loadu_ps (float const *__P)
| ~~~~~~~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2482:36: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
2482 | _mm256_storeu_ps((__m256*)((char*)vz + i*128+64), by);
| ^
/root/PowerInfer/ggml-quants.c:2482:26: warning: passing argument 1 of ‘_mm256_storeu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
2482 | _mm256_storeu_ps((__m256*)((char*)vz + i*128+64), by);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
from /root/PowerInfer/ggml-impl.h:74,
from /root/PowerInfer/ggml-quants.h:3,
from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:909:26: note: expected ‘float *’ but argument is of type ‘__m256 *’
909 | _mm256_storeu_ps (float *__P, __m256 __A)
| ~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2489:47: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
2489 | by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+96));
| ^
/root/PowerInfer/ggml-quants.c:2489:30: warning: passing argument 1 of ‘_mm256_loadu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
2489 | by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+96));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| const __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
from /root/PowerInfer/ggml-impl.h:74,
from /root/PowerInfer/ggml-quants.h:3,
from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:903:31: note: expected ‘const float *’ but argument is of type ‘const __m256 *’
903 | _mm256_loadu_ps (float const *__P)
| ~~~~~~~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2491:36: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
2491 | _mm256_storeu_ps((__m256*)((char*)vz + i*128+96), by);
| ^
/root/PowerInfer/ggml-quants.c:2491:26: warning: passing argument 1 of ‘_mm256_storeu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
2491 | _mm256_storeu_ps((__m256*)((char*)vz + i*128+96), by);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
from /root/PowerInfer/ggml-impl.h:74,
from /root/PowerInfer/ggml-quants.h:3,
from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:909:26: note: expected ‘float *’ but argument is of type ‘__m256 *’
909 | _mm256_storeu_ps (float *__P, __m256 __A)
| ~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2435:12: warning: unused variable ‘acc’ [-Wunused-variable]
2435 | __m256 acc = _mm256_setzero_ps();
| ^~~
[ 5%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda.cu.o
/root/PowerInfer/ggml-cuda.cu(6717): warning #177-D: variable "ne0" was declared but never referenced
const int64_t ne0 = src->ne[0];
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/root/PowerInfer/ggml-cuda.cu(6979): warning #177-D: variable "nrows_dst" was declared but never referenced
const int64_t nrows_dst = dst->backend == GGML_BACKEND_GPU && id == g_main_device ? ne0 : row_diff;
^
/root/PowerInfer/ggml-cuda.cu(7204): warning #177-D: variable "ne10" was declared but never referenced
const int64_t ne10 = src1->ne[1];
^
/root/PowerInfer/ggml-cuda.cu(7226): warning #177-D: variable "src1_dfloat" was declared but never referenced
const dfloat * src1_dfloat = (const dfloat *) src1_ddf_i;
^
/root/PowerInfer/ggml-cuda.cu(7255): warning #177-D: variable "ne10" was declared but never referenced
const int64_t ne10 = src1->ne[1];
^
/root/PowerInfer/ggml-cuda.cu(7357): warning #177-D: variable "predict_idx" was declared but never referenced
int predict_idx = idx;
^
/root/PowerInfer/ggml-cuda.cu(8430): warning #177-D: variable "ne01" was declared but never referenced
const int64_t ne01 = src0->ne[1];
^
/root/PowerInfer/ggml-cuda.cu(8773): warning #177-D: variable "all_on_device" was declared but never referenced
bool all_on_device = (src0->backend == GGML_BACKEND_GPU || src0->backend == GGML_BACKEND_GPU_SPLIT) &&
^
/root/PowerInfer/ggml-cuda.cu(4421): warning #177-D: variable "bid" was declared but never referenced
const int bid = blockIdx.y;
^
/root/PowerInfer/ggml-cuda.cu(4549): warning #177-D: variable "bid" was declared but never referenced
const int bid = blockIdx.y;
^
/root/PowerInfer/ggml-cuda.cu(4484): warning #177-D: variable "d" was declared but never referenced
short *d = (short *)((char *)vx + ncols * gpu_row * 2);
^
/root/PowerInfer/ggml-cuda.cu(4492): warning #177-D: variable "bid" was declared but never referenced
const int bid = blockIdx.y;
^
/root/PowerInfer/ggml-cuda.cu(579): warning #177-D: function "sigmoid_f32" was declared but never referenced
void sigmoid_f32(const float * x, float * dst, const int k) {
^
/root/PowerInfer/ggml-cuda.cu(5353): warning #177-D: function "dequantize_mul_mat_vec_q4_0_cuda_sparse" was declared but never referenced
static void dequantize_mul_mat_vec_q4_0_cuda_sparse(const void * vx, const dfloat * y, float * dst, const int ncols, const int nrows, cudaStream_t stream, int *lst, float *idx) {
^
/root/PowerInfer/ggml-cuda.cu(6717): warning #177-D: variable "ne0" was declared but never referenced
const int64_t ne0 = src->ne[0];
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/root/PowerInfer/ggml-cuda.cu(6979): warning #177-D: variable "nrows_dst" was declared but never referenced
const int64_t nrows_dst = dst->backend == GGML_BACKEND_GPU && id == g_main_device ? ne0 : row_diff;
^
/root/PowerInfer/ggml-cuda.cu(7204): warning #177-D: variable "ne10" was declared but never referenced
const int64_t ne10 = src1->ne[1];
^
/root/PowerInfer/ggml-cuda.cu(7226): warning #177-D: variable "src1_dfloat" was declared but never referenced
const dfloat * src1_dfloat = (const dfloat *) src1_ddf_i;
^
/root/PowerInfer/ggml-cuda.cu(7255): warning #177-D: variable "ne10" was declared but never referenced
const int64_t ne10 = src1->ne[1];
^
/root/PowerInfer/ggml-cuda.cu(7357): warning #177-D: variable "predict_idx" was declared but never referenced
int predict_idx = idx;
^
/root/PowerInfer/ggml-cuda.cu(8430): warning #177-D: variable "ne01" was declared but never referenced
const int64_t ne01 = src0->ne[1];
^
/root/PowerInfer/ggml-cuda.cu(8773): warning #177-D: variable "all_on_device" was declared but never referenced
bool all_on_device = (src0->backend == GGML_BACKEND_GPU || src0->backend == GGML_BACKEND_GPU_SPLIT) &&
^
/root/PowerInfer/ggml-cuda.cu(4421): warning #177-D: variable "bid" was declared but never referenced
const int bid = blockIdx.y;
^
/root/PowerInfer/ggml-cuda.cu(4549): warning #177-D: variable "bid" was declared but never referenced
const int bid = blockIdx.y;
^
/root/PowerInfer/ggml-cuda.cu(4484): warning #177-D: variable "d" was declared but never referenced
short *d = (short *)((char *)vx + ncols * gpu_row * 2);
^
/root/PowerInfer/ggml-cuda.cu(4492): warning #177-D: variable "bid" was declared but never referenced
const int bid = blockIdx.y;
^
/root/PowerInfer/ggml-cuda.cu(579): warning #177-D: function "sigmoid_f32" was declared but never referenced
void sigmoid_f32(const float * x, float * dst, const int k) {
^
/root/PowerInfer/ggml-cuda.cu(5353): warning #177-D: function "dequantize_mul_mat_vec_q4_0_cuda_sparse" was declared but never referenced
static void dequantize_mul_mat_vec_q4_0_cuda_sparse(const void * vx, const dfloat * y, float * dst, const int ncols, const int nrows, cudaStream_t stream, int *lst, float *idx) {
^
/root/PowerInfer/ggml-cuda.cu(6717): warning #177-D: variable "ne0" was declared but never referenced
const int64_t ne0 = src->ne[0];
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/root/PowerInfer/ggml-cuda.cu(6979): warning #177-D: variable "nrows_dst" was declared but never referenced
const int64_t nrows_dst = dst->backend == GGML_BACKEND_GPU && id == g_main_device ? ne0 : row_diff;
^
/root/PowerInfer/ggml-cuda.cu(7204): warning #177-D: variable "ne10" was declared but never referenced
const int64_t ne10 = src1->ne[1];
^
/root/PowerInfer/ggml-cuda.cu(7226): warning #177-D: variable "src1_dfloat" was declared but never referenced
const dfloat * src1_dfloat = (const dfloat *) src1_ddf_i;
^
/root/PowerInfer/ggml-cuda.cu(7255): warning #177-D: variable "ne10" was declared but never referenced
const int64_t ne10 = src1->ne[1];
^
/root/PowerInfer/ggml-cuda.cu(7357): warning #177-D: variable "predict_idx" was declared but never referenced
int predict_idx = idx;
^
/root/PowerInfer/ggml-cuda.cu(8430): warning #177-D: variable "ne01" was declared but never referenced
const int64_t ne01 = src0->ne[1];
^
/root/PowerInfer/ggml-cuda.cu(8773): warning #177-D: variable "all_on_device" was declared but never referenced
bool all_on_device = (src0->backend == GGML_BACKEND_GPU || src0->backend == GGML_BACKEND_GPU_SPLIT) &&
^
/root/PowerInfer/ggml-cuda.cu(4421): warning #177-D: variable "bid" was declared but never referenced
const int bid = blockIdx.y;
^
/root/PowerInfer/ggml-cuda.cu(4549): warning #177-D: variable "bid" was declared but never referenced
const int bid = blockIdx.y;
^
/root/PowerInfer/ggml-cuda.cu(4484): warning #177-D: variable "d" was declared but never referenced
short *d = (short *)((char *)vx + ncols * gpu_row * 2);
^
/root/PowerInfer/ggml-cuda.cu(4492): warning #177-D: variable "bid" was declared but never referenced
const int bid = blockIdx.y;
^
/root/PowerInfer/ggml-cuda.cu(579): warning #177-D: function "sigmoid_f32" was declared but never referenced
void sigmoid_f32(const float * x, float * dst, const int k) {
^
/root/PowerInfer/ggml-cuda.cu(5353): warning #177-D: function "dequantize_mul_mat_vec_q4_0_cuda_sparse" was declared but never referenced
static void dequantize_mul_mat_vec_q4_0_cuda_sparse(const void * vx, const dfloat * y, float * dst, const int ncols, const int nrows, cudaStream_t stream, int *lst, float *idx) {
^
/root/PowerInfer/ggml-cuda.cu: In function ‘void ggml_cuda_op_mul_mat_batch_sparse(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, const char*, const float*, const char*, float*, int64_t, int64_t, int64_t, int64_t, CUstream_st* const&)’:
/root/PowerInfer/ggml-cuda.cu:6962:1: warning: unused parameter ‘src1_ddq_i’ [-Wunused-parameter]
6961 | const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * src0_dd_i, const float * src1_ddf_i,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~
6962 | const char * src1_ddq_i, float * dst_dd_i, const int64_t row_low, const int64_t row_high, const int64_t src1_ncols,
| ^ ~~~~~~
/root/PowerInfer/ggml-cuda.cu:6963:1: warning: unused parameter ‘src1_padded_row_size’ [-Wunused-parameter]
6962 | const char * src1_ddq_i, float * dst_dd_i, const int64_t row_low, const int64_t row_high, const int64_t src1_ncols,
| ~~~~~~~~~~~~~~~~~
6963 | const int64_t src1_padded_row_size, const cudaStream_t & stream) {
| ^ ~~~~~~~~~~~~~~~~
/root/PowerInfer/ggml-cuda.cu: In function ‘void ggml_cuda_op_mul_mat_vec_sparse_dequantized(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, const char*, const float*, const char*, float*, int64_t, int64_t, int64_t, int64_t, CUstream_st* const&)’:
/root/PowerInfer/ggml-cuda.cu:7251:1: warning: unused parameter ‘src1_ddq_i’ [-Wunused-parameter]
7250 | const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * src0_dd_i, const float * src1_ddf_i,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~
7251 | const char * src1_ddq_i, float * dst_dd_i, const int64_t row_low, const int64_t row_high, const int64_t src1_ncols,
| ^ ~~~~~~
/root/PowerInfer/ggml-cuda.cu: In function ‘void ggml_cuda_op_mul_mat_transpose_select_gemm(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, const char*, const float*, const char*, float*, int64_t, int64_t, int64_t, int64_t, CUstream_st* const&)’:
/root/PowerInfer/ggml-cuda.cu:7452:91: warning: cast from type ‘const float*’ to type ‘float*’ casts away qualifiers [-Wcast-qual]
7452 | transpose_cont<<< numBlocks, blockSize, 0, stream>>>((float *)src0_ddf_i, transpose, ne00, ne01, 1, ne00, ne01,NULL);
| ^~~~~~~~~~~~~~~~~~~
/root/PowerInfer/ggml-cuda.cu: At global scope:
/root/PowerInfer/ggml-cuda.cu:8771:6: warning: no previous declaration for ‘void ggml_cuda_axpy(const ggml_tensor*, const ggml_tensor*, ggml_tensor*)’ [-Wmissing-declarations]
8771 | void ggml_cuda_axpy(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
| ^~~~~~~~~~~~~~
/root/PowerInfer/ggml-cuda.cu: In function ‘void ggml_cuda_op_mul_mat_transpose_gemm(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, const char*, const float*, const char*, float*, int64_t, int64_t, int64_t, int64_t, CUstream_st* const&)’:
/root/PowerInfer/ggml-cuda.cu:7550:20: warning: ‘src0_ddq_as_f32’ may be used uninitialized in this function [-Wmaybe-uninitialized]
7550 | ggml_cuda_pool_free(src0_ddq_as_f32, src0_as);
| ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/ggml-cuda.cu:7501:8: note: ‘src0_ddq_as_f32’ was declared here
7501 | float * src0_ddq_as_f32;
| ^~~~~~~~~~~~~~~
[ 5%] Built target ggml
[ 6%] Linking CUDA static library libggml_static.a
[ 6%] Built target ggml_static
[ 7%] Building CXX object CMakeFiles/llama.dir/llama.cpp.o
/root/PowerInfer/llama.cpp:632:26: warning: no previous declaration for ‘tensor_offloading_levels get_offloading_level(llm_tensor)’ [-Wmissing-declarations]
632 | tensor_offloading_levels get_offloading_level(llm_tensor tensor) {
| ^~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/llama.cpp: In function ‘int64_t sum_gpu_index(ggml_tensor*)’:
/root/PowerInfer/llama.cpp:2722:39: warning: missing initializer for member ‘ggml_init_params::mem_buffer’ [-Wmissing-field-initializers]
2722 | ggml_context * ctx_aux = ggml_init({
| ~~~~~~~~~^~
2723 | /* mem_size */ 1 << 10,
| ~~~~~~~~~~~~~~~~~~~~~~~
2724 | });
| ~~
/root/PowerInfer/llama.cpp:2722:39: warning: missing initializer for member ‘ggml_init_params::no_alloc’ [-Wmissing-field-initializers]
/root/PowerInfer/llama.cpp: In lambda function:
/root/PowerInfer/llama.cpp:2805:47: warning: unused parameter ‘progress’ [-Wunused-parameter]
2805 | llama_progress_callback cb = [](float progress, void *ctx) {
| ~~~~~~^~~~~~~~
/root/PowerInfer/llama.cpp:2805:63: warning: unused parameter ‘ctx’ [-Wunused-parameter]
2805 | llama_progress_callback cb = [](float progress, void *ctx) {
| ~~~~~~^~~
/root/PowerInfer/llama.cpp: In member function ‘size_t llama_augmentation_model_loader::slice_ffn_mat_to_gpu(llama_layer&)’:
/root/PowerInfer/llama.cpp:2909:23: warning: unused variable ‘gpu_idx’ [-Wunused-variable]
2909 | ggml_tensor * gpu_idx = layer.gpu_idx;
| ^~~~~~~
/root/PowerInfer/llama.cpp: In function ‘void llm_load_sparse_model_tensors(llama_model_loader&, llama_model&, const llama_context_params*, int, long int, bool, bool, bool, llama_progress_callback, void*)’:
/root/PowerInfer/llama.cpp:3165:28: warning: variable ‘llama_backend_offload’ set but not used [-Wunused-but-set-variable]
3165 | enum ggml_backend_type llama_backend_offload = GGML_BACKEND_CPU;
| ^~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/llama.cpp:3166:28: warning: variable ‘llama_backend_offload_split’ set but not used [-Wunused-but-set-variable]
3166 | enum ggml_backend_type llama_backend_offload_split = GGML_BACKEND_CPU;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/llama.cpp: In function ‘void llama_reserve_model_kv_cache(llama_model*, const llama_context_params*)’:
/root/PowerInfer/llama.cpp:3319:29: warning: comparison of integer expressions of different signedness: ‘int’ and ‘unsigned int’ [-Wsign-compare]
3319 | if (model->n_gpu_layers < hparams.n_layer + 1) {
| ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/llama.cpp: In function ‘std::pair<ggml_tensor*, ggml_tensor*> llm_build_kv_store(ggml_context*, const llama_hparams&, const llama_kv_cache&, ggml_cgraph*, ggml_tensor*, ggml_tensor*, int64_t, int32_t, int32_t, const llm_build_cb&, int64_t)’:
/root/PowerInfer/llama.cpp:4232:31: warning: unused parameter ‘graph’ [-Wunused-parameter]
4232 | struct ggml_cgraph * graph,
| ~~~~~~~~~~~~~~~~~~~~~^~~~~
/root/PowerInfer/llama.cpp: In lambda function:
/root/PowerInfer/llama.cpp:4677:88: warning: unused parameter ‘nl’ [-Wunused-parameter]
4677 | const llm_build_cb no_offload_cb = [](struct ggml_tensor * cur, const char * name, int nl) {
| ~~~~^~
/root/PowerInfer/llama.cpp: In function ‘int llama_decode_internal(llama_context&, llama_batch)’:
/root/PowerInfer/llama.cpp:6592:16: warning: unused variable ‘full_offload_supported’ [-Wunused-variable]
6592 | const bool full_offload_supported =
| ^~~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/llama.cpp: In function ‘llama_model_params llama_model_default_params()’:
/root/PowerInfer/llama.cpp:9400:5: warning: missing initializer for member ‘llama_model_params::reset_gpu_index’ [-Wmissing-field-initializers]
9400 | };
| ^
/root/PowerInfer/llama.cpp:9400:5: warning: missing initializer for member ‘llama_model_params::disable_gpu_index’ [-Wmissing-field-initializers]
[ 8%] Linking CXX static library libllama.a
[ 8%] Built target llama
[ 9%] Generating build details from Git
-- Found Git: /usr/bin/git (found version "2.34.1")
[ 10%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 10%] Built target build_info
[ 12%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 13%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 14%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 15%] Building CXX object common/CMakeFiles/common.dir/grammar-parser.cpp.o
[ 16%] Building CXX object common/CMakeFiles/common.dir/train.cpp.o
[ 17%] Linking CXX static library libcommon.a
[ 17%] Built target common
[ 18%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/test-quantize-fns.cpp.o
[ 19%] Linking CXX executable ../bin/test-quantize-fns
[ 19%] Built target test-quantize-fns
[ 20%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/test-quantize-perf.cpp.o
[ 21%] Linking CXX executable ../bin/test-quantize-perf
[ 21%] Built target test-quantize-perf
[ 23%] Building CXX object tests/CMakeFiles/test-sampling.dir/test-sampling.cpp.o
[ 24%] Linking CXX executable ../bin/test-sampling
[ 24%] Built target test-sampling
[ 25%] Building CXX object tests/CMakeFiles/test-tokenizer-0-llama.dir/test-tokenizer-0-llama.cpp.o
[ 26%] Linking CXX executable ../bin/test-tokenizer-0-llama
[ 26%] Built target test-tokenizer-0-llama
[ 27%] Building CXX object tests/CMakeFiles/test-tokenizer-0-falcon.dir/test-tokenizer-0-falcon.cpp.o
[ 28%] Linking CXX executable ../bin/test-tokenizer-0-falcon
[ 28%] Built target test-tokenizer-0-falcon
[ 29%] Building CXX object tests/CMakeFiles/test-tokenizer-1-llama.dir/test-tokenizer-1-llama.cpp.o
[ 30%] Linking CXX executable ../bin/test-tokenizer-1-llama
[ 30%] Built target test-tokenizer-1-llama
[ 31%] Building CXX object tests/CMakeFiles/test-tokenizer-1-bpe.dir/test-tokenizer-1-bpe.cpp.o
[ 32%] Linking CXX executable ../bin/test-tokenizer-1-bpe
[ 32%] Built target test-tokenizer-1-bpe
[ 34%] Building CXX object tests/CMakeFiles/test-grammar-parser.dir/test-grammar-parser.cpp.o
[ 35%] Linking CXX executable ../bin/test-grammar-parser
[ 35%] Built target test-grammar-parser
[ 36%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/test-llama-grammar.cpp.o
In file included from /root/PowerInfer/tests/test-llama-grammar.cpp:5:
/root/PowerInfer/./llama.cpp:632:26: warning: no previous declaration for ‘tensor_offloading_levels get_offloading_level(llm_tensor)’ [-Wmissing-declarations]
632 | tensor_offloading_levels get_offloading_level(llm_tensor tensor) {
| ^~~~~~~~~~~~~~~~~~~~
In file included from /root/PowerInfer/tests/test-llama-grammar.cpp:5:
/root/PowerInfer/./llama.cpp: In function ‘int64_t sum_gpu_index(ggml_tensor*)’:
/root/PowerInfer/./llama.cpp:2722:39: warning: missing initializer for member ‘ggml_init_params::mem_buffer’ [-Wmissing-field-initializers]
2722 | ggml_context * ctx_aux = ggml_init({
| ~~~~~~~~~^~
2723 | /* mem_size */ 1 << 10,
| ~~~~~~~~~~~~~~~~~~~~~~~
2724 | });
| ~~
/root/PowerInfer/./llama.cpp:2722:39: warning: missing initializer for member ‘ggml_init_params::no_alloc’ [-Wmissing-field-initializers]
/root/PowerInfer/./llama.cpp: In lambda function:
/root/PowerInfer/./llama.cpp:2805:47: warning: unused parameter ‘progress’ [-Wunused-parameter]
2805 | llama_progress_callback cb = [](float progress, void *ctx) {
| ~~~~~~^~~~~~~~
/root/PowerInfer/./llama.cpp:2805:63: warning: unused parameter ‘ctx’ [-Wunused-parameter]
2805 | llama_progress_callback cb = [](float progress, void *ctx) {
| ~~~~~~^~~
/root/PowerInfer/./llama.cpp: In member function ‘size_t llama_augmentation_model_loader::slice_ffn_mat_to_gpu(llama_layer&)’:
/root/PowerInfer/./llama.cpp:2909:23: warning: unused variable ‘gpu_idx’ [-Wunused-variable]
2909 | ggml_tensor * gpu_idx = layer.gpu_idx;
| ^~~~~~~
/root/PowerInfer/./llama.cpp: In function ‘void llm_load_sparse_model_tensors(llama_model_loader&, llama_model&, const llama_context_params*, int, long int, bool, bool, bool, llama_progress_callback, void*)’:
/root/PowerInfer/./llama.cpp:3165:28: warning: variable ‘llama_backend_offload’ set but not used [-Wunused-but-set-variable]
3165 | enum ggml_backend_type llama_backend_offload = GGML_BACKEND_CPU;
| ^~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/./llama.cpp:3166:28: warning: variable ‘llama_backend_offload_split’ set but not used [-Wunused-but-set-variable]
3166 | enum ggml_backend_type llama_backend_offload_split = GGML_BACKEND_CPU;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/./llama.cpp: In function ‘void llama_reserve_model_kv_cache(llama_model*, const llama_context_params*)’:
/root/PowerInfer/./llama.cpp:3319:29: warning: comparison of integer expressions of different signedness: ‘int’ and ‘unsigned int’ [-Wsign-compare]
3319 | if (model->n_gpu_layers < hparams.n_layer + 1) {
| ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/./llama.cpp: In function ‘std::pair<ggml_tensor*, ggml_tensor*> llm_build_kv_store(ggml_context*, const llama_hparams&, const llama_kv_cache&, ggml_cgraph*, ggml_tensor*, ggml_tensor*, int64_t, int32_t, int32_t, const llm_build_cb&, int64_t)’:
/root/PowerInfer/./llama.cpp:4232:31: warning: unused parameter ‘graph’ [-Wunused-parameter]
4232 | struct ggml_cgraph * graph,
| ~~~~~~~~~~~~~~~~~~~~~^~~~~
/root/PowerInfer/./llama.cpp: In lambda function:
/root/PowerInfer/./llama.cpp:4677:88: warning: unused parameter ‘nl’ [-Wunused-parameter]
4677 | const llm_build_cb no_offload_cb = [](struct ggml_tensor * cur, const char * name, int nl) {
| ~~~~^~
/root/PowerInfer/./llama.cpp: In function ‘int llama_decode_internal(llama_context&, llama_batch)’:
/root/PowerInfer/./llama.cpp:6592:16: warning: unused variable ‘full_offload_supported’ [-Wunused-variable]
6592 | const bool full_offload_supported =
| ^~~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/./llama.cpp: In function ‘llama_model_params llama_model_default_params()’:
/root/PowerInfer/./llama.cpp:9400:5: warning: missing initializer for member ‘llama_model_params::reset_gpu_index’ [-Wmissing-field-initializers]
9400 | };
| ^
/root/PowerInfer/./llama.cpp:9400:5: warning: missing initializer for member ‘llama_model_params::disable_gpu_index’ [-Wmissing-field-initializers]
[ 37%] Linking CXX executable ../bin/test-llama-grammar
[ 37%] Built target test-llama-grammar
[ 38%] Building CXX object tests/CMakeFiles/test-grad0.dir/test-grad0.cpp.o
[ 39%] Linking CXX executable ../bin/test-grad0
[ 39%] Built target test-grad0
[ 40%] Building CXX object tests/CMakeFiles/test-rope.dir/test-rope.cpp.o
[ 41%] Linking CXX executable ../bin/test-rope
[ 41%] Built target test-rope
[ 42%] Building C object tests/CMakeFiles/test-c.dir/test-c.c.o
[ 43%] Linking CXX executable ../bin/test-c
[ 43%] Built target test-c
[ 45%] Building CXX object examples/baby-llama/CMakeFiles/baby-llama.dir/baby-llama.cpp.o
[ 46%] Linking CXX executable ../../bin/baby-llama
[ 46%] Built target baby-llama
[ 47%] Building CXX object examples/batched/CMakeFiles/batched.dir/batched.cpp.o
[ 48%] Linking CXX executable ../../bin/batched
[ 48%] Built target batched
[ 49%] Building CXX object examples/batched-bench/CMakeFiles/batched-bench.dir/batched-bench.cpp.o
[ 50%] Linking CXX executable ../../bin/batched-bench
[ 50%] Built target batched-bench
[ 51%] Building CXX object examples/beam-search/CMakeFiles/beam-search.dir/beam-search.cpp.o
[ 52%] Linking CXX executable ../../bin/beam-search
[ 52%] Built target beam-search
[ 53%] Building CXX object examples/benchmark/CMakeFiles/benchmark.dir/benchmark-matmult.cpp.o
[ 54%] Linking CXX executable ../../bin/benchmark
[ 54%] Built target benchmark
[ 56%] Building CXX object examples/convert-llama2c-to-ggml/CMakeFiles/convert-llama2c-to-ggml.dir/convert-llama2c-to-ggml.cpp.o
[ 57%] Linking CXX executable ../../bin/convert-llama2c-to-ggml
[ 57%] Built target convert-llama2c-to-ggml
[ 58%] Building CXX object examples/embedding/CMakeFiles/embedding.dir/embedding.cpp.o
[ 59%] Linking CXX executable ../../bin/embedding
[ 59%] Built target embedding
[ 60%] Building CXX object examples/finetune/CMakeFiles/finetune.dir/finetune.cpp.o
[ 61%] Linking CXX executable ../../bin/finetune
[ 61%] Built target finetune
[ 62%] Building CXX object examples/infill/CMakeFiles/infill.dir/infill.cpp.o
[ 63%] Linking CXX executable ../../bin/infill
[ 63%] Built target infill
[ 64%] Building CXX object examples/llama-bench/CMakeFiles/llama-bench.dir/llama-bench.cpp.o
[ 65%] Linking CXX executable ../../bin/llama-bench
[ 65%] Built target llama-bench
[ 67%] Building CXX object examples/llava/CMakeFiles/llava.dir/llava.cpp.o
/root/PowerInfer/examples/llava/llava.cpp: In function ‘bool load_file_to_bytes(const char*, unsigned char**, long int*)’:
/root/PowerInfer/examples/llava/llava.cpp:130:10: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
130 | fread(buffer, 1, fileSize, file); // Read the file into the buffer
| ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 68%] Building CXX object examples/llava/CMakeFiles/llava.dir/clip.cpp.o
[ 68%] Built target llava
[ 69%] Linking CXX static library libllava_static.a
[ 69%] Built target llava_static
[ 70%] Building CXX object examples/llava/CMakeFiles/llava-cli.dir/llava-cli.cpp.o
[ 71%] Linking CXX executable ../../bin/llava-cli
[ 71%] Built target llava-cli
[ 72%] Building CXX object examples/main/CMakeFiles/main.dir/main.cpp.o
[ 73%] Linking CXX executable ../../bin/main
[ 73%] Built target main
[ 74%] Building CXX object examples/parallel/CMakeFiles/parallel.dir/parallel.cpp.o
[ 75%] Linking CXX executable ../../bin/parallel
[ 75%] Built target parallel
[ 76%] Building CXX object examples/perplexity/CMakeFiles/perplexity.dir/perplexity.cpp.o
[ 78%] Linking CXX executable ../../bin/perplexity
[ 78%] Built target perplexity
[ 79%] Building CXX object examples/quantize/CMakeFiles/quantize.dir/quantize.cpp.o
[ 80%] Linking CXX executable ../../bin/quantize
[ 80%] Built target quantize
[ 81%] Building CXX object examples/quantize-stats/CMakeFiles/quantize-stats.dir/quantize-stats.cpp.o
[ 82%] Linking CXX executable ../../bin/quantize-stats
[ 82%] Built target quantize-stats
[ 83%] Building CXX object examples/save-load-state/CMakeFiles/save-load-state.dir/save-load-state.cpp.o
[ 84%] Linking CXX executable ../../bin/save-load-state
[ 84%] Built target save-load-state
[ 85%] Building CXX object examples/simple/CMakeFiles/simple.dir/simple.cpp.o
[ 86%] Linking CXX executable ../../bin/simple
[ 86%] Built target simple
[ 87%] Building CXX object examples/speculative/CMakeFiles/speculative.dir/speculative.cpp.o
[ 89%] Linking CXX executable ../../bin/speculative
[ 89%] Built target speculative
[ 90%] Building CXX object examples/train-text-from-scratch/CMakeFiles/train-text-from-scratch.dir/train-text-from-scratch.cpp.o
[ 91%] Linking CXX executable ../../bin/train-text-from-scratch
[ 91%] Built target train-text-from-scratch
[ 92%] Building CXX object examples/server/CMakeFiles/server.dir/server.cpp.o
In copy constructor ‘task_result::task_result(const task_result&)’,
inlined from ‘void __gnu_cxx::new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_result; _Args = {const task_result&}; _Tp = task_result]’ at /usr/include/c++/11/ext/new_allocator.h:162:4,
inlined from ‘static void std::allocator_traits<std::allocator<_Tp1> >::construct(std::allocator_traits<std::allocator<_Tp1> >::allocator_type&, _Up*, _Args&& ...) [with _Up = task_result; _Args = {const task_result&}; _Tp = task_result]’ at /usr/include/c++/11/bits/alloc_traits.h:516:17,
inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_result; _Alloc = std::allocator<task_result>]’ at /usr/include/c++/11/bits/stl_vector.h:1192:30,
inlined from ‘void llama_server_context::send_error(int, std::string)’ at /root/PowerInfer/examples/server/server.cpp:1097:32:
/root/PowerInfer/examples/server/server.cpp:154:8: warning: ‘res.task_result::stop’ may be used uninitialized [-Wmaybe-uninitialized]
154 | struct task_result {
| ^~~~~~~~~~~
/root/PowerInfer/examples/server/server.cpp: In member function ‘void llama_server_context::send_error(int, std::string)’:
/root/PowerInfer/examples/server/server.cpp:1093:21: note: ‘res’ declared here
1093 | task_result res;
| ^~~
In copy constructor ‘task_server::task_server(const task_server&)’,
inlined from ‘void __gnu_cxx::new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/ext/new_allocator.h:162:4,
inlined from ‘static void std::allocator_traits<std::allocator<_Tp1> >::construct(std::allocator_traits<std::allocator<_Tp1> >::allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/bits/alloc_traits.h:516:17,
inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator<task_server>]’ at /usr/include/c++/11/bits/stl_vector.h:1192:30,
inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at /root/PowerInfer/examples/server/server.cpp:1259:30,
inlined from ‘main(int, char**)::<lambda(const httplib::Request&, httplib::Response&)>’ at /root/PowerInfer/examples/server/server.cpp:2355:61:
/root/PowerInfer/examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized]
145 | struct task_server {
| ^~~~~~~~~~~
/root/PowerInfer/examples/server/server.cpp: In lambda function:
/root/PowerInfer/examples/server/server.cpp:1253:21: note: ‘task’ declared here
1253 | task_server task;
| ^~~~
In copy constructor ‘task_server::task_server(const task_server&)’,
inlined from ‘void __gnu_cxx::new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/ext/new_allocator.h:162:4,
inlined from ‘static void std::allocator_traits<std::allocator<_Tp1> >::construct(std::allocator_traits<std::allocator<_Tp1> >::allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/bits/alloc_traits.h:516:17,
inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator<task_server>]’ at /usr/include/c++/11/bits/stl_vector.h:1192:30,
inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at /root/PowerInfer/examples/server/server.cpp:1259:30,
inlined from ‘main(int, char**)::<lambda(const httplib::Request&, httplib::Response&)>’ at /root/PowerInfer/examples/server/server.cpp:2410:61:
/root/PowerInfer/examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized]
145 | struct task_server {
| ^~~~~~~~~~~
/root/PowerInfer/examples/server/server.cpp: In lambda function:
/root/PowerInfer/examples/server/server.cpp:1253:21: note: ‘task’ declared here
1253 | task_server task;
| ^~~~
In copy constructor ‘task_server::task_server(const task_server&)’,
inlined from ‘void __gnu_cxx::new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/ext/new_allocator.h:162:4,
inlined from ‘static void std::allocator_traits<std::allocator<_Tp1> >::construct(std::allocator_traits<std::allocator<_Tp1> >::allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/bits/alloc_traits.h:516:17,
inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator<task_server>]’ at /usr/include/c++/11/bits/stl_vector.h:1192:30,
inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at /root/PowerInfer/examples/server/server.cpp:1259:30,
inlined from ‘main(int, char**)::<lambda(const httplib::Request&, httplib::Response&)>’ at /root/PowerInfer/examples/server/server.cpp:2514:61:
/root/PowerInfer/examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized]
145 | struct task_server {
| ^~~~~~~~~~~
/root/PowerInfer/examples/server/server.cpp: In lambda function:
/root/PowerInfer/examples/server/server.cpp:1253:21: note: ‘task’ declared here
1253 | task_server task;
| ^~~~
[ 93%] Linking CXX executable ../../bin/server
[ 93%] Built target server
[ 94%] Building CXX object examples/export-lora/CMakeFiles/export-lora.dir/export-lora.cpp.o
[ 95%] Linking CXX executable ../../bin/export-lora
[ 95%] Built target export-lora
[ 96%] Building CXX object pocs/vdot/CMakeFiles/vdot.dir/vdot.cpp.o
[ 97%] Linking CXX executable ../../bin/vdot
[ 97%] Built target vdot
[ 98%] Building CXX object pocs/vdot/CMakeFiles/q8dot.dir/q8dot.cpp.o
[100%] Linking CXX executable ../../bin/q8dot
[100%] Built target q8dot
Run output:
(Powerinfer2) root@autodl-container-02b744a905-865b3ab4:~/PowerInfer/build/bin# /root/PowerInfer/build/bin/main -m /root/autodl-tmp/llm/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf -n 128 -t 8 -p "In the depths of" --ignore-eos
Log start
main: build = 1579 (3f638f7)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1717592515
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9
llama_model_loader: loaded meta data with 22 key-value pairs and 443 tensors from /root/autodl-tmp/llm/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor 0: token_embd.weight f16 [ 5120, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.ffn_down_t.weight f16 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.ffn_gate.weight f16 [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.ffn_up.weight f16 [ 5120, 13824, 1, 1 ]
............
llama_model_loader: - tensor 440: blk.38.fc2.weight f16 [ 2048, 13824, 1, 1 ]
llama_model_loader: - tensor 441: blk.39.fc1.weight f16 [ 5120, 2048, 1, 1 ]
llama_model_loader: - tensor 442: blk.39.fc2.weight f16 [ 2048, 13824, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 362 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly F16
llm_load_print_meta: model params = 14.16 B
llm_load_print_meta: model size = 26.38 GiB (16.00 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llm_load_sparse_model_tensors: ggml ctx size = 0.16 MB
llm_load_sparse_model_tensors: using CUDA for GPU acceleration
llm_load_sparse_model_tensors: offloaded layers from VRAM budget(24735645696 bytes): 41/40
llm_load_sparse_model_tensors: mem required = 27009.74 MB
llm_load_sparse_model_tensors: VRAM used: 10497.08 MB
............................................................
CUDA error 1 at /root/PowerInfer/ggml-cuda.cu:8949: invalid argument
current device: 0
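
For reference, the memory figures in the log above can be put on a common unit. This is a minimal Python sketch and not part of PowerInfer. It assumes the "MB" values in the log count units of 2^20 bytes, which the log itself does not state, and the variable names are made up for illustration.

```python
# Put the memory figures from the run log on one unit (MiB).
# Assumption: "MB" in the log means 2**20 bytes; the log does not say so.

MIB = 1024 ** 2


def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to MiB (2**20 bytes)."""
    return n_bytes / MIB


# Values copied from the llm_load_sparse_model_tensors lines of the log.
vram_budget_bytes = 24735645696  # "VRAM budget(24735645696 bytes)"
mem_required_mib = 27009.74      # "mem required = 27009.74 MB"
vram_used_mib = 10497.08         # "VRAM used: 10497.08 MB"

vram_budget_mib = bytes_to_mib(vram_budget_bytes)

print(f"VRAM budget : {vram_budget_mib:.2f} MiB")
print(f"VRAM used   : {vram_used_mib:.2f} MiB")
print(f"mem required: {mem_required_mib:.2f} MiB")
print(f"used within budget: {vram_used_mib <= vram_budget_mib}")
```

Under that assumption the budget is 23589.75 MiB, so the reported VRAM usage is well below the budget at the point where the CUDA error is raised.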