[THUDM/ChatGLM-6B] P-tuning on my own data raises JSONDecodeError("Extra data", s, end) json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 259)

2024-05-10 787 views
5

Generating train split: 43060 examples [00:00, 344470.14 examples/s]
06/07/2023 08:42:53 - ERROR - datasets.packaged_modules.json.json - Failed to read file 'D:\PycharmProject\ChatGLM-6B\data\Child\train.json' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid encoding in string. in row 97
Traceback (most recent call last):
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\packaged_modules\json\json.py", line 134, in _generate_tables
    dataset = json.load(f)
  File "D:\Anaconda3\envs\chatglm6b\lib\json\__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "D:\Anaconda3\envs\chatglm6b\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "D:\Anaconda3\envs\chatglm6b\lib\json\decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 259)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\builder.py", line 1858, in _prepare_split_single
    for _, table in generator:
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\packaged_modules\json\json.py", line 137, in _generate_tables
    raise e
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\packaged_modules\json\json.py", line 114, in _generate_tables
    io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
  File "pyarrow\_json.pyx", line 258, in pyarrow._json.read_json
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid encoding in string. in row 97

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "main.py", line 430, in <module>
    main()
  File "main.py", line 103, in main
    use_auth_token=True if model_args.use_auth_token else None,
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\load.py", line 1803, in load_dataset
    storage_options=storage_options,
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\builder.py", line 894, in download_and_prepare
    **download_and_prepare_kwargs,
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\builder.py", line 985, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\builder.py", line 1747, in _prepare_split
    gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\builder.py", line 1891, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

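The above is the error output. I am fine-tuning on my own data with P-Tuning v2. Once the dataset grows past 100,000 lines, the error above appears (the JSON encoding itself is fine). Searching online suggested that the cause is that train.json is too large; after I cut it down to 70,000 lines, the program ran normally. My questions:

  1. How can this error be resolved when the dataset is large (more than 100,000 lines)?
  2. How can I continue training on the remaining 30,000 lines, starting from the model already trained on the 70,000 lines?

Steps To Reproduce
  1. The environment is listed below.
  2. Model parameters: PRE_SEQ_LEN=128 LR=2e-2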

CUDA_VISIBLE_DEVICES=0 python main.py \
    --do_train \
    --train_file ../data/Child/train.json \
    --validation_file ../data/Child/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path THUDM/chatglm-6b \
    --output_dir output/child-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 1100 \
    --max_target_length 10000 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 4

Environment
- OS: Windows Server 2016
- Python: 3.7.16
- Transformers: 4.27.1
- PyTorch: 1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :11.7 True

Answers

8

Hello, how can this problem be solved?

6

My training data has 73,000 lines and I hit this error too. Fine-tuning only started once I reduced it to 30,000 lines.

0

Strange: mine only runs with 5,000 lines. Why would that be? I am using a single NVIDIA A10.

6

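I ran into this problem before as well.

When preparing the data, make sure every value is a string; numeric values must not appear.

You can see how I solved it at this link.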

# Convert every prompt and response to a string to avoid numeric-format errors
prompts = [str(i) for i in prompts]
responses = [str(i) for i in responses]
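
Applied to a whole training file, the same idea might look like the sketch below. It assumes train.json is in JSON Lines form (one JSON object per line) and that the fields are named content and summary, as passed to --prompt_column and --response_column in the question. The function name stringify_records and the output file name are hypothetical, not part of the ChatGLM-6B code.

```python
import json


def stringify_records(src_path, dst_path, keys=("content", "summary")):
    """Copy a JSON Lines file, coercing the given fields to str.

    Assumption: one JSON object per line, encoded as UTF-8.
    Returns the number of records written.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            for key in keys:
                if key in record:
                    # Note: None becomes the string "None"; adjust if that is unwanted.
                    record[key] = str(record[key])
            # ensure_ascii=False keeps Chinese text readable in the output file
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, stringify_records("train.json", "train_str.json") would write a cleaned copy, which could then be passed to --train_file.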
6

Has this been solved? My training set is only 20,000 lines and I already hit this error, on a single V100.

4

Same question. On a 4080, just 8,000 records already trigger raise JSONDecodeError("Extra data", s, end)... very strange.
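
The sizes at which the error appears in this thread range from about 5,000 to over 100,000 lines, and the log in the question shows pyarrow reporting "Invalid encoding in string. in row 97" before the "Extra data" error is raised. So it may be worth checking individual records rather than file size alone. Below is a minimal diagnostic sketch, not a confirmed fix. It assumes train.json is in JSON Lines form with the content and summary fields from the question, and find_bad_lines is a hypothetical helper. The unpaired-surrogate check is an assumption about what a stricter parser might reject. The row number in the pyarrow message may also be counted within a read block, so it need not match a line number in the file.

```python
import json


def find_bad_lines(path, keys=("content", "summary")):
    """Return (line_number, reason) pairs for lines that look problematic.

    Assumption: the file holds one JSON object per line (JSON Lines).
    """
    problems = []
    with open(path, "rb") as f:  # read bytes so decoding errors are caught per line
        for lineno, raw in enumerate(f, start=1):
            if not raw.strip():
                continue  # skip blank lines
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError as e:
                problems.append((lineno, f"not valid UTF-8: {e}"))
                continue
            try:
                record = json.loads(text)
            except json.JSONDecodeError as e:
                problems.append((lineno, f"not valid JSON: {e}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "top-level value is not an object"))
                continue
            for key in keys:
                value = record.get(key)
                if not isinstance(value, str):
                    problems.append(
                        (lineno, f"field {key!r} is {type(value).__name__}, not str"))
                    continue
                try:
                    # Python's json module accepts unpaired \ud8xx escapes,
                    # but such strings cannot be encoded as UTF-8.
                    value.encode("utf-8")
                except UnicodeEncodeError:
                    problems.append(
                        (lineno, f"field {key!r} contains an unpaired surrogate"))
    return problems
```

Running for lineno, reason in find_bad_lines("train.json"): print(lineno, reason) lists the suspicious lines, which can then be fixed or dropped before training.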