```
Generating train split: 43060 examples [00:00, 344470.14 examples/s]
06/07/2023 08:42:53 - ERROR - datasets.packaged_modules.json.json - Failed to read file 'D:\PycharmProject\ChatGLM-6B\data\Child\train.json' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid encoding in string. in row 97
Traceback (most recent call last):
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\packaged_modules\json\json.py", line 134, in _generate_tables
    dataset = json.load(f)
  File "D:\Anaconda3\envs\chatglm6b\lib\json\__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "D:\Anaconda3\envs\chatglm6b\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "D:\Anaconda3\envs\chatglm6b\lib\json\decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 259)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\builder.py", line 1858, in _prepare_split_single
    for _, table in generator:
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\packaged_modules\json\json.py", line 137, in _generate_tables
    raise e
  File "D:\Anaconda3\envs\chatglm6b\lib\site-packages\datasets\packaged_modules\json\json.py", line 114, in _generate_tables
    io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
  File "pyarrow\_json.pyx", line 258, in pyarrow._json.read_json
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid encoding in string. in row 97

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "main.py", line 430, in
```
The error output above comes from fine-tuning my own data with P-tuning v2. Once the dataset grows past about 100,000 rows, this error appears (the JSON encoding itself looks fine to me). Online sources suggest the cause is train.json being too large; after cutting the file down to 70,000 rows, training runs normally. My questions:
- How can I resolve this error with a large dataset (over 100,000 rows)? (See the encoding-check sketch below.)
- How can I continue training on the remaining 30,000 rows, starting from the model already trained on the first 70,000 rows? (See the checkpoint-reload sketch that follows it.)
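For the first question, one thing worth checking before assuming a size limit: the trace ends in pyarrow's "Invalid encoding in string", which usually points at a stray non-UTF-8 byte in a single row rather than at file size, and a trimmed file can pass simply because the bad row was among those removed. Below is a minimal sketch that scans the JSON Lines file row by row, drops any row that is not valid UTF-8 or valid JSON, and reports the offending row numbers; the `train_clean.json` output path is my own placeholder:

```python
import json

SRC = r"D:\PycharmProject\ChatGLM-6B\data\Child\train.json"        # path from the log above
DST = r"D:\PycharmProject\ChatGLM-6B\data\Child\train_clean.json"  # hypothetical output file

bad_rows = []
with open(SRC, "rb") as fin, open(DST, "w", encoding="utf-8") as fout:
    for lineno, raw in enumerate(fin, start=1):
        try:
            record = json.loads(raw.decode("utf-8"))  # pyarrow requires strictly valid UTF-8
        except (UnicodeDecodeError, json.JSONDecodeError):
            bad_rows.append(lineno)                   # remember the offending row and skip it
            continue
        fout.write(json.dumps(record, ensure_ascii=False) + "\n")

print(f"Dropped {len(bad_rows)} bad row(s): {bad_rows[:20]}")
```

If the script reports no bad rows, the size hypothesis becomes more plausible, and splitting the data across several sequential runs (see the next sketch) is a fallback.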
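For the second question: a P-tuning v2 run only updates the prefix encoder, and each `checkpoint-*` directory under `--output_dir` holds its weights. As I understand the repo's ptuning/main.py, passing `--ptuning_checkpoint <checkpoint dir>` on a second run (with `--train_file` pointed at the remaining 30,000 rows and the same `--pre_seq_len`) reloads those weights before training continues. The reload amounts to roughly this sketch; the checkpoint path is the last one my first run would have written with `--save_steps 1000` and `--max_steps 3000`:

```python
import os
import torch
from transformers import AutoConfig, AutoModel

CKPT = "output/child-chatglm-6b-pt-128-2e-2/checkpoint-3000"  # last checkpoint of the first run

# Rebuild the model with the same prefix length as the first run
config = AutoConfig.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
config.pre_seq_len = 128
model = AutoModel.from_pretrained("THUDM/chatglm-6b", config=config, trust_remote_code=True)

# Load only the prefix-encoder weights from the checkpoint, which is what
# main.py does (as far as I can tell) when --ptuning_checkpoint is given
state = torch.load(os.path.join(CKPT, "pytorch_model.bin"), map_location="cpu")
prefix = {k[len("transformer.prefix_encoder."):]: v
          for k, v in state.items() if k.startswith("transformer.prefix_encoder.")}
model.transformer.prefix_encoder.load_state_dict(prefix)
```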
My environment and training setup:
- Hyperparameters: PRE_SEQ_LEN=128, LR=2e-2
- Launch command:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
    --do_train \
    --train_file ../data/Child/train.json \
    --validation_file ../data/Child/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path THUDM/chatglm-6b \
    --output_dir output/child-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 1100 \
    --max_target_length 10000 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 4
```
Environment:
- OS: Windows Server 2016
- Python: 3.7.16
- Transformers: 4.27.1
- PyTorch: 1.13.1
- CUDA support (`python -c "import torch; print(torch.cuda.is_available())"`): True (CUDA 11.7)