When I check
std::vector<llama_token> tokens = llama_tokenize(ctx, "[PAD50277]", false, true);
with mpt, the string resolves to the single token 50277. This looks wrong to me. Should I try to fix this in the PR as well?
I checked the behavior of the HF tokenizer with
# tests with HF tokenizer
import argparse

from transformers import AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("dir_tokenizer", help="directory containing 'tokenizer.model' file")
args = parser.parse_args()
dir_tokenizer = args.dir_tokenizer

tokenizer = AutoTokenizer.from_pretrained(dir_tokenizer)
added_vocab = tokenizer.get_added_vocab()  # token string -> id
reverse_vocab = {tok_id: encoded_tok for encoded_tok, tok_id in tokenizer.vocab.items()}

# Number of added tokens whose IDs lie beyond the base vocabulary.
n_extra = len([tok_id for tok_id in added_vocab.values() if tok_id >= tokenizer.vocab_size])

with open("result.txt", "w", encoding="utf-8") as file:
    # The trailing "+ 1" also probes one ID past the base and added tokens.
    for i in range(tokenizer.vocab_size + n_extra + 1):
        s = tokenizer.decode([i])  # ID -> text
        e = tokenizer.encode(s)    # text -> IDs (round trip)
        if i not in reverse_vocab:
            file.write(f"pad {i} {s} {e}\n")
        elif reverse_vocab[i] in added_vocab:
            if tokenizer.added_tokens_decoder[i].special:
                file.write(f"special {i} {s} {e}\n")
            else:
                file.write(f"added {i} {s} {e}\n")
        # else:
        #     file.write(f"normal {i} {s} {e}\n")
and got for llama-2-7B
special 0 <unk> [1, 0]
special 1 <s> [1, 1]
special 2 </s> [1, 2]
pad 32000 [1]
and for mpt
special 0 <|endoftext|> [0]
special 1 <|padding|> [1]
added 50254 [50254]
added 50255 [50255]
added 50256 [50256]
added 50257 [50257]
added 50258 [50258]
added 50259 [50259]
added 50260 [50260]
added 50261 [50261]
added 50262 [50262]
added 50263 [50263]
added 50264 [50264]
added 50265 [50265]
added 50266 [50266]
added 50267 [50267]
added 50268 [50268]
added 50269 [50269]
added 50270 [50270]
added 50271 [50271]
added 50272 [50272]
added 50273 [50273]
added 50274 [50274]
added 50275 [50275]
added 50276 [50276]
pad 50277 []
and for causalml
special 151643 <|endoftext|> [151643]
special 151644 <|im_start|> [151644]
special 151645 <|im_end|> [151645]
pad 151851 []
and for stablelm (stablelm-3b-4e1t)
special 0 <|endoftext|> [0]
special 1 <|padding|> [1]
added 50254 [50254]
added 50255 [50255]
added 50256 [50256]
added 50257 [50257]
added 50258 [50258]
added 50259 [50259]
added 50260 [50260]
added 50261 [50261]
added 50262 [50262]
added 50263 [50263]
added 50264 [50264]
added 50265 [50265]
added 50266 [50266]
added 50267 [50267]
added 50268 [50268]
added 50269 [50269]
added 50270 [50270]
added 50271 [50271]
added 50272 [50272]
added 50273 [50273]
added 50274 [50274]
added 50275 [50275]
added 50276 [50276]
pad 50277 []
To me this indicates that our special token handling (CONTROL) differs from HF's, and that our handling of padding tokens may be problematic.
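To make the three categories in the output above easier to compare against another tokenizer, here is a minimal sketch of the same classification as a pure function. The function name `classify_token_ids` and the toy dictionaries are hypothetical; they only stand in for `tokenizer.vocab`, `get_added_vocab()` and the `special` flag from `added_tokens_decoder`, and are not taken from any real model.

```python
def classify_token_ids(n_ids, reverse_vocab, added_vocab, special_ids):
    """Label IDs 0..n_ids-1 as pad / special / added / normal.

    Mirrors the branching of the HF test script: an ID missing from the
    vocabulary is "pad"; an added token is "special" or "added" depending
    on its special flag; everything else is "normal".
    """
    labels = {}
    for i in range(n_ids):
        if i not in reverse_vocab:
            labels[i] = "pad"
        elif reverse_vocab[i] in added_vocab:
            labels[i] = "special" if i in special_ids else "added"
        else:
            labels[i] = "normal"
    return labels


# Toy data (assumption, for illustration only).
reverse_vocab = {0: "<|endoftext|>", 1: "a", 2: "b", 3: "  "}  # id -> token
added_vocab = {"<|endoftext|>": 0, "  ": 3}                     # token -> id
special_ids = {0}                                               # IDs flagged special

labels = classify_token_ids(5, reverse_vocab, added_vocab, special_ids)
pad_ids = [i for i, label in labels.items() if label == "pad"]

print(labels)
print(pad_ids)
```

The `pad_ids` list collects the IDs that have no vocabulary entry. In the mpt, causalml and stablelm output above, the probed padding ID round-trips to an empty encoding, so these are the IDs worth checking against what `llama_tokenize` returns for the corresponding placeholder text.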