[huggingface/transformers] CodeLlama generates drastically different outputs with flash_attention_2

2024-02-19
System Info
  • transformers version: 4.35.0.dev0
  • Platform: Linux-4.18.0-477.10.1.el8_8.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.7
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.23.0
  • Accelerate config:
    • compute_environment: LOCAL_MACHINE
      • distributed_type: NO
      • mixed_precision: bf16
      • use_cpu: False
      • debug: False
      • num_processes: 1
      • machine_rank: 0
      • num_machines: 1
      • rdzv_backend: static
      • same_network: False
      • main_training_function: main
      • downcast_bf16: False
      • tpu_use_cluster: False
      • tpu_use_sudo: False
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes, 1x Nvidia A100
  • Using distributed or parallel set-up in script?: No

First, we load CodeLlama 7B with and without Flash Attention 2. I use torch_dtype=torch.bfloat16.

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch

model_weights_name_or_path = "codellama/CodeLlama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_weights_name_or_path)
tokenizer.pad_token_id = tokenizer.unk_token_id

config = AutoConfig.from_pretrained(
    model_weights_name_or_path,
    trust_remote_code=False,
    # pretraining_tp=1, Setting pretraining_tp=1 doesn't solve the issue
)

model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_weights_name_or_path,
        device_map=None,
        max_memory=None,
        quantization_config=None,
        torch_dtype=torch.bfloat16,
        config=config,
        trust_remote_code=True,
        use_flash_attention_2=False
).to("cuda")

model_flash = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_weights_name_or_path,
        device_map=None,
        max_memory=None,
        quantization_config=None,
        torch_dtype=torch.bfloat16,
        config=config,
        trust_remote_code=True,
        use_flash_attention_2=True
).to("cuda")

model_flash.eval()
model.eval()
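
As an aside: newer transformers releases deprecate the use_flash_attention_2 argument in favor of attn_implementation. A minimal sketch of the equivalent loading call, assuming a transformers version (>= 4.36) where that argument is available:

# Hypothetical equivalent for newer transformers releases, where
# `attn_implementation` replaces the deprecated `use_flash_attention_2` flag.
model_flash_alt = AutoModelForCausalLM.from_pretrained(
    model_weights_name_or_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")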

Prepare the inputs. The NER example I use comes from https://github.com/hitz-zentroa/GoLLIE/


prompt = '''
# The following lines describe the task definition
@dataclass
class PrivateSpaceCompany(Entity):
    """Refers to private companies primarily focused on space exploration, transportation,
    satellite launch, or space-based services. These are non-governmental entities that have
    a commercial interest in space activities."""

    span: str  # Such as: "Blue origin", "Boeing", "Northrop Grumman", "Arianespace"

@dataclass
class PublicSpaceCompany(Entity):
    """Refers to governmental entities or agencies that are primarily focused on space
    exploration, research, transportation, satellite launch, or other space-based services.
    These entities are state-owned and operated and are generally funded through public funds.
    """

    span: str  # Such as "ESA", "ISRO", "CNSA"

@dataclass
class Planet(Entity):
    """Refers to celestial bodies that orbit a star. Planets are large enough
    to have cleared their orbits of other debris and have a nearly round shape
    due to their self-gravity."""

    span: str  # Such as: "Earth", "Jupiter", "Venus", "Mercury", "Saturn"

@dataclass
class Launcher(Entity):
    """Refers to a vehicle designed primarily to transport payloads from the Earth's
    surface to space. Launchers can carry various payloads, including satellites,
    crewed spacecraft, and cargo, into various orbits or even beyond Earth's orbit.
    They are usually multi-stage vehicles that use rocket engines for propulsion."""

    span: str  # Such as: "Sturn V", "Atlas V", "Soyuz", "Ariane 5"

# This is the text to analyze
text = "SpaceX is colaborating with NASA in the mission to bring humans to Mars using their new Starship rocket."

# The annotation instances that take place in the text above are listed here
result ='''.strip()

model_input = tokenizer(prompt, add_special_tokens=True, return_tensors="pt")

Finally, we can run both models with the same input:

model_flash_output = model_flash.generate(
    **model_input.to(model_flash.device),
    max_new_tokens=128,
    do_sample=False,
    min_new_tokens=0,
    num_beams=1,
    num_return_sequences=1,
)

model_output = model.generate(
    **model_input.to(model.device),
    max_new_tokens=128,
    do_sample=False,
    min_new_tokens=0,
    num_beams=1,
    num_return_sequences=1,
)

print(tokenizer.batch_decode(model_flash_output)[0])
print(tokenizer.batch_decode(model_output)[0])
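
Note: to print only the continuation (which is what I show below), the prompt tokens can be sliced off before decoding; a small sketch using the variables defined above:

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = model_input["input_ids"].shape[1]
print(tokenizer.decode(model_flash_output[0][prompt_len:]))
print(tokenizer.decode(model_output[0][prompt_len:]))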

The results (I skip the input tokens for brevity) are:


# FLASH ATTENTION 2 MODEL
result = [ <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE> <PRE>

# BASELINE MODEL
result = [
    PrivateSpaceCompany(span="SpaceX"),
    PublicSpaceCompany(span="NASA"),
    Planet(span="Mars"),
    Launcher(span="Starship"),
]
Expected behavior

Both models should produce the same output.
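
One way to localize the divergence is to compare the forward-pass logits of the two models on the same prompt; a minimal debugging sketch using the objects defined above:

import torch

# Run a single forward pass with each attention implementation and compare logits.
with torch.no_grad():
    inputs = model_input.to(model.device)
    logits_eager = model(**inputs).logits
    logits_flash = model_flash(**inputs).logits

# Some bf16 numerical noise is expected, but the greedy (argmax) token at each
# position should normally be identical between the two implementations.
max_diff = (logits_eager - logits_flash).abs().max().item()
same_greedy = (logits_eager.argmax(-1) == logits_flash.argmax(-1)).all().item()
print(f"max |logits diff| = {max_diff:.4f}, greedy tokens identical: {same_greedy}")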

Answers

cc @younesbelkada Pretty sure we have logits and generation tests for Llama, but not for CodeLlama.

@ArthurZucker @younesbelkada It seems related to updating Flash Attention from version 2.0.8 to 2.3.2.
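
For anyone checking which flash-attn build they are running, a quick sketch (assuming the flash_attn package exposes __version__, which recent releases do):

import flash_attn

# Print the installed flash-attn version to correlate with the behavior above.
print(flash_attn.__version__)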