How to Run NLP + Transformers LLMs on a Low-Memory GPU

霞舞

Published: 2025-09-15 21:13:01

Source: php中文网

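This article addresses out-of-memory errors when running large language models (LLMs) on GPUs with limited memory. It presents a workable approach based on model quantization and the AutoAWQ tool, together with choosing a build that matches your CUDA version, and shows how to apply these steps in code so that the model loads and runs in a resource-constrained environment.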

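When you try to run a large language model (LLM), especially on a GPU with limited resources, you will often run out of memory. This usually happens because the model has so many parameters that its weights do not fit in GPU memory. This article describes one effective remedy, model quantization, and shows how to load and run a quantized model with AutoAWQ.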

Introduction to model quantization

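Model quantization shrinks a model by storing its parameters at lower precision. For example, converting parameters stored as 32-bit floating-point numbers (float32) to 8-bit integers (int8) cuts the memory needed for the weights to roughly a quarter. A quantized model is smaller and can also run inference faster, which makes it better suited to resource-constrained devices.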
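To make the saving concrete, here is a rough back-of-the-envelope sketch. The 7-billion parameter count and the helper name weight_memory_gib are assumptions for illustration; the figures cover weights only and ignore activations, the KV cache, and the small overhead of quantization scales, so real usage will be somewhat higher.

```python
# Rough estimate of the memory needed just to hold model weights.
# Activations, the KV cache, and framework overhead are NOT included.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the approximate weight size in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


n = 7_000_000_000  # assumed parameter count for a "7B" model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(n, dtype):6.2f} GiB")
```

Under these assumptions the weights drop from about 26 GiB at float32 to a little over 3 GiB at 4 bits, which is the difference between not fitting and fitting on many consumer GPUs.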

Loading a quantized model with AutoAWQ

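AutoAWQ is a library dedicated to quantizing models and loading quantized models. It makes it easy to convert a large model into a quantized version and run it efficiently on supported hardware.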

1. Install the required libraries

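First, install the transformers and accelerate libraries, along with a specific build of AutoAWQ. Because CUDA versions differ between environments, pick the AutoAWQ build that is compatible with your CUDA version.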

!pip install -q transformers accelerate
!pip install -q -U https://www.php.cn/link/9b67ff5862eb652ed35b0554cb7f62f2/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

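Note: the command above installs the build for CUDA 11.8 and, as the filename indicates, for CPython 3.10 on 64-bit Linux. Choose the AutoAWQ whl file that matches your own CUDA version; whl files for other CUDA versions are available on the AutoAWQ GitHub Releases page.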
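If you are unsure which wheel to pick, the following sketch may help. It reads the CUDA version that the installed PyTorch build was compiled against and derives the tags that appear in the wheel filename (for example cu118 and cp310). The helper names are hypothetical, and the sketch assumes that matching PyTorch's CUDA version is the right criterion in your environment.

```python
import sys


def cuda_wheel_tag(cuda_version: str) -> str:
    """Turn a CUDA version string such as '11.8' into a wheel tag such as 'cu118'."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


def python_wheel_tag() -> str:
    """Return the CPython tag of the running interpreter, e.g. 'cp310'."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def detect_cuda_version():
    """Return the CUDA version PyTorch was built with, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed
    return torch.version.cuda  # None for CPU-only builds


version = detect_cuda_version()
if version is None:
    print("No CUDA-enabled PyTorch build found.")
else:
    print(f"Look for a wheel tagged +{cuda_wheel_tag(version)} and {python_wheel_tag()}")
```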

2. Load the quantized model and tokenizer

Next, load the quantized model with the from_quantized method of AutoAWQForCausalLM, and load the matching tokenizer.

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = 'TheBloke/neural-chat-7B-v3-1-AWQ'  # choose a quantized model
model = AutoAWQForCausalLM.from_quantized(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Note: TheBloke publishes many quantized models on Hugging Face; pick the one that fits your needs.
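When choosing a model for a small GPU, it can be useful to see how much device memory is actually free. The sketch below is one possible way to check, using torch.cuda.mem_get_info; the helper names are hypothetical, and it returns None when PyTorch or a CUDA device is not available.

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / (1024 ** 3)


def gpu_memory_report():
    """Return (free_gib, total_gib) for the current CUDA device, or None."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed
    if not torch.cuda.is_available():
        return None  # no usable CUDA device
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return bytes_to_gib(free_bytes), bytes_to_gib(total_bytes)


report = gpu_memory_report()
if report is None:
    print("No CUDA device visible to PyTorch.")
else:
    free_gib, total_gib = report
    print(f"GPU memory: {free_gib:.2f} GiB free of {total_gib:.2f} GiB")
```

Running it once before and once after loading the model gives a rough idea of how much room is left for generation.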

3. Modify the inference code

In the inference code, the input tensor must be moved to the GPU, which can be done with the .cuda() method.

def generate_response(system_input, user_input):

    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    # Tokenize and encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).cuda()

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]
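The prompt formatting and the final split can be tried without a GPU. The sketch below isolates those two steps into hypothetical helpers (build_prompt and extract_assistant_reply) and runs them on a made-up decoded string; the completion text is invented for illustration and is not real model output.

```python
def build_prompt(system_input: str, user_input: str) -> str:
    """Format the inputs with the same template used in generate_response."""
    return f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"


def extract_assistant_reply(decoded: str) -> str:
    """Keep only the text after the last '### Assistant:' marker."""
    return decoded.split("### Assistant:\n")[-1]


prompt = build_prompt("You are a math expert assistant.", "calculate 100 + 520 + 60")

# Simulated decoder output: the prompt followed by an invented completion.
# A causal LM returns the prompt tokens together with the generated ones,
# which is why the split on the marker is needed.
fake_output = prompt + "100 + 520 + 60 = 680"

print(extract_assistant_reply(fake_output))
```

Note that max_length in generate counts the prompt tokens as well as the generated ones; the max_new_tokens argument limits only the newly generated part, which may be the more convenient bound when memory is tight.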

4. Example run

# Example usage
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)