离线安装Ubuntu显卡机
文章目录
说明
部分英伟达显卡机是不能联网的,需要离线安装服务器环境,这里使用ubuntu-22.04.5-live-server-amd64.iso , 需要有一台联网服务器(或者虚拟机)下载需要的deb包,拷贝到显卡机,进行安装
注意:使用初始化干净的Ubuntu系统,已经安装的deb包,apt-get install -y --download-only不会再次下载
基础环境
## 使用国内的apt源,这里使用 https://mirrors.aliyun.com
sed -i 's/http:\/\/cn.archive.ubuntu.com/https:\/\/mirrors.aliyun.com/g' /etc/apt/sources.list
sed -i 's/http:\/\/security.ubuntu.com/https:\/\/mirrors.aliyun.com/g' /etc/apt/sources.list
## 配置docker仓库
# 卸载旧版本
apt remove docker docker-engine docker.io containerd runc
# 更新软件源
sudo apt update
# 安装所需依赖
sudo apt -y install apt-transport-https ca-certificates curl software-properties-common
# 安装 Docker GPG 证书
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker-aliyun.gpg
# 新增 Docker 软件源信息
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker-aliyun.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
# 安装 nvidia-container-runtime GPG 证书
curl -fsSL https://mirrors.ustc.edu.cn/nvidia-container-runtime/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# 新增 nvidia-container-runtime 软件源信息
echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://mirrors.ustc.edu.cn/nvidia-container-runtime/stable/deb/$(dpkg --print-architecture) /" | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
下载依赖
apt clean
apt update
## 下载docker依赖包
apt-get install -y --download-only docker-ce docker-ce-cli containerd.io docker-compose-plugin
## 下载nvidia-container-runtime依赖包
apt-get install -y --download-only nvidia-container-runtime
## 安装显卡驱动的依赖,linux-modules-extra-5.15.0-119-generic $(uname -sr)需要匹配实际的内核版本
apt-get install -y --download-only build-essential libboost-program-options-dev cmake zip unzip rdma-core infiniband-diags ibverbs-providers libibverbs-dev dpkg-dev perl linux-modules-extra-5.15.0-119-generic
###下载的deb包都在 /var/cache/apt/archives
mkdir -p ./deps/
cp -rf /var/cache/apt/archives/*.deb ./deps/
## 下载并安装以下包
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager_580.167.08-1ubuntu1_amd64.deb
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/libnvidia-nscq_580.167.08-1ubuntu1_amd64.deb
##B300需要
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvlsm_2025.10.14-1_amd64.deb
sudo depmod -a
sudo modprobe ib_umad
#sudo modprobe rdma_ucm
#sudo modprobe ib_uverbs
#sudo modprobe ib_verbs
#sudo modprobe ib_core
lsmod | grep ib
# nvidia-fabricmanager 服务
systemctl start nvidia-fabricmanager
systemctl enable nvidia-fabricmanager
## 下载显卡驱动,可能需要梯子使用浏览器下载
wget https://us.download.nvidia.com/tesla/580.167.08/NVIDIA-Linux-x86_64-580.167.08.run
wget https://developer.download.nvidia.com/compute/cuda/13.0.3/local_installers/cuda_13.0.3_580.126.20_linux.run
## 压测工具
#gpu-burn: https://github.com/wilicc/gpu-burn
#p2pBandwidthLatencyTest: https://github.com/NVIDIA/cuda-samples
#nvbandwidth: https://github.com/NVIDIA/nvbandwidth
sglang运行GLM5.2-NVFP4
docker-compose.yaml
services:
sglang-glm52-nvfp4:
image: lmsysorg/sglang:dev-cu13-glm52-nvfp4
container_name: sglang-glm52-nvfp4
restart: unless-stopped
network_mode: host
privileged: true
runtime: nvidia
ipc: host
shm_size: 128g
#ports:
# - "8000:8000"
ulimits:
memlock:
soft: -1
hard: -1
stack:
soft: 67108864
hard: 67108864
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
# 宿主机模型路径改成你实际存放GLM5.2-NVFP4的目录
- /data/.cache/huggingface/hub/models--nvidia--GLM-5.2-NVFP4/snapshots/aec724e8c7b8ee9db3b48c01c320f63f9cdaf8aa:/app/models/GLM-5.2-NVFP4
command: >
sglang serve
--model-path /app/models/GLM-5.2-NVFP4
--served-model-name GLM5.2
--tensor-parallel-size 8
--quantization modelopt_fp4
--tool-call-parser glm47
--reasoning-parser glm45
--trust-remote-code
--chunked-prefill-size 65536
--mem-fraction-static 0.85
--host 0.0.0.0
--port 8000
--speculative-algorithm NEXTN
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
--model-path: 模型本地路径,指定加载大模型权重文件所在目录--served-model-name: 对外服务模型名,接口请求时填写的模型名称,自定义别名--tensor-parallel-size 8: 张量并行数,拆分模型权重分到 8 张 GPU 运行,多卡均分负载--quantization modelopt_fp4: 量化方案,使用 NVIDIA ModelOpt FP4 4 比特权重量化,大幅省显存--tool-call-parser glm47: 工具调用解析器,适配 GLM47 系列格式解析函数调用、插件调用逻辑--reasoning-parser glm45: 思维链推理解析器,适配 GLM45 格式解析模型内部思考 / 推理内容--trust-remote-code: 信任远程代码,自动执行模型仓库内自定义建模代码,加载非标准架构模型必备--chunked-prefill-size 65536: 分块预填充长度,超长上下文分块编码,支持更大输入文本,数值越大支持越长 prompt--mem-fraction-static 0.85: 静态显存占用比例,限制模型权重最多占用单卡 85% 显存,预留显存给推理 / 缓存--host 0.0.0.0: 监听地址,允许局域网 / 外网所有 IP 访问服务--port 8000: 服务端口,API 接口默认监听 8000 端口--speculative-algorithm NEXTN: 投机解码算法,选用 NextN 极速投机推理算法,加速生成速度--speculative-num-steps 3: 投机迭代步数,单次推理执行 3 轮投机校验--speculative-eagle-topk 1: Eagle 采样候选数,仅取 Top1 最优候选 token,提升稳定性--speculative-num-draft-tokens 4: 预生成草稿 token 数,一次提前预推 4 个预测 token,显著提升生成吞吐
压测
# 设置 huggingface 国内镜像
export HF_ENDPOINT=https://hf-mirror.com
# --max-concurrency 1 单用户压测
python3 -m sglang.bench_serving --backend sglang --model nvidia/GLM-5.2-NVFP4 --dataset-name random --random-input-len 307680 --random-output-len 2048 --num-prompts 10 --max-concurrency 1 --request-rate inf --host 192.168.0.12 --port 8000 --served-model-name GLM5.2 --random-range-ratio 1.0
文章作者
上次更新 2026-07-04