torch_npu capabilities and usage guide

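When to use this Skill

The user invokes @torch_npu, or mentions Ascend NPUs, CANN, Ascend, or running PyTorch on an NPU.
The user needs environment checking or deployment: checking or deploying PyTorch, or checking whether the environment supports NPU.
The user invokes @torch_npu_doc: use this skill's reference.md to describe the capabilities and procedures covered by the project's Chinese documentation.
The user needs code hints for NPU format conversion: when writing or completing torch_npu.npu_format_cast, npu_format_cast_, torch_npu.Format, or get_npu_format, provide parameter and enum hints as described in the §2.1 sub-capability; if the feature is unclear, suggest fetching the versioned documentation through this skill's MCP to support the explanation.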

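1. Environment check and deployment

1.1 Automatically checking the PyTorch and NPU environment

When answering or generating scripts, run the following checks, or suggest that the user run them, as needed:

Check that the PyTorch and Python versions are a supported pairing (see the "PyTorch与Python版本配套表" (PyTorch and Python version compatibility table) in README.zh.md):

Multiple PyTorch versions are supported, including 1.11.0 through 2.9.0, paired with Python 3.7 through 3.11 (the exact Python range depends on the PyTorch version).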
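
A coarse pre-check against the ranges quoted above could look like the sketch below. The helper names `parse_version` and `in_coarse_supported_range` are hypothetical, and the bounds encode only the outer limits stated here (PyTorch 1.11.0 to 2.9.0, Python 3.7 to 3.11). Passing this check does not replace looking up the exact pairing in README.zh.md.

```python
import re
import sys

# Outer bounds quoted in this guide; the exact per-version pairing lives in README.zh.md.
TORCH_RANGE = ((1, 11, 0), (2, 9, 0))
PYTHON_RANGE = ((3, 7), (3, 11))


def parse_version(text, parts):
    """Return the first `parts` numeric components of a version string as a tuple."""
    numbers = re.findall(r"\d+", text.split("+")[0])  # drop local tags such as "+cpu"
    padded = [int(n) for n in numbers[:parts]] + [0] * parts
    return tuple(padded[:parts])


def in_coarse_supported_range(torch_version, python_version=None):
    """Check a PyTorch/Python pair against the outer bounds only (hypothetical helper)."""
    if python_version is None:
        python_version = "%d.%d" % sys.version_info[:2]
    torch_ok = TORCH_RANGE[0] <= parse_version(torch_version, 3) <= TORCH_RANGE[1]
    python_ok = PYTHON_RANGE[0] <= parse_version(python_version, 2) <= PYTHON_RANGE[1]
    return torch_ok and python_ok
```

In a live environment the first argument would typically be `torch.__version__`.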

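Check whether the environment supports NPU:

import torch
import torch_npu  # explicit import is optional from 2.5.1 onward; still recommended for compatibility

# Check whether an NPU is available and how many devices there are
if torch.npu.is_available():
    count = torch.npu.device_count()
    # Use device='npu' or .npu()
else:
    # CANN not installed / set_env.sh not sourced / no NPU device
    pass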
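
When the check has to run on machines that may not have torch or torch_npu installed at all, a more defensive variant can report why the NPU is unusable instead of failing at import time. The following is a sketch under that assumption; `probe_npu` is a hypothetical helper name, and the only torch calls it uses are the ones shown above.

```python
import importlib
import importlib.util


def probe_npu():
    """Return (available, device_count, reason) without raising if packages are missing."""
    if importlib.util.find_spec("torch") is None:
        return False, 0, "torch is not installed"
    if importlib.util.find_spec("torch_npu") is None:
        return False, 0, "torch_npu is not installed"
    try:
        torch = importlib.import_module("torch")
        importlib.import_module("torch_npu")  # makes torch.npu usable
        if torch.npu.is_available():
            return True, torch.npu.device_count(), "ok"
        return False, 0, "no usable NPU (CANN missing, set_env.sh not sourced, or no device)"
    except Exception as exc:  # import-time errors, e.g. missing CANN environment variables
        return False, 0, f"{type(exc).__name__}: {exc}"
```

The returned reason string can be shown to the user as the starting point for troubleshooting.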

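Check the CANN environment variables (post-install verification):

Before use, run: source /usr/local/Ascend/ascend-toolkit/set_env.sh (adjust the path to the actual CANN installation).
If ASCEND_HOME_PATH or ASCEND_OPP_PATH is unset or points to a path that does not exist, torch_npu raises an error and prompts you to run source set_env.sh.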
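
The same condition can be tested before importing torch_npu. This is a minimal sketch that checks only the two variables named above; `missing_cann_vars` is a hypothetical helper, and the optional `environ` argument exists only so that the function can be exercised without touching the real environment.

```python
import os

REQUIRED_VARS = ("ASCEND_HOME_PATH", "ASCEND_OPP_PATH")


def missing_cann_vars(environ=None):
    """Return the names of required variables that are unset or point to a missing path."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in REQUIRED_VARS:
        value = environ.get(name)
        if not value or not os.path.exists(value):
            problems.append(name)
    return problems
```

An empty list means both variables are set and their paths exist; otherwise suggest sourcing set_env.sh.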

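1.2 Deployment steps at a glance

Install CANN: follow the CANN installation guide, and check the CANN and PyTorch/Extension versions against the "昇腾辅助软件" (Ascend auxiliary software) table in README.zh.md.
Install PyTorch: on x86, pip3 install torch==2.7.1+cpu --index-url https://download.pytorch.org/whl/cpu; on aarch64, pip3 install torch==2.7.1 (example version; substitute as needed).
Install dependencies: pip3 install pyyaml setuptools.
Install torch_npu: pip3 install torch-npu==2.7.1 (the version must match PyTorch and CANN).
Verify: after source set_env.sh, run the quick verification code (see "安装后验证" (post-install verification) in README.zh.md).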
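
When generating an install script, the architecture-dependent PyTorch command can be selected programmatically. The sketch below only assembles the command strings from the steps above and does not run them. `install_commands` is a hypothetical helper, and it assumes the torch-npu version string equals the torch version, as in the 2.7.1 example; actual releases may differ, so confirm against the compatibility table.

```python
import platform


def install_commands(version="2.7.1", machine=None):
    """Build the pip3 commands from the steps above for the given CPU architecture."""
    machine = (machine or platform.machine()).lower()
    if machine in ("x86_64", "amd64"):
        torch_cmd = (f"pip3 install torch=={version}+cpu "
                     "--index-url https://download.pytorch.org/whl/cpu")
    elif machine in ("aarch64", "arm64"):
        torch_cmd = f"pip3 install torch=={version}"
    else:
        raise ValueError(f"unsupported architecture for this sketch: {machine}")
    return [
        torch_cmd,
        "pip3 install pyyaml setuptools",
        f"pip3 install torch-npu=={version}",  # assumption: same version string as torch
    ]
```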

2. torch_npu capability catalog (brief)

Category	Capabilities
Devices and memory	`torch.npu`: device management, `device_count`, `current_device`, `set_device`, `synchronize`, `Stream`/`Event`, and memory statistics and allocation (`memory_allocated`, `empty_cache`, `MemPool`, etc.).
Tensors/storage	`tensor.npu()`, `tensor.is_npu`, NPU Storage, NPU support in `torch.save`/`load` serialization, and DDP/multiprocessing reductions.
Training/optimization	`torch.npu.amp` mixed precision, `torch_npu.optim`, FSDP patches (`ShardedGradScaler`), and gradient checkpointing that defaults to NPU.
Distributed	`torch_npu.distributed`: HCCL/LCCL backends, `is_hccl_available`, `reinit_process_group`, RPC, symmetric memory, and DTensor rules.
Extension APIs	`torch_npu.contrib`: NMS, the IoU family, ROIAlign, DCN, FusedAttention, and custom modules (such as `DropoutWithByteMask`).
Graphs and compilation	NPU Graph (`npugraphify`), Dynamo, Inductor, and torch.compile support for NPU.
Inference/ONNX	ONNX export and wrappers for NPU custom operators (such as OneHot, RoiAlign, NMS, FastGelu, and MultiHeadAttention).
Profiling	`torch_npu.profiler`, MSTX patches, and performance dumps.
Other	HiFloat8Tensor, erase_stream, matmul_checksum, transfer_to_npu (optional), and op_plugin.

For detailed APIs, defer to the Ascend Extension for PyTorch custom API reference and the project's README.zh.md.

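2.1 Sub-capability: code hints for torch_npu.npu_format_cast

When the user writes or asks about torch_npu.npu_format_cast, npu_format_cast_, or NPU tensor format conversion, provide the following code hints and completion guidance so that each hint can be surfaced in the IDE.

API signatures and parameters

torch_npu.npu_format_cast(tensor, acl_format, customize_dtype=None)
- tensor: a torch.Tensor on the NPU (call .npu() first).
- acl_format: the target storage format, given as an int or a torch_npu.Format enum member.
- customize_dtype: optional; a custom dtype for scenarios such as ONNX.
- Returns: a new tensor (the original tensor is not modified).
torch_npu.npu_format_cast_(tensor, acl_format)
- Same as above, but in-place: it changes the format of tensor directly.
torch_npu.get_npu_format(tensor)
- Returns the tensor's current NPU storage format (a torch_npu.Format member or an integer).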

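Common Format enum members (torch_npu.Format)

In code hints, prioritize the following common values (from torch_npu.npu._format.Format):

Enum member	Value	Typical use
`Format.NCHW`	0	Default 4D convolution layout
`Format.NHWC`	1	Channels-last 4D layout
`Format.ND`	2	General ND layout
`Format.NC1HWC0`	3	Common for operators such as Conv and BatchNorm
`Format.FRACTAL_Z`	4	3D convolution, among others
`Format.FRACTAL_NZ`	29	Linear layers, matrix multiplication, Attention weights, etc.
`Format.NDC1HWC0`	32	5D
`Format.FRACTAL_Z_3D`	33	3D convolution
`Format.UNDEFINED`	-1	Undefined

Other options include NC1HWC0_C04 (12), HWCN (16), NDHWC (27), NCDHW (30), NC (35), NCL (47), and FRACTAL_NZ_C0_* (50–54).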
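
Because acl_format accepts either an int or an enum member, it can help to turn a raw format integer back into a readable name when explaining code. The sketch below uses a local stand-in enum, `FormatSketch`, that mirrors only the common values listed above. It is not the real torch_npu.Format and is defined here only so that the example runs without an NPU; `describe_format` is likewise a hypothetical helper.

```python
from enum import IntEnum


class FormatSketch(IntEnum):
    """Local stand-in mirroring the values listed above; not the real torch_npu.Format."""
    UNDEFINED = -1
    NCHW = 0
    NHWC = 1
    ND = 2
    NC1HWC0 = 3
    FRACTAL_Z = 4
    FRACTAL_NZ = 29
    NDC1HWC0 = 32
    FRACTAL_Z_3D = 33


def describe_format(value):
    """Map a raw format integer to its enum name, or report it as unlisted."""
    try:
        return FormatSketch(value).name
    except ValueError:
        return f"unlisted format {value}"
```

Since the stand-in is an IntEnum, its members compare equal to the plain integers, which matches the guide's statement that either form may be passed.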

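Code hint and completion rules

Completing the second argument: when the user types torch_npu.npu_format_cast(x, and stops there, hint that acl_format can be an int or torch_npu.Format.xxx, and list the common enum members (such as Format.NCHW, Format.NHWC, Format.FRACTAL_NZ, and Format.NC1HWC0).
Completing the Format enum: when the user types torch_npu.Format. and stops there, list the enum members above.
Paired use: if the code already calls get_npu_format(t) and another tensor needs the same format, suggest torch_npu.npu_format_cast(other, torch_npu.get_npu_format(t)).
Common scenarios:
- Quantizing linear-layer weights or migrating them to NPU: torch_npu.npu_format_cast(weight.npu(), 29) (FRACTAL_NZ);
- Gradients in the same format as their parameters: torch_npu.npu_format_cast(p.grad, torch_npu.get_npu_format(p));
- NC1HWC0 for BN/Conv when migrating modules: torch_npu.npu_format_cast(tensor, 3) or Format.NC1HWC0.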
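
The gradient scenario above can be wrapped in a small loop. This is a sketch only: `align_grad_formats` is a hypothetical helper, and it takes the operations as an argument (`npu_ops`, normally the imported torch_npu module) so that the logic can be read and tested without an NPU. It relies solely on get_npu_format and npu_format_cast as described in this section.

```python
def align_grad_formats(params, npu_ops):
    """Cast each gradient to its parameter's NPU format; return how many were cast.

    `npu_ops` is any object exposing get_npu_format and npu_format_cast,
    normally the imported torch_npu module.
    """
    cast_count = 0
    for p in params:
        if p.grad is None:
            continue  # parameter has no gradient yet
        target = npu_ops.get_npu_format(p)
        if npu_ops.get_npu_format(p.grad) != target:
            p.grad = npu_ops.npu_format_cast(p.grad, target)
            cast_count += 1
    return cast_count
```

On a real device the call would be along the lines of `align_grad_formats(model.parameters(), torch_npu)` after the backward pass.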

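Documentation sources

For a more authoritative description of the formats or of how they correspond to CANN, have the user call fetch_torch_npu_doc through this skill's MCP (see mcp/README.md) to fetch the versioned documentation from the matching branch of GitCode Ascend/pytorch (for example, the documents under docs/zh that cover formats or operator adaptation), or look up the "框架特性指南" (framework feature guide) entries in the reference.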

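3. How to surface documented capabilities automatically (@torch_npu_doc)

When the user invokes @torch_npu_doc or explicitly asks to consult the torch_npu Chinese documentation:
- Prefer this skill's reference document, reference.md, which contains an index of all Chinese documentation in the project along with summaries of the key content.
- Based on the user's question, locate the relevant document path in the "文档索引" (document index) of reference.md (for example torch_npu/README.zh.md, torch_npu/docs/zh/quick_start/quick_start.md, torch_npu/docs/zh/framework_feature_guide_pytorch/, or torch_npu/docs/zh/troubleshooting/), then read that file in the workspace as needed and quote the relevant passages.
- For installation, verification, uninstallation, and version compatibility, use README.zh.md or the summaries in the reference; for migration and quick start, use quick_start; for operator adaptation, memory, performance, and graph mode, use the corresponding documents under framework_feature_guide_pytorch; for errors and failures, use the corresponding documents under troubleshooting.
Answer concisely in Chinese, and cite the path of each referenced document so the user can look it up.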
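
The routing rules above amount to a topic-to-path lookup. The following sketch illustrates that idea with a keyword table; the English keywords, the `DOC_ROUTES` table, and the `route_doc` helper are all illustrative assumptions, while the paths are the ones listed in this section. Unmatched questions fall back to reference.md, which is the preferred starting point in any case.

```python
# Topic keywords -> documentation location, taken from the routing rules above.
# The keywords themselves are illustrative assumptions.
DOC_ROUTES = {
    "install": "torch_npu/README.zh.md",
    "verify": "torch_npu/README.zh.md",
    "version": "torch_npu/README.zh.md",
    "migration": "torch_npu/docs/zh/quick_start/quick_start.md",
    "quick start": "torch_npu/docs/zh/quick_start/quick_start.md",
    "operator": "torch_npu/docs/zh/framework_feature_guide_pytorch/",
    "memory": "torch_npu/docs/zh/framework_feature_guide_pytorch/",
    "performance": "torch_npu/docs/zh/framework_feature_guide_pytorch/",
    "graph": "torch_npu/docs/zh/framework_feature_guide_pytorch/",
    "error": "torch_npu/docs/zh/troubleshooting/",
}
DEFAULT_DOC = "reference.md"


def route_doc(question):
    """Return the first documentation path whose keyword appears in the question."""
    text = question.lower()
    for keyword, path in DOC_ROUTES.items():
        if keyword in text:
            return path
    return DEFAULT_DOC
```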

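4. Reference resources

Skill reference document: reference.md. It consolidates the index and key points of all Chinese documentation in the torch_npu project; when the user invokes @torch_npu_doc, use it to locate and quote the specific document.
Project README (Chinese): torch_npu/README.zh.md in the workspace (version tables, deployment, verification, and links to reference documentation).
Official documentation: on the Ascend community site, Ascend Extension for PyTorch, the software installation guide, model migration and training, operator adaptation, the list of native PyTorch interfaces, and the custom API reference (see the table at the bottom of README.zh.md).
Fetching versioned documentation from GitCode: this skill provides an MCP service that fetches documentation from GitCode Ascend/pytorch by branch and saves it to fetched_docs/. See mcp/README.md for installation and configuration; the project already includes a .cursor/mcp.json, so after installing the dependencies and restarting Cursor you can use the tools fetch_torch_npu_doc, fetch_torch_npu_docs_batch, and list_torch_npu_doc_paths.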