PyTorch Knowledge Patch

Claude's baseline knowledge covers PyTorch through 2.5. This skill documents the changes introduced from PyTorch 2.6 through 2.11 (January 2025 through March 2026).

Quick Reference — Key API Changes

| Feature | API | Since |
| --- | --- | --- |
| Safe loading default | torch.load() now weights_only=True | 2.6 |
| Compile stance control | torch.compiler.set_stance("eager_on_recompile") | 2.6 |
| Custom Triton ops | @torch.library.triton_op("lib::name", mutates_args=()) | 2.6 |
| Auto dynamic shapes | Dim.AUTO in torch.export | 2.6 |
| Mega cache (portable) | torch.compiler.save_cache_artifacts() / load_cache_artifacts() | 2.7 |
| Context parallelism | context_parallel(mesh) context manager for SDPA | 2.7 |
| Foreach map | torch._foreach_map(fn, tensors, ...) | 2.7 |
| Control flow ops | cond, while_loop, scan, associative_scan | 2.8 |
| Hierarchical compile | torch.compiler.nested_compile_region() | 2.8 |
| DCP SafeTensors | dcp.FileSystemWriter(path, format="safetensors") | 2.8 |
| FSDP1 deprecated | Use fully_shard() (FSDP2) instead | 2.8 |
| Symmetric memory | torch.ops.symm_mem for in-kernel collectives | 2.9 |
| Graph break errors | torch._dynamo.error_on_graph_break() | 2.9 |
| Variable-length attn | varlen_attn(q, k, v, cu_seqlens_q, ...) | 2.10 |
| TorchScript deprecated | Use torch.export instead | 2.10 |
| Deterministic compile | torch.use_deterministic_algorithms(True) applies to compile | 2.10 |
| DebugMode | torch.debugging.DebugMode() for numerical debugging | 2.10 |
| Differentiable collectives | Functional collectives support backprop | 2.11 |
| FlexAttention + FA4 | Auto FA4 kernels on Hopper/Blackwell | 2.11 |
| CUDA 13 default | CUDA 12.8 builds via download.pytorch.org/whl/cu128 | 2.11 |

BREAKING: torch.load defaults to weights_only=True (2.6)

torch.load("file.pt") now uses weights_only=True by default. Loading full nn.Module objects will fail.

# Old code that breaks:
model = torch.load("model.pt")  # fails if saved with torch.save(model)

# Fix: load state_dict (recommended)
model.load_state_dict(torch.load("model.pt", weights_only=True))

# Fix: explicitly opt into unsafe loading
model = torch.load("model.pt", weights_only=False)

For checkpoints containing tensor subclasses or NumPy arrays, allowlist the relevant classes with torch.serialization.safe_globals (or add_safe_globals).
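
A minimal sketch of allowlisting, where MyTensorSubclass is a hypothetical class stored in the checkpoint:

import torch
from mymodule import MyTensorSubclass  # hypothetical: the class pickled into model.pt

# Option 1: allowlist for a single load
with torch.serialization.safe_globals([MyTensorSubclass]):
    ckpt = torch.load("model.pt")  # weights_only=True remains the default

# Option 2: allowlist for the whole process
torch.serialization.add_safe_globals([MyTensorSubclass])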

FSDP2: fully_shard (replaces FSDP1)

FSDP1 (FullyShardedDataParallel wrapper) is deprecated since 2.8. Use FSDP2:

from torch.distributed.fsdp import fully_shard

model = Transformer()
for layer in model.layers:
    fully_shard(layer)  # Shard each layer
fully_shard(model)       # Shard root

# Parameters become DTensors, sharded on dim-0
# Optimizer constructed AFTER fully_shard
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
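
Per-module options attach at each fully_shard call. A hedged sketch of mixed precision, assuming FSDP2's MixedPrecisionPolicy API:

import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard

# bf16 parameters for compute, fp32 gradient reduction
mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
for layer in model.layers:
    fully_shard(layer, mp_policy=mp)
fully_shard(model, mp_policy=mp)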

See references/distributed-training.md for context parallelism, symmetric memory, differentiable collectives, and SafeTensors DCP support.

torch.compile Improvements

Mega Cache — Portable Compilation Artifacts (2.7)

artifacts = torch.compiler.save_cache_artifacts()  # returns (bytes, CacheInfo), or None if the cache is empty
artifact_bytes, cache_info = artifacts
# Save artifact_bytes to disk, transfer to another machine...
torch.compiler.load_cache_artifacts(artifact_bytes)  # on the target machine, before compiling
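
A sketch of persisting the bytes to disk (the filename is illustrative):

with open("compile_cache.bin", "wb") as f:
    f.write(artifact_bytes)

# ...later, on the target machine:
with open("compile_cache.bin", "rb") as f:
    torch.compiler.load_cache_artifacts(f.read())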

Hierarchical Compilation — Compile Once, Reuse (2.8)

@torch.compiler.nested_compile_region
def layer_step(layer, x):
    return layer(x)  # Traced once; the compiled region is reused for every layer

@torch.compile
def model_forward(x):
    for layer in layers:
        x = layer_step(layer, x)
    return x

Control Flow Without Graph Breaks (2.8)

Five operators: cond, while_loop, scan, associative_scan, map.

from torch._higher_order_ops.cond import cond
from torch._higher_order_ops.scan import scan

result = cond(pred_tensor, true_fn, false_fn, operands)
carry, outputs = scan(combine_fn, init_carry, xs)
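
A minimal runnable sketch of cond under torch.compile (shapes and predicate are illustrative):

import torch
from torch._higher_order_ops.cond import cond

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile(fullgraph=True)
def f(x):
    # Both branches are traced; the data-dependent predicate causes no graph break
    return cond(x.sum() > 0, true_fn, false_fn, (x,))

f(torch.randn(4))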

error_on_graph_break() — Targeted Graph Break Errors (2.9)

with torch._dynamo.error_on_graph_break():
    # Errors on graph breaks here (unlike fullgraph which is all-or-nothing)
    compiled_fn(x)

See references/torch-compile.md for set_stance and deterministic mode.

torch.export & Custom Ops

Dim.AUTO — Automatic Dynamic Shapes (2.6)

from torch.export import export, Dim
ep = export(model, (x,), dynamic_shapes={"x": {0: Dim.AUTO}})
# Automatically infers min/max ranges, relations between dims, static/dynamic behavior
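
The exported program then runs at any size export inferred as dynamic; a quick check with illustrative shapes:

ep.module()(torch.randn(8, 16))   # batch of 8
ep.module()(torch.randn(32, 16))  # batch of 32: dim 0 was inferred dynamic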

torch.library.triton_op — Custom Triton Kernels (2.6)

@torch.library.triton_op("mylib::add", mutates_args=())
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(x)
    # launch triton kernel...
    return output
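
A fuller sketch with the launch filled in; the Triton kernel itself is illustrative, and torch.library.wrap_triton makes the launch traceable by torch.compile:

import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

@triton_op("mylib::add", mutates_args=())
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x and y are assumed to be CUDA tensors of the same shape
    output = torch.empty_like(x)
    n = output.numel()
    grid = (triton.cdiv(n, 1024),)
    wrap_triton(add_kernel)[grid](x, y, output, n, BLOCK=1024)
    return output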

See references/export-and-ops.md for foreach_map and TorchScript deprecation details.

Attention Ops

varlen_attn() — Variable-Length Sequences (2.10)

from torch.nn.attention.varlen import varlen_attn

# q, k, v are packed (total_tokens, num_heads, head_dim)
# cu_seqlens marks sequence boundaries: [0, seq1_len, seq1_len+seq2_len, ...]
output = varlen_attn(q, k, v, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k)
# Supports forward + backward, torch.compile-able. Requires A100+, BF16/FP16.
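
A sketch of packing three sequences and building the cumulative offsets (lengths are illustrative):

import torch

lengths = torch.tensor([5, 3, 7])                 # per-sequence token counts
cu_seqlens = torch.zeros(len(lengths) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(lengths, dim=0)     # [0, 5, 8, 15]
max_seqlen = int(lengths.max())                   # 7
# Pack q/k/v as (15, num_heads, head_dim): all tokens concatenated along dim 0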

FlexAttention + FlashAttention-4 Backend (2.11)

On Hopper and Blackwell GPUs, FlexAttention automatically dispatches to FlashAttention-4 kernels, a 1.2x–3.2x speedup over the Triton backend on compute-bound workloads. No code changes are needed; the selection happens inside flex_attention().
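
Nothing changes at the call site; a sketch with illustrative shapes:

import torch
from torch.nn.attention.flex_attention import flex_attention

q = k = v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
out = torch.compile(flex_attention)(q, k, v)  # FA4 kernels picked automatically on Hopper/Blackwell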

See references/attention.md for details.

Numerical Debugging — DebugMode (2.10)

from torch.debugging import DebugMode

with DebugMode():
    output = model(x)
# Logs all dispatched ops with tensor hashes
# Compare hashes between two runs to find divergence point

Deprecations & Compatibility

  • TorchScript (2.10): Use torch.export instead of torch.jit.script/torch.jit.trace. Use ExecuTorch for embedded runtime.
  • FSDP1 (2.8): Use fully_shard() (FSDP2) instead of FullyShardedDataParallel.

Environment

  • CUDA 13 is the default since 2.11. CUDA 12.8 builds available via download.pytorch.org/whl/cu128.
  • Python 3.14 supported since 2.10. Python 3.14t (free-threaded) experimentally supported.
  • Deterministic compile (2.10): torch.use_deterministic_algorithms(True) now applies to torch.compile (sketched below).
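
A minimal sketch of opting in (module and input are illustrative):

import torch

torch.use_deterministic_algorithms(True)   # now also constrains kernels chosen by torch.compile (2.10)
model = torch.nn.Linear(16, 16).cuda()
compiled_model = torch.compile(model)
out = compiled_model(torch.randn(4, 16, device="cuda"))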

See references/environment.md for details on all compatibility changes.

Reference Files

| File | Contents |
| --- | --- |
| torch-compile.md | set_stance, mega cache, hierarchical compilation, control flow ops, error_on_graph_break, deterministic mode |
| distributed-training.md | FSDP2 fully_shard, context parallelism, symmetric memory, differentiable collectives, SafeTensors DCP |
| export-and-ops.md | Dim.AUTO, triton_op, TorchScript deprecation, foreach_map |
| attention.md | varlen_attn for packed sequences, FlexAttention + FA4 backend |
| environment.md | weights_only=True breaking change, CUDA 13 default, Python 3.14, DebugMode |