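# Important Updates
PaddlePaddle 3.3 continues to advance large-model training efficiency, developer experience, and support for domestic (Chinese) hardware. It delivers major upgrades in compute and GPU-memory utilization, training-to-inference conversion, ecosystem compatibility, debugging efficiency, and domestic hardware adaptation, strengthening large-model training and inference across the board.
## Training Efficiency Breakthroughs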

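* **FlashMask V3 upgrade**: The compute kernel of FlashMask V3, Paddle's sparse attention-mask technique, has been deeply optimized. It outperforms FlexAttention across the board, with operator performance leading by up to 2.1x. It natively supports context parallelism and adds a compute load-balancing mechanism; in distributed settings the operator is 80% faster than Megatron-LM, strengthening long-sequence training.
* **FlexCheckpoint automatic parameter sharding and reassembly system**: Built on AOA (All in One Arrow), a first-of-its-kind lightweight description language, it lets users describe complex weight conversions from a single-device perspective and derives the shard mapping automatically. With highly concurrent, load-balanced scheduling of cross-node communication, weight conversion at large parameter scales outperforms Megatron-LM by more than 1.2x, addressing the cost and efficiency of converting parameters between the training and inference stages of large models.
* **Dynamic defragmentation with virtual memory**: Introduces a GPU-memory allocation mechanism based on virtual memory that defragments dynamically according to runtime memory usage. In mainstream MoE model training, the memory fragmentation rate drops from over 10% to as low as 3%, significantly improving memory utilization.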

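## Developer Experience Improvements

* **Ecosystem compatibility**: Compatibility work across the framework API, operator registration, and execution scheduling paths allows operators from external ecosystems to be used seamlessly, and supports efficient integration of community high-performance modules such as FlashInfer, FlashMLA, DeepGEMM, DeepEP, TorchCodec, and SonicMoE.
* **Dynamic-graph debugging upgrade**: Adds visualization of the forward and backward computation graphs in dynamic-graph mode, with export of operator call stacks and tensor MD5 checksums. Logging on critical paths has been systematically cleaned up, and scoped (local) log printing has been added, making debugging information richer and easier to obtain.
* **GPU memory inspection tool**: Adds memory inspection that visualizes the distribution of memory blocks in the memory pool and tracks memory allocations/frees and global state for a specific code region, helping to pinpoint and resolve memory anomalies in large models.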

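## In-Depth Domestic Hardware Adaptation

* **Kunlunxin XPU**: Systematically improves MoE support, adds bool, bfloat16, and complex64 data-type support for the relevant operators, and deeply adapts modules such as FlashAttention, DeepEP, and Profiler.
* **Hygon DCU**: Supports the Hygon math-library backend, further improving inference performance on Hygon chips.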

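## 1. Execution and Scheduling Mechanisms
Weight conversion across large-model pre-training, post-training, and inference is difficult because distributed strategies and network construction differ between stages. To address this, Paddle's FlexCheckpoint mechanism introduces an efficient weight-reassembly method and AOA, a flexible set of model-editing primitives, providing efficient, unified distributed parameter conversion and reassembly from model development through production. It covers weight-conversion needs in scenarios such as training-to-inference handoff, resuming from checkpoints across strategies, loading and exporting ecosystem-compatible formats, and parameter synchronization in reinforcement learning. At large parameter scales its conversion performance exceeds Megatron-LM by more than 1.2x, directly addressing the high cost and low efficiency of distributed parameter conversion.

For GPU memory management, MoE models suffer from high fragmentation and wasted resources because expert routing is dynamic. This release introduces a high-performance VMM Allocator based on virtual memory management. While the model runs, the allocator adaptively defragments according to system memory usage, significantly improving memory utilization.

Paddle 3.3 also continues to invest in automatic parallelism and AI compiler technology. The auto-parallel architecture adds support for strategies such as FSDP and strengthens pipeline parallelism with dynamic shapes. For scientific computing, it adds support for higher-order derivatives and extends the coverage of sharding-inference rules for the relevant operators, making the auto-parallel architecture more general and easier to use. For inference, the CINN AI compiler and the SOT dynamic-to-static feature have been optimized, significantly improving usability and execution-scheduling performance.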
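
The shard-mapping idea behind resharding can be illustrated with a toy example. The sketch below is not FlexCheckpoint's implementation and does not use AOA syntax; it assumes a 1-D weight split evenly and contiguously across ranks, and the helper names `shard_ranges` and `reshard_plan` are hypothetical. It computes, for each destination rank, which slices it would need from which source ranks when the shard count changes.

```python
from typing import Dict, List, Tuple


def shard_ranges(length: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split [0, length) into num_shards contiguous half-open ranges."""
    # Assumption for this toy example: the length divides evenly.
    assert length % num_shards == 0, "toy example requires an even split"
    step = length // num_shards
    return [(i * step, (i + 1) * step) for i in range(num_shards)]


def reshard_plan(
    length: int, src_shards: int, dst_shards: int
) -> Dict[int, List[Tuple[int, int, int]]]:
    """For each destination rank, list the (src_rank, start, end) slices it needs."""
    src = shard_ranges(length, src_shards)
    dst = shard_ranges(length, dst_shards)
    plan: Dict[int, List[Tuple[int, int, int]]] = {}
    for d, (d_lo, d_hi) in enumerate(dst):
        pieces = []
        for s, (s_lo, s_hi) in enumerate(src):
            # Overlap of the destination range with this source range.
            lo, hi = max(d_lo, s_lo), min(d_hi, s_hi)
            if lo < hi:
                pieces.append((s, lo, hi))
        plan[d] = pieces
    return plan


plan = reshard_plan(length=12, src_shards=4, dst_shards=3)
for dst_rank, pieces in plan.items():
    print(dst_rank, pieces)
```

With 12 elements moving from 4 shards to 3, destination rank 0 needs all of source rank 0 plus the first element of source rank 1. A real system must also handle multi-dimensional layouts, uneven splits, and communication scheduling.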

**New Features**
* FlexCheckpoint supports merging parameters online. [#75613](https://github.com/PaddlePaddle/Paddle/pull/75613), [#76510](https://github.com/PaddlePaddle/Paddle/pull/76510)
* Implements a high-performance VMM Allocator based on virtual memory management. [#75323](https://github.com/PaddlePaddle/Paddle/pull/75323), [#76222](https://github.com/PaddlePaddle/Paddle/pull/76222), [#76223](https://github.com/PaddlePaddle/Paddle/pull/76223), [#76389](https://github.com/PaddlePaddle/Paddle/pull/76389), [#76430](https://github.com/PaddlePaddle/Paddle/pull/76430), [#76454](https://github.com/PaddlePaddle/Paddle/pull/76454), [#76523](https://github.com/PaddlePaddle/Paddle/pull/76523), [#76544](https://github.com/PaddlePaddle/Paddle/pull/76554), [#76730](https://github.com/PaddlePaddle/Paddle/pull/76730), [#76793](https://github.com/PaddlePaddle/Paddle/pull/76793), [#77196](https://github.com/PaddlePaddle/Paddle/pull/77196)
* Auto-parallel supports higher-order differentiation. [#75689](https://github.com/PaddlePaddle/Paddle/pull/75689)
* Auto-parallel supports the FSDP strategy. [#76113](https://github.com/PaddlePaddle/Paddle/pull/76113), [#76868](https://github.com/PaddlePaddle/Paddle/pull/76868)
* Auto-parallel optimizes and enhances automatic sharding inference for operators including argsort, bmm, elementwise, index_select, matmul, softmax, tile, and transpose. [#74826](https://github.com/PaddlePaddle/Paddle/pull/74826), [#74829](https://github.com/PaddlePaddle/Paddle/pull/74829), [#75036](https://github.com/PaddlePaddle/Paddle/pull/75036), [#75044](https://github.com/PaddlePaddle/Paddle/pull/75044), [#75050](https://github.com/PaddlePaddle/Paddle/pull/75050), [#75095](https://github.com/PaddlePaddle/Paddle/pull/75095), [#75246](https://github.com/PaddlePaddle/Paddle/pull/75246), [#75265](https://github.com/PaddlePaddle/Paddle/pull/75265), [#75555](https://github.com/PaddlePaddle/Paddle/pull/75555)
* Dynamic-to-static supports partial capture of control flow. [#75548](https://github.com/PaddlePaddle/Paddle/pull/75548), [#76198](https://github.com/PaddlePaddle/Paddle/pull/76198)
* Dynamic-to-static supports Python 3.14 syntax. [#75853](https://github.com/PaddlePaddle/Paddle/pull/75853), [#75879](https://github.com/PaddlePaddle/Paddle/pull/75879), [#75971](https://github.com/PaddlePaddle/Paddle/pull/75971), [#76072](https://github.com/PaddlePaddle/Paddle/pull/76072), [#76257](https://github.com/PaddlePaddle/Paddle/pull/76257), [#76288](https://github.com/PaddlePaddle/Paddle/pull/76288), [#76320](https://github.com/PaddlePaddle/Paddle/pull/76320), [#76416](https://github.com/PaddlePaddle/Paddle/pull/76416), [#76451](https://github.com/PaddlePaddle/Paddle/pull/76451), [#76804](https://github.com/PaddlePaddle/Paddle/pull/76804)
* Supports registering Python functions on the static-graph PIR, providing a graph representation for converting JIT operators such as Triton and DeepGEMM to static graphs. [#76888](https://github.com/PaddlePaddle/Paddle/pull/76888), [#76938](https://github.com/PaddlePaddle/Paddle/pull/76938)
* Dynamic graph supports higher-order differentiation through the backward of view operations. [#76667](https://github.com/PaddlePaddle/Paddle/pull/76667)
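
Why virtual memory helps with fragmentation can be shown with a toy model. The sketch below is an illustration under simplifying assumptions, not the VMM Allocator's implementation: memory is modeled as a row of equal-size physical pages, and all function names are hypothetical. A conventional allocator needs a run of adjacent free pages, whereas an allocator that can map scattered physical pages into one contiguous virtual range only needs enough free pages in total.

```python
from typing import List


def largest_free_run(pages: List[bool]) -> int:
    """Length of the longest run of consecutive free pages (True means free)."""
    best = cur = 0
    for free in pages:
        cur = cur + 1 if free else 0
        best = max(best, cur)
    return best


def can_alloc_contiguous(pages: List[bool], n: int) -> bool:
    """Conventional model: the request needs n physically adjacent free pages."""
    return largest_free_run(pages) >= n


def can_alloc_virtual(pages: List[bool], n: int) -> bool:
    """Virtual-memory model: any n free pages can be mapped into one virtual range."""
    return sum(pages) >= n


# Five free pages in total, but never more than two in a row.
pages = [True, False, True, True, False, True, False, True]
print("total free pages:", sum(pages))
print("largest contiguous free run:", largest_free_run(pages))
print("4-page request, contiguous model:", can_alloc_contiguous(pages, 4))
print("4-page request, virtual model:", can_alloc_virtual(pages, 4))
```

In this layout a 4-page request fails under the contiguous model even though five pages are free, which is the waste that fragmentation causes; the virtual model can satisfy it.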

**Enhancements**
* FlexCheckpoint improves AOA macro expansion: fuse-type macros accept an axis attribute. [#75282](https://github.com/PaddlePaddle/Paddle/pull/75282)
* FlexCheckpoint enhances AOA parsing: optimizer state and model state share one set of AOA annotations, and sharding-information propagation, model loading, and model saving share one set of AOA annotations. [#75613](https://github.com/PaddlePaddle/Paddle/pull/75613), [#76013](https://github.com/PaddlePaddle/Paddle/pull/76013), [#76437](https://github.com/PaddlePaddle/Paddle/pull/76437)
* FlexCheckpoint supports the ShardingStage2 and ShardingStage3 strategies. [#76309](https://github.com/PaddlePaddle/Paddle/pull/76309), [#76538](https://github.com/PaddlePaddle/Paddle/pull/76538)
* Improves FlexCheckpoint error messages. [#76813](https://github.com/PaddlePaddle/Paddle/pull/76813), [#77266](https://github.com/PaddlePaddle/Paddle/pull/77266)
* The pipeline hook in the auto-parallel intermediate-level API can handle tuple objects. [#75081](https://github.com/PaddlePaddle/Paddle/pull/75081)
* The auto-parallel pipeline-parallel strategy supports dynamic shapes. [#75724](https://github.com/PaddlePaddle/Paddle/pull/75724)
* Supports the FlexCheckpoint mechanism and sharded handling of the optimizer state dict in auto-parallel scenarios. [#76240](https://github.com/PaddlePaddle/Paddle/pull/76240), [#76305](https://github.com/PaddlePaddle/Paddle/pull/76305)
* Upgrades DLPack to v1.2 with full TVM FFI support, including new features such as the C function exchange protocol, DataType exchange protocol, and Device exchange protocol. [#75193](https://github.com/PaddlePaddle/Paddle/pull/75193), [#75205](https://github.com/PaddlePaddle/Paddle/pull/75205), [#75650](https://github.com/PaddlePaddle/Paddle/pull/75650), [#75854](https://github.com/PaddlePaddle/Paddle/pull/75854), [#75973](https://github.com/PaddlePaddle/Paddle/pull/75973), [#76828](https://github.com/PaddlePaddle/Paddle/pull/76828), [#76673](https://github.com/PaddlePaddle/Paddle/pull/76673)
* Changes the return type of ComparePriority from bool to int to satisfy the strict weak ordering required by std::sort, and introduces the SortComparePriority wrapper to keep sorting correct. [#76027](https://github.com/PaddlePaddle/Paddle/pull/76027)
* Dynamic-to-static supports disabling the automatic fallback on compilation timeout. [#76386](https://github.com/PaddlePaddle/Paddle/pull/76386)
* Unifies the logic of EqualAllOpInferSymbolicShape with InferMeta, improving consistency between the compiler and the framework's dynamic-graph scheduling results. [#76477](https://github.com/PaddlePaddle/Paddle/pull/76477)
* Adds the infer_symbolic_shape interface to operators such as all_reduce, c_allreduce_sum, c_concat, c_identity, flash_attn_unpadded, and mp_allreduce_sum, enabling symbolic shape inference. [#76783](https://github.com/PaddlePaddle/Paddle/pull/76783), [#76836](https://github.com/PaddlePaddle/Paddle/pull/76836)
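
The strict weak ordering requirement mentioned for ComparePriority can be sketched as follows. This is a Python analogue, not Paddle's C++ code: `compare_priority` and `sort_compare_priority` are hypothetical stand-ins that mirror the idea of a three-way int comparison plus a strict less-than wrapper, and the `priority` field is assumed for illustration. The key property is that the wrapper returns False for ties in both directions, which a comparator passed to std::sort must do.

```python
from functools import cmp_to_key


def compare_priority(a: dict, b: dict) -> int:
    """Three-way comparison: -1 if a sorts first, 1 if b sorts first, 0 if tied."""
    return (a["priority"] > b["priority"]) - (a["priority"] < b["priority"])


def sort_compare_priority(a: dict, b: dict) -> bool:
    """Strict 'less than' built on the three-way result; ties yield False."""
    return compare_priority(a, b) < 0


# Hypothetical items; 'matmul' and 'relu' have equal priority.
ops = [
    {"name": "matmul", "priority": 2},
    {"name": "add", "priority": 1},
    {"name": "relu", "priority": 2},
]

ordered = sorted(ops, key=cmp_to_key(compare_priority))
print([op["name"] for op in ordered])

# For tied elements, neither is 'less than' the other.
print(sort_compare_priority(ops[0], ops[2]), sort_compare_priority(ops[2], ops[0]))
```

A bool-returning comparison that answers True for equal elements would claim both `a < b` and `b < a` for ties, which is the inconsistency the three-way form avoids.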

**Performance Optimizations**
* Defers the communication that reassembles flattened weights from FlexCheckpoint's save phase to its load phase, reducing the cost of frequent saves. [#75613](https://github.com/PaddlePaddle/Paddle/pull/75613)
* FlexCheckpoint supports Grouped Send Recv for parameter-resharding communication. [#76779](https://github.com/PaddlePaddle/Paddle/pull/76779), [#76810](https://github.com/PaddlePaddle/Paddle/pull/76810)
* FlexCheckpoint supports redundant weight storage, reducing weight-resharding time on warm start. [#76857](https://github.com/PaddlePaddle/Paddle/pull/76857)
* CINN supports caching compiled kernels, reducing compilation time. [#75989](https://github.com/PaddlePaddle/Paddle/pull/75989), [#76825](https://github.com/PaddlePaddle/Paddle/pull/76825), [#76853](https://github.com/PaddlePaddle/Paddle/pull/76853)
* Reduces the subgraph-break rate in dynamic-to-static conversion, improving overall conversion performance. [#76104](https://github.com/PaddlePaddle/Paddle/pull/76104), [#76354](https://github.com/PaddlePaddle/Paddle/pull/76354), [#76641](https://github.com/PaddlePaddle/Paddle/pull/76641), [#76862](https://github.com/PaddlePaddle/Paddle/pull/76862)
* CINN simplifies overly long shape expressions to reduce symbolic-inference time. [#76969](https://github.com/PaddlePaddle/Paddle/pull/76969)

**Bug Fixes**
* Fixes some FlexCheckpoint features not working in non-distributed environments. [#75413](https://github.com/PaddlePaddle/Paddle/pull/75413), [#76272](https://github.com/PaddlePaddle/Paddle/pull/76272)
* Fixes a parsing error in the AOA transpose feature. [#76234](https://github.com/PaddlePaddle/Paddle/pull/76234)
* Fixes load imbalance in FlexCheckpoint's file-read arbitration. [#76536](https://github.com/PaddlePaddle/Paddle/pull/76536)
* Fixes excessively long AOA parsing time. [#76639](https://github.com/PaddlePaddle/Paddle/pull/76639)
* Fixes logs falsely reporting weight shape mismatches when FlexCheckpoint loads. [#76958](https://github.com/PaddlePaddle/Paddle/pull/76958)
* Fixes shard_dataloader compatibility with non-Tensor data. [#75252](https://github.com/PaddlePaddle/Paddle/pull/75252)
* Fixes a segmentation fault in distributed-tensor conversion when an auto-parallel backward operator has both DenseTensor and DistTensor inputs. [#75691](https://github.com/PaddlePaddle/Paddle/pull/75691)
* Fixes a sequence-parallel precision issue on H20. [#76150](https://github.com/PaddlePaddle/Paddle/pull/76150)
* Fixes a bug in the auto-parallel sharding stage2/stage3 strategies under mixed-precision training. [#76462](https://github.com/PaddlePaddle/Paddle/pull/76462)
* Fixes redundant memcpy causing CUDAGraph failures in dynamic-to-static. [#75078](https://github.com/PaddlePaddle/Paddle/pull/75078)
* Fixes CINN compatibility with the float16 type. [#75090](https://github.com/PaddlePaddle/Paddle/pull/75090)
* Fixes resource leaks, logic errors, dead code, and similar issues in the framework. [#75332](https://github.com/PaddlePaddle/Paddle/pull/75332), [#75334](https://github.com/PaddlePaddle/Paddle/pull/75334), [#75338](https://github.com/PaddlePaddle/Paddle/pull/75338), [#75339](https://github.com/PaddlePaddle/Paddle/pull/75339), [#75340](https://github.com/PaddlePaddle/Paddle/pull/75340), [#75349](https://github.com/PaddlePaddle/Paddle/pull/75349), [#75353](https://github.com/PaddlePaddle/Paddle/pull/75353), [#75438](https://github.com/PaddlePaddle/Paddle/pull/75438), [#75439](https://github.com/PaddlePaddle/Paddle/pull/75439), [#75440](https://github.com/PaddlePaddle/Paddle/pull/75440), [#75441](https://github.com/PaddlePaddle/Paddle/pull/75441), [#75442](https://github.com/PaddlePaddle/Paddle/pull/75442), [#75444](https://github.com/PaddlePaddle/Paddle/pull/75444), [#75445](https://github.com/PaddlePaddle/Paddle/pull/75445), [#75448](https://github.com/PaddlePaddle/Paddle/pull/75448), [#75449](https://github.com/PaddlePaddle/Paddle/pull/75449), [#75450](https://github.com/PaddlePaddle/Paddle/pull/75450), [#75451](https://github.com/PaddlePaddle/Paddle/pull/75451), [#75453](https://github.com/PaddlePaddle/Paddle/pull/75453), [#75469](https://github.com/PaddlePaddle/Paddle/pull/75469), [#75498](https://github.com/PaddlePaddle/Paddle/pull/75498), [#75516](https://github.com/PaddlePaddle/Paddle/pull/75516), [#75517](https://github.com/PaddlePaddle/Paddle/pull/75517), [#75518](https://github.com/PaddlePaddle/Paddle/pull/75518), [#75519](https://github.com/PaddlePaddle/Paddle/pull/75519), [#75520](https://github.com/PaddlePaddle/Paddle/pull/75520), [#75749](https://github.com/PaddlePaddle/Paddle/pull/75749), [#75750](https://github.com/PaddlePaddle/Paddle/pull/75750), [#75753](https://github.com/PaddlePaddle/Paddle/pull/75753), [#75754](https://github.com/PaddlePaddle/Paddle/pull/75754), [#75755](https://github.com/PaddlePaddle/Paddle/pull/75755), [#75756](https://github.com/PaddlePaddle/Paddle/pull/75756),
[#75757](https://github.com/PaddlePaddle/Paddle/pull/75757), [#75759](https://github.com/PaddlePaddle/Paddle/pull/75759),[#75761](https://github.com/PaddlePaddle/Paddle/pull/75761), [#75762](https://github.com/PaddlePaddle/Paddle/pull/75762), [#75764](https://github.com/PaddlePaddle/Paddle/pull/75764), [#75765](https://github.com/PaddlePaddle/Paddle/pull/75765), [#75766](https://github.com/PaddlePaddle/Paddle/pull/75766), [#75767](https://github.com/PaddlePaddle/Paddle/pull/75767), [#75768](https://github.com/PaddlePaddle/Paddle/pull/75768),  [#75769](https://github.com/PaddlePaddle/Paddle/pull/75769), [#75770](https://github.com/PaddlePaddle/Paddle/pull/75770), [#75771](https://github.com/PaddlePaddle/Paddle/pull/75771), [#75772](https://github.com/PaddlePaddle/Paddle/pull/75772), [#75774](https://github.com/PaddlePaddle/Paddle/pull/75774), [#75775](https://github.com/PaddlePaddle/Paddle/pull/75775), [#75776](https://github.com/PaddlePaddle/Paddle/pull/75776),  [#75777](https://github.com/PaddlePaddle/Paddle/pull/75777),  [#75779](https://github.com/PaddlePaddle/Paddle/pull/75779), [#75780](https://github.com/PaddlePaddle/Paddle/pull/75780), [#75781](https://github.com/PaddlePaddle/Paddle/pull/75781), [#75782](https://github.com/PaddlePaddle/Paddle/pull/75782), [#75783](https://github.com/PaddlePaddle/Paddle/pull/75783),  [#75784](https://github.com/PaddlePaddle/Paddle/pull/75784), [#75785](https://github.com/PaddlePaddle/Paddle/pull/75785), [#75786](https://github.com/PaddlePaddle/Paddle/pull/75786), [#75787](https://github.com/PaddlePaddle/Paddle/pull/75787), [#75788](https://github.com/PaddlePaddle/Paddle/pull/75788), [#75789](https://github.com/PaddlePaddle/Paddle/pull/75789), [#75790](https://github.com/PaddlePaddle/Paddle/pull/75790), [#75791](https://github.com/PaddlePaddle/Paddle/pull/75791), [#75792](https://github.com/PaddlePaddle/Paddle/pull/75792),  [#75802](https://github.com/PaddlePaddle/Paddle/pull/75802), 
[#75798](https://github.com/PaddlePaddle/Paddle/pull/75798), [#75803](https://github.com/PaddlePaddle/Paddle/pull/75803), [#75812](https://github.com/PaddlePaddle/Paddle/pull/75812), [#75813](https://github.com/PaddlePaddle/Paddle/pull/75813), [#75819](https://github.com/PaddlePaddle/Paddle/pull/75819), [#75820](https://github.com/PaddlePaddle/Paddle/pull/75820), [#75822](https://github.com/PaddlePaddle/Paddle/pull/75822), [#75823](https://github.com/PaddlePaddle/Paddle/pull/75823), [#75959](https://github.com/PaddlePaddle/Paddle/pull/75959), [#76049](https://github.com/PaddlePaddle/Paddle/pull/76049),[#76052](https://github.com/PaddlePaddle/Paddle/pull/76052),  [#76054](https://github.com/PaddlePaddle/Paddle/pull/76054), [#76047](https://github.com/PaddlePaddle/Paddle/pull/76047), [#76058](https://github.com/PaddlePaddle/Paddle/pull/76058), [#76077](https://github.com/PaddlePaddle/Paddle/pull/76077),[#76078](https://github.com/PaddlePaddle/Paddle/pull/76078), [#76102](https://github.com/PaddlePaddle/Paddle/pull/76102),[#76463](https://github.com/PaddlePaddle/Paddle/pull/76463), [#76483](https://github.com/PaddlePaddle/Paddle/pull/76483)
* Fixes the dynamic symbolic-inference chain breaking when a control-flow if block contains a built-in parameter operator. [#75378](https://github.com/PaddlePaddle/Paddle/pull/75378), [#76103](https://github.com/PaddlePaddle/Paddle/pull/76103)
* Fixes a bug where attrs stored an invalid false value in the all_static case of StrategyForArangeSymbolic. [#75837](https://github.com/PaddlePaddle/Paddle/pull/75837)
* Fixes inconsistent data types between forward and backward in automatic differentiation. [#75840](https://github.com/PaddlePaddle/Paddle/pull/75840)
* Fixes a CINN symbolic-inference bug in the Crop operator. [#75992](https://github.com/PaddlePaddle/Paddle/pull/75992)
* Fixes DLPack implementation issues in stream-argument handling, stride conversion, and related cases. [#76840](https://github.com/PaddlePaddle/Paddle/pull/76840), [#77063](https://github.com/PaddlePaddle/Paddle/pull/77063)
* Fixes a series of issues in slice- and stride-related mechanisms. [#75794](https://github.com/PaddlePaddle/Paddle/pull/75794), [#76004](https://github.com/PaddlePaddle/Paddle/pull/76004), [#76211](https://github.com/PaddlePaddle/Paddle/pull/76211), [#76967](https://github.com/PaddlePaddle/Paddle/pull/76967)
* Fixes a bug where CINN did not correctly initialize the shape during infer shape for 0-size dynamic shapes. [#76093](https://github.com/PaddlePaddle/Paddle/pull/76093)
* Fixes issues such as dynamic-to-static incorrectly adding the parameter OP to a sub-block during graph construction. [#76190](https://github.com/PaddlePaddle/Paddle/pull/76190)

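## 2. Operator Optimization and Improvements
To overcome the training-efficiency bottleneck caused by the high computational complexity and large storage footprint of attention masks in large models, Paddle continues to refine FlashMask, its in-house column-wise sparse attention-mask technique. FlashMask V3, new in Paddle 3.3, introduces for the first time a forward Persistent Preemptive Tile Scheduler (PPT), which balances compute load across GPU streaming multiprocessors (SMs), and natively supports attention with complex masks under long-context context parallelism, strengthening long-sequence training. Across a range of mask patterns, it is up to more than 1.4x faster than the previous version and outperforms FlexAttention across the board: single-GPU performance leads FlexAttention by up to 2.1x, and distributed performance leads Megatron-LM's distributed implementation by up to 80%. Beyond FlashMask, Paddle 3.3 applies targeted precision optimizations to the kernels of frequently used operators in mainstream MoE models and strengthens support for strided tensors, very large tensors, and similar cases, substantially improving the numerical precision, stability, and robustness of framework operators in very-large-scale model training.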
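
The storage benefit of a column-wise mask representation can be sketched as follows. This is a simplified illustration, not FlashMask's kernel or the exact layout of startend_row_indices: it assumes causal attention and a single hypothetical per-column value `start_row[j]`, the first query row that may no longer attend to key column j. The mask is then described by N integers instead of an N x N matrix.

```python
from typing import List


def dense_mask_from_columns(start_row: List[int]) -> List[List[bool]]:
    """Expand a per-column description into a dense 'may attend' matrix.

    Query row i may attend to key column j when j <= i (causal) and
    i < start_row[j] (row i comes before the masked region of column j).
    """
    n = len(start_row)
    return [[j <= i < start_row[j] for j in range(n)] for i in range(n)]


# Assumed example: two documents of lengths 2 and 3 packed into 5 tokens.
# Keys 0-1 stop being visible from row 2 on; keys 2-4 stay visible to the end.
start_row = [2, 2, 5, 5, 5]
mask = dense_mask_from_columns(start_row)

for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Here each token attends only to earlier tokens in its own document. The dense matrix is built only for display; the point of the column-wise form is that a kernel can work from the compact per-column indices directly.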

**New Features**
* New APIs: paddle.compat.nn.functional.linear, paddle.dot, paddle.is_floating_point, paddle.is_tensor, paddle.isin. [#76144](https://github.com/PaddlePaddle/Paddle/pull/76144), [#75150](https://github.com/PaddlePaddle/Paddle/pull/75150), [#75032](https://github.com/PaddlePaddle/Paddle/pull/75032)
* Adds FlashInfer support. [#75075](https://github.com/PaddlePaddle/Paddle/pull/75075)
* FlashMask V3 implements block mask, supporting sparse attention computation. [#76407](https://github.com/PaddlePaddle/Paddle/pull/76407)

**Enhancements**
* register_forward_pre_hook supports parameters such as prepend, with_kwargs, and always_call. [#74611](https://github.com/PaddlePaddle/Paddle/pull/74611)
* prod/sum add an out parameter. [#75004](https://github.com/PaddlePaddle/Paddle/pull/75004)
* remainder, floor_divide, and masked_select: reconciles behavioral differences for compatibility. [#75148](https://github.com/PaddlePaddle/Paddle/pull/75148)
* APIs support parameter aliases and scalar inputs, and add the missing out parameter. [#75163](https://github.com/PaddlePaddle/Paddle/pull/75163), [#75317](https://github.com/PaddlePaddle/Paddle/pull/75317)
* FlashMask V3 supports head dims larger than 128. [#76365](https://github.com/PaddlePaddle/Paddle/pull/76365)
* PyLayer supports set_grad_in_dtype_consistent. [#76537](https://github.com/PaddlePaddle/Paddle/pull/76537)
* FlashMask's startend_row_indices supports independent mask settings along the q head dimension. [#77469](https://github.com/PaddlePaddle/Paddle/pull/77469)
* C Sink: paddle.nn.functional.gelu supports parameter aliases and usage differences. [#75210](https://github.com/PaddlePaddle/Paddle/pull/75210)
* Improves the numerical precision of a series of APIs, including the conv, matmul, interpolate, and reduce families. [#75237](https://github.com/PaddlePaddle/Paddle/pull/75237), [#75238](https://github.com/PaddlePaddle/Paddle/pull/75238), [#75335](https://github.com/PaddlePaddle/Paddle/pull/75335), [#75341](https://github.com/PaddlePaddle/Paddle/pull/75341), [#75355](https://github.com/PaddlePaddle/Paddle/pull/75355), [#75363](https://github.com/PaddlePaddle/Paddle/pull/75363), [#75367](https://github.com/PaddlePaddle/Paddle/pull/75367), [#75379](https://github.com/PaddlePaddle/Paddle/pull/75379), [#75426](https://github.com/PaddlePaddle/Paddle/pull/75426), [#75454](https://github.com/PaddlePaddle/Paddle/pull/75454), [#75503](https://github.com/PaddlePaddle/Paddle/pull/75503), [#75525](https://github.com/PaddlePaddle/Paddle/pull/75525), [#75547](https://github.com/PaddlePaddle/Paddle/pull/75547), [#75549](https://github.com/PaddlePaddle/Paddle/pull/75549), [#75588](https://github.com/PaddlePaddle/Paddle/pull/75588), [#75605](https://github.com/PaddlePaddle/Paddle/pull/75605), [#75717](https://github.com/PaddlePaddle/Paddle/pull/75717), [#75799](https://github.com/PaddlePaddle/Paddle/pull/75799), [#75816](https://github.com/PaddlePaddle/Paddle/pull/75816), [#75898](https://github.com/PaddlePaddle/Paddle/pull/75898), [#75965](https://github.com/PaddlePaddle/Paddle/pull/75965), [#75968](https://github.com/PaddlePaddle/Paddle/pull/75968), [#75970](https://github.com/PaddlePaddle/Paddle/pull/75970), [#76066](https://github.com/PaddlePaddle/Paddle/pull/76066), [#76224](https://github.com/PaddlePaddle/Paddle/pull/76224), [#76231](https://github.com/PaddlePaddle/Paddle/pull/76231), [#76246](https://github.com/PaddlePaddle/Paddle/pull/76246), [#76398](https://github.com/PaddlePaddle/Paddle/pull/76398), [#76553](https://github.com/PaddlePaddle/Paddle/pull/76553), [#76590](https://github.com/PaddlePaddle/Paddle/pull/76590), [#76723](https://github.com/PaddlePaddle/Paddle/pull/76723),
[#76735](https://github.com/PaddlePaddle/Paddle/pull/76735), [#76758](https://github.com/PaddlePaddle/Paddle/pull/76758), [#76814](https://github.com/PaddlePaddle/Paddle/pull/76814), [#76846](https://github.com/PaddlePaddle/Paddle/pull/76846), [#76922](https://github.com/PaddlePaddle/Paddle/pull/76922), [#76980](https://github.com/PaddlePaddle/Paddle/pull/76980), [#77098](https://github.com/PaddlePaddle/Paddle/pull/77098), [#77143](https://github.com/PaddlePaddle/Paddle/pull/77143), [#77149](https://github.com/PaddlePaddle/Paddle/pull/77149)
* fused_rms_norm_ext is supported on CUDA 11.8. [#76624](https://github.com/PaddlePaddle/Paddle/pull/76624)
* cudnn_frontend is enabled by default. [#76735](https://github.com/PaddlePaddle/Paddle/pull/76735)

**Performance Optimizations**
* FlashMask V3 improves overall kernel performance through the PPT load-balancing scheduler, TileSize tuning, kernel control-flow optimization, and other techniques. [#75984](https://github.com/PaddlePaddle/Paddle/pull/75984), [#76003](https://github.com/PaddlePaddle/Paddle/pull/76003), [#76216](https://github.com/PaddlePaddle/Paddle/pull/76216)
* Optimizes the FusedRope kernel, improving compute performance. [#76824](https://github.com/PaddlePaddle/Paddle/pull/76824)
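
Why load balancing across SMs matters for sparse masks can be seen in a small simulation. The sketch below is not the PPT scheduler: it does not model persistence or preemption, and the tile costs are made-up numbers standing in for tiles that are fully computed versus mostly skipped. It only compares a fixed round-robin assignment of tiles to SMs with dynamic dispatch to whichever SM becomes free first.

```python
import heapq
from typing import List


def static_makespan(costs: List[int], num_sms: int) -> int:
    """Tile k is bound to SM (k % num_sms); return the busiest SM's total time."""
    load = [0] * num_sms
    for k, cost in enumerate(costs):
        load[k % num_sms] += cost
    return max(load)


def dynamic_makespan(costs: List[int], num_sms: int) -> int:
    """Each tile, in order, goes to the SM that becomes free earliest."""
    finish_times = [0] * num_sms
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical per-tile costs: dense tiles cost 8, mostly masked tiles cost 1.
tile_costs = [8, 1, 1, 1, 8, 1, 1, 1]

print("static round-robin:", static_makespan(tile_costs, num_sms=4))
print("dynamic dispatch:  ", dynamic_makespan(tile_costs, num_sms=4))
```

With these numbers the round-robin assignment places both expensive tiles on the same SM, so the total time is 16 units, while dynamic dispatch finishes in 9. Uneven per-tile cost is typical when a mask lets some tiles be skipped.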

**Bug Fixes**
* Fixes a series of API issues related to large tensors, 0-size tensors, and strides. [#74851](https://github.com/PaddlePaddle/Paddle/pull/74851), [#74860](https://github.com/PaddlePaddle/Paddle/pull/74860), [#75142](https://github.com/PaddlePaddle/Paddle/pull/75142), [#75261](https://github.com/PaddlePaddle/Paddle/pull/75261), [#75341](https://github.com/PaddlePaddle/Paddle/pull/75341), [#75506](https://github.com/PaddlePaddle/Paddle/pull/75506), [#75523](https://github.com/PaddlePaddle/Paddle/pull/75523), [#75536](https://github.com/PaddlePaddle/Paddle/pull/75536), [#75537](https://github.com/PaddlePaddle/Paddle/pull/75537), [#75538](https://github.com/PaddlePaddle/Paddle/pull/75538), [#75539](https://github.com/PaddlePaddle/Paddle/pull/75539), [#75540](https://github.com/PaddlePaddle/Paddle/pull/75540), [#75541](https://github.com/PaddlePaddle/Paddle/pull/75541), [#75542](https://github.com/PaddlePaddle/Paddle/pull/75542), [#75543](https://github.com/PaddlePaddle/Paddle/pull/75543), [#75545](https://github.com/PaddlePaddle/Paddle/pull/75545), [#75554](https://github.com/PaddlePaddle/Paddle/pull/75554), [#75562](https://github.com/PaddlePaddle/Paddle/pull/75562), [#75577](https://github.com/PaddlePaddle/Paddle/pull/75577), [#75578](https://github.com/PaddlePaddle/Paddle/pull/75578), [#75580](https://github.com/PaddlePaddle/Paddle/pull/75580), [#75581](https://github.com/PaddlePaddle/Paddle/pull/75581), [#75596](https://github.com/PaddlePaddle/Paddle/pull/75596), [#75601](https://github.com/PaddlePaddle/Paddle/pull/75601), [#75607](https://github.com/PaddlePaddle/Paddle/pull/75607), [#75608](https://github.com/PaddlePaddle/Paddle/pull/75608), [#75614](https://github.com/PaddlePaddle/Paddle/pull/75614), [#75615](https://github.com/PaddlePaddle/Paddle/pull/75615), [#75616](https://github.com/PaddlePaddle/Paddle/pull/75616), [#75625](https://github.com/PaddlePaddle/Paddle/pull/75625), [#75626](https://github.com/PaddlePaddle/Paddle/pull/75626), [#75633](https://github.com/PaddlePaddle/Paddle/pull/75633),
[#75636](https://github.com/PaddlePaddle/Paddle/pull/75636), [#75637](https://github.com/PaddlePaddle/Paddle/pull/75637), [#75639](https://github.com/PaddlePaddle/Paddle/pull/75639), [#75640](https://github.com/PaddlePaddle/Paddle/pull/75640), [#75641](https://github.com/PaddlePaddle/Paddle/pull/75641), [#75642](https://github.com/PaddlePaddle/Paddle/pull/75642), [#75643](https://github.com/PaddlePaddle/Paddle/pull/75643), [#75644](https://github.com/PaddlePaddle/Paddle/pull/75644), [#75645](https://github.com/PaddlePaddle/Paddle/pull/75645), [#75647](https://github.com/PaddlePaddle/Paddle/pull/75647), [#75655](https://github.com/PaddlePaddle/Paddle/pull/75655), [#75658](https://github.com/PaddlePaddle/Paddle/pull/75658), [#75659](https://github.com/PaddlePaddle/Paddle/pull/75659), [#75660](https://github.com/PaddlePaddle/Paddle/pull/75660), [#75661](https://github.com/PaddlePaddle/Paddle/pull/75661), [#75662](https://github.com/PaddlePaddle/Paddle/pull/75662), [#75663](https://github.com/PaddlePaddle/Paddle/pull/75663), [#75664](https://github.com/PaddlePaddle/Paddle/pull/75664), [#75665](https://github.com/PaddlePaddle/Paddle/pull/75665), [#75666](https://github.com/PaddlePaddle/Paddle/pull/75666), [#75667](https://github.com/PaddlePaddle/Paddle/pull/75667), [#75673](https://github.com/PaddlePaddle/Paddle/pull/75673), [#75699](https://github.com/PaddlePaddle/Paddle/pull/75699), [#75700](https://github.com/PaddlePaddle/Paddle/pull/75700), [#75701](https://github.com/PaddlePaddle/Paddle/pull/75701), [#75703](https://github.com/PaddlePaddle/Paddle/pull/75703), [#75704](https://github.com/PaddlePaddle/Paddle/pull/75704), [#75706](https://github.com/PaddlePaddle/Paddle/pull/75706), [#75707](https://github.com/PaddlePaddle/Paddle/pull/75707), [#75708](https://github.com/PaddlePaddle/Paddle/pull/75708), [#75709](https://github.com/PaddlePaddle/Paddle/pull/75709), [#75710](https://github.com/PaddlePaddle/Paddle/pull/75710), 
[#75711](https://github.com/PaddlePaddle/Paddle/pull/75711), [#75713](https://github.com/PaddlePaddle/Paddle/pull/75713), [#75714](https://github.com/PaddlePaddle/Paddle/pull/75714), [#75715](https://github.com/PaddlePaddle/Paddle/pull/75715), [#75716](https://github.com/PaddlePaddle/Paddle/pull/75716), [#75717](https://github.com/PaddlePaddle/Paddle/pull/75717), [#75725](https://github.com/PaddlePaddle/Paddle/pull/75725), [#75731](https://github.com/PaddlePaddle/Paddle/pull/75731), [#75798](https://github.com/PaddlePaddle/Paddle/pull/75798), [#75852](https://github.com/PaddlePaddle/Paddle/pull/75852), [#75856](https://github.com/PaddlePaddle/Paddle/pull/75856), [#75903](https://github.com/PaddlePaddle/Paddle/pull/75903), [#75909](https://github.com/PaddlePaddle/Paddle/pull/75909), [#75965](https://github.com/PaddlePaddle/Paddle/pull/75965), [#75987](https://github.com/PaddlePaddle/Paddle/pull/75987), [#75988](https://github.com/PaddlePaddle/Paddle/pull/75988), [#76107](https://github.com/PaddlePaddle/Paddle/pull/76107), [#76230](https://github.com/PaddlePaddle/Paddle/pull/76230), [#76278](https://github.com/PaddlePaddle/Paddle/pull/76278), [#76290](https://github.com/PaddlePaddle/Paddle/pull/76290), [#76303](https://github.com/PaddlePaddle/Paddle/pull/76303), [#76355](https://github.com/PaddlePaddle/Paddle/pull/76355), [#76363](https://github.com/PaddlePaddle/Paddle/pull/76363), [#76364](https://github.com/PaddlePaddle/Paddle/pull/76364), [#76368](https://github.com/PaddlePaddle/Paddle/pull/76368), [#76376](https://github.com/PaddlePaddle/Paddle/pull/76376), [#76393](https://github.com/PaddlePaddle/Paddle/pull/76393), [#76419](https://github.com/PaddlePaddle/Paddle/pull/76419), [#76453](https://github.com/PaddlePaddle/Paddle/pull/76453), [#76458](https://github.com/PaddlePaddle/Paddle/pull/76458), [#76466](https://github.com/PaddlePaddle/Paddle/pull/76466), [#76468](https://github.com/PaddlePaddle/Paddle/pull/76468), 
[#76480](https://github.com/PaddlePaddle/Paddle/pull/76480), [#76512](https://github.com/PaddlePaddle/Paddle/pull/76512), [#76587](https://github.com/PaddlePaddle/Paddle/pull/76587), [#76612](https://github.com/PaddlePaddle/Paddle/pull/76612), [#76694](https://github.com/PaddlePaddle/Paddle/pull/76694), [#76760](https://github.com/PaddlePaddle/Paddle/pull/76760), [#76797](https://github.com/PaddlePaddle/Paddle/pull/76797), [#76798](https://github.com/PaddlePaddle/Paddle/pull/76798), [#76834](https://github.com/PaddlePaddle/Paddle/pull/76834), [#76858](https://github.com/PaddlePaddle/Paddle/pull/76858), [#76867](https://github.com/PaddlePaddle/Paddle/pull/76867), [#76934](https://github.com/PaddlePaddle/Paddle/pull/76934), [#76937](https://github.com/PaddlePaddle/Paddle/pull/76937), [#76965](https://github.com/PaddlePaddle/Paddle/pull/76965), [#76980](https://github.com/PaddlePaddle/Paddle/pull/76980), [#77074](https://github.com/PaddlePaddle/Paddle/pull/77074), [#77076](https://github.com/PaddlePaddle/Paddle/pull/77076), [#77098](https://github.com/PaddlePaddle/Paddle/pull/77098), [#77140](https://github.com/PaddlePaddle/Paddle/pull/77140)
* Fixes a FlashMask V3 compilation issue on arch>=90. [#76227](https://github.com/PaddlePaddle/Paddle/pull/76227)
* Fixes the padding in FlashMask's LSE storage not matching the LSE storage format of upstream FA2. [#76886](https://github.com/PaddlePaddle/Paddle/pull/76886)
* Fixes an out-of-bounds read in FlashMask at certain sequence lengths. [#76951](https://github.com/PaddlePaddle/Paddle/pull/76951)
* Fixes a bug in the weight only linear pass. [#76097](https://github.com/PaddlePaddle/Paddle/pull/76097)

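## 3. User Experience Upgrades
Ecosystem compatibility is improved across the board: operators from external ecosystems can be used seamlessly within the Paddle framework, and a new family of paddle.compat APIs, comprising nearly 10 modules such as paddle.compat.nn.Linear and paddle.compat.nn.MultiheadAttention, lowers the cost of connecting Paddle models to external ecosystem modules. The dynamic-graph debugging experience is also improved: forward and backward computation-graph visualization is added; APIs, Tensors, and GradNodes share a unified naming scheme that links them together, with export of call stacks and MD5 checksums; critical-path logging has been cleaned up and scoped log printing added; and for nested PyLayer scenarios in particular, error messages are more precise and log levels more readable, making debugging information richer and easier to obtain. A new GPU memory inspection feature shows the distribution of memory blocks in the memory pool and tracks memory allocations/frees and global state for a specific code region, enabling precise localization and efficient handling of memory anomalies in large models.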

**New Features**
* Adds visualization of the dynamic-graph forward and backward computation graphs. [#75240](https://github.com/PaddlePaddle/Paddle/pull/75240), [#76032](https://github.com/PaddlePaddle/Paddle/pull/76032), [#76441](https://github.com/PaddlePaddle/Paddle/pull/76441)
* Adds scoped log printing: the global log level can be set dynamically, and logs can be limited to those emitted by a specific code region during forward and backward execution. [#75368](https://github.com/PaddlePaddle/Paddle/pull/75368), [#75590](https://github.com/PaddlePaddle/Paddle/pull/75590), [#76010](https://github.com/PaddlePaddle/Paddle/pull/76010), [#76685](https://github.com/PaddlePaddle/Paddle/pull/76685)
* Adds a unified naming scheme for key objects, giving APIs, Tensors, and GradNodes unique names that are linked to one another. [#75752](https://github.com/PaddlePaddle/Paddle/pull/75752)
* Supports exporting the Python call stacks for forward APIs and backward GradNodes. [#76143](https://github.com/PaddlePaddle/Paddle/pull/76143), [#75240](https://github.com/PaddlePaddle/Paddle/pull/75240)
* Supports exporting the MD5 checksum of every Tensor. [#75835](https://github.com/PaddlePaddle/Paddle/pull/75835), [#76672](https://github.com/PaddlePaddle/Paddle/pull/76672)
* Supports debugging for the VMM Allocator: paddle.device.cuda.memory_summary returns the state of the memory pool, and paddle.device.cuda.allocate_record_guard and paddle.device.cuda.allocate_record_table analyze and print the memory allocations of a specific code module. [#76197](https://github.com/PaddlePaddle/Paddle/pull/76197), [#76349](https://github.com/PaddlePaddle/Paddle/pull/76349), [#76499](https://github.com/PaddlePaddle/Paddle/pull/76499), [#76554](https://github.com/PaddlePaddle/Paddle/pull/76554), [#76647](https://github.com/PaddlePaddle/Paddle/pull/76647), [#76716](https://github.com/PaddlePaddle/Paddle/pull/76716), [#76812](https://github.com/PaddlePaddle/Paddle/pull/76812)
* Adds the paddle.compat family of APIs, including paddle.compat.equal, paddle.compat.nn.AvgPoolD, paddle.compat.nn.Linear, paddle.compat.nn.MultiheadAttention, paddle.compat.nn.Softmax, paddle.compat.nn.functional.sdpa, paddle.compat.seed, paddle.compat.slogdet, paddle.compat.unique, and others. [#74697](https://github.com/PaddlePaddle/Paddle/pull/74697), [#76169](https://github.com/PaddlePaddle/Paddle/pull/76169), [#76275](https://github.com/PaddlePaddle/Paddle/pull/76275), [#76279](https://github.com/PaddlePaddle/Paddle/pull/76279), [#76471](https://github.com/PaddlePaddle/Paddle/pull/76471), [#76637](https://github.com/PaddlePaddle/Paddle/pull/76637), [#76446](https://github.com/PaddlePaddle/Paddle/pull/76446), [#76440](https://github.com/PaddlePaddle/Paddle/pull/76440), [#76387](https://github.com/PaddlePaddle/Paddle/pull/76387)
* Adds a series of CUDA-related APIs, including paddle.cuda.cudart, paddle.cuda.get_stream_from_external, paddle.cuda.ipc_collect, paddle.cuda.Stream, paddle.device, paddle.version.cuda, and others. [#75063](https://github.com/PaddlePaddle/Paddle/pull/75063), [#75089](https://github.com/PaddlePaddle/Paddle/pull/75089), [#75091](https://github.com/PaddlePaddle/Paddle/pull/75091), [#75108](https://github.com/PaddlePaddle/Paddle/pull/75108), [#75115](https://github.com/PaddlePaddle/Paddle/pull/75115), [#75153](https://github.com/PaddlePaddle/Paddle/pull/75153), [#75366](https://github.com/PaddlePaddle/Paddle/pull/75366), [#75435](https://github.com/PaddlePaddle/Paddle/pull/75435), [#75455](https://github.com/PaddlePaddle/Paddle/pull/75455), [#75744](https://github.com/PaddlePaddle/Paddle/pull/75744), [#76344](https://github.com/PaddlePaddle/Paddle/pull/76344)
* Supports multiple signatures for a series of APIs, improving compatibility. [#75013](https://github.com/PaddlePaddle/Paddle/pull/75013), [#75027](https://github.com/PaddlePaddle/Paddle/pull/75027), [#75037](https://github.com/PaddlePaddle/Paddle/pull/75037), [#75043](https://github.com/PaddlePaddle/Paddle/pull/75043), [#75055](https://github.com/PaddlePaddle/Paddle/pull/75055), [#75063](https://github.com/PaddlePaddle/Paddle/pull/75063), [#75089](https://github.com/PaddlePaddle/Paddle/pull/75089), [#75091](https://github.com/PaddlePaddle/Paddle/pull/75091), [#75108](https://github.com/PaddlePaddle/Paddle/pull/75108), [#75115](https://github.com/PaddlePaddle/Paddle/pull/75115), [#75146](https://github.com/PaddlePaddle/Paddle/pull/75146), [#75153](https://github.com/PaddlePaddle/Paddle/pull/75153), [#75154](https://github.com/PaddlePaddle/Paddle/pull/75154), [#75174](https://github.com/PaddlePaddle/Paddle/pull/75174), [#75183](https://github.com/PaddlePaddle/Paddle/pull/75183), [#75206](https://github.com/PaddlePaddle/Paddle/pull/75206), [#75211](https://github.com/PaddlePaddle/Paddle/pull/75211), [#75298](https://github.com/PaddlePaddle/Paddle/pull/75298), [#75344](https://github.com/PaddlePaddle/Paddle/pull/75344), [#75366](https://github.com/PaddlePaddle/Paddle/pull/75366), [#75435](https://github.com/PaddlePaddle/Paddle/pull/75435), [#75455](https://github.com/PaddlePaddle/Paddle/pull/75455), [#75742](https://github.com/PaddlePaddle/Paddle/pull/75742), [#75744](https://github.com/PaddlePaddle/Paddle/pull/75744), [#76089](https://github.com/PaddlePaddle/Paddle/pull/76089), [#76132](https://github.com/PaddlePaddle/Paddle/pull/76132), [#76136](https://github.com/PaddlePaddle/Paddle/pull/76136), [#76149](https://github.com/PaddlePaddle/Paddle/pull/76149), [#76179](https://github.com/PaddlePaddle/Paddle/pull/76179), [#76189](https://github.com/PaddlePaddle/Paddle/pull/76189), [#76191](https://github.com/PaddlePaddle/Paddle/pull/76191), [#76206](https://github.com/PaddlePaddle/Paddle/pull/76206),
[#76217](https://github.com/PaddlePaddle/Paddle/pull/76217), [#76221](https://github.com/PaddlePaddle/Paddle/pull/76221), [#76237](https://github.com/PaddlePaddle/Paddle/pull/76237), [#76254](https://github.com/PaddlePaddle/Paddle/pull/76254), [#76274](https://github.com/PaddlePaddle/Paddle/pull/76274), [#76285](https://github.com/PaddlePaddle/Paddle/pull/76285), [#76304](https://github.com/PaddlePaddle/Paddle/pull/76304), [#76344](https://github.com/PaddlePaddle/Paddle/pull/76344), [#76394](https://github.com/PaddlePaddle/Paddle/pull/76394), [#76439](https://github.com/PaddlePaddle/Paddle/pull/76439), [#76440](https://github.com/PaddlePaddle/Paddle/pull/76440), [#76446](https://github.com/PaddlePaddle/Paddle/pull/76446), [#76493](https://github.com/PaddlePaddle/Paddle/pull/76493), [#76522](https://github.com/PaddlePaddle/Paddle/pull/76522), [#76552](https://github.com/PaddlePaddle/Paddle/pull/76552), [#76580](https://github.com/PaddlePaddle/Paddle/pull/76580), [#76696](https://github.com/PaddlePaddle/Paddle/pull/76696), [#76754](https://github.com/PaddlePaddle/Paddle/pull/76754), [#76818](https://github.com/PaddlePaddle/Paddle/pull/76818)
* Added debugging and monitoring APIs, including set_vlog_level. [#75368](https://github.com/PaddlePaddle/Paddle/pull/75368), [#75500](https://github.com/PaddlePaddle/Paddle/pull/75500), [#76010](https://github.com/PaddlePaddle/Paddle/pull/76010), [#76450](https://github.com/PaddlePaddle/Paddle/pull/76450), [#76554](https://github.com/PaddlePaddle/Paddle/pull/76554), [#76647](https://github.com/PaddlePaddle/Paddle/pull/76647)
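
As a rough illustration of what supporting multiple signatures can mean, the sketch below maps alternative keyword names onto one canonical signature. It is a minimal pure-Python sketch, not Paddle's implementation: the `with_aliases` decorator, the `total` function, and the `input`/`dim` alias names are all hypothetical and chosen only for illustration.

```python
import functools


def with_aliases(**alias_to_canonical):
    """Hypothetical helper: accept alternative keyword names for a function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for alias, canonical in alias_to_canonical.items():
                if alias in kwargs:
                    if canonical in kwargs:
                        # Reject ambiguous calls that pass both spellings.
                        raise TypeError(
                            f"got both '{alias}' and '{canonical}' "
                            "for the same parameter"
                        )
                    kwargs[canonical] = kwargs.pop(alias)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_aliases(input="x", dim="axis")
def total(x, axis=None):
    """Toy reduction over a list of rows; stands in for a tensor operator."""
    if axis is None:
        return sum(sum(row) for row in x)
    if axis == 0:
        return [sum(col) for col in zip(*x)]
    if axis == 1:
        return [sum(row) for row in x]
    raise ValueError("axis must be None, 0, or 1")


data = [[1, 2], [3, 4]]
canonical_call = total(data, axis=0)      # canonical keyword names
alias_call = total(input=data, dim=0)     # alternative keyword names
print(canonical_call, alias_call)
```

Both calls reach the same function body, so existing callers keep working while code written against the alternative names runs unchanged.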

**Enhancements**
* Systematically reorganized log levels along the critical forward and backward paths in dynamic graph mode. [#75240](https://github.com/PaddlePaddle/Paddle/pull/75240)
* Improved error messages for nested PyLayer usage: errors now show the forward stack and the GradNode name directly, and messages from different PyLayer levels are distinguished by indentation. [#76219](https://github.com/PaddlePaddle/Paddle/pull/76219), [#76450](https://github.com/PaddlePaddle/Paddle/pull/76450)
* Improved logging and debugging output: GLOG info management, support for exporting the Python call stack of forward APIs, and more. [#75240](https://github.com/PaddlePaddle/Paddle/pull/75240), [#75888](https://github.com/PaddlePaddle/Paddle/pull/75888), [#76010](https://github.com/PaddlePaddle/Paddle/pull/76010), [#76143](https://github.com/PaddlePaddle/Paddle/pull/76143), [#76450](https://github.com/PaddlePaddle/Paddle/pull/76450), [#76219](https://github.com/PaddlePaddle/Paddle/pull/76219)
* Improved custom operator debugging. [#76603](https://github.com/PaddlePaddle/Paddle/pull/76603)
* Improved data type interfaces. [#75096](https://github.com/PaddlePaddle/Paddle/pull/75096), [#75427](https://github.com/PaddlePaddle/Paddle/pull/75427)
* Improved documentation and example code. [#74603](https://github.com/PaddlePaddle/Paddle/pull/74603), [#75232](https://github.com/PaddlePaddle/Paddle/pull/75232), [#75233](https://github.com/PaddlePaddle/Paddle/pull/75233), [#75234](https://github.com/PaddlePaddle/Paddle/pull/75234), [#75527](https://github.com/PaddlePaddle/Paddle/pull/75527), [#75594](https://github.com/PaddlePaddle/Paddle/pull/75594), [#76188](https://github.com/PaddlePaddle/Paddle/pull/76188), [#76350](https://github.com/PaddlePaddle/Paddle/pull/76350), [#76435](https://github.com/PaddlePaddle/Paddle/pull/76435), [#76542](https://github.com/PaddlePaddle/Paddle/pull/76542), [#76563](https://github.com/PaddlePaddle/Paddle/pull/76563), [#76574](https://github.com/PaddlePaddle/Paddle/pull/76574), [#76617](https://github.com/PaddlePaddle/Paddle/pull/76617), [#76689](https://github.com/PaddlePaddle/Paddle/pull/76689), [#76691](https://github.com/PaddlePaddle/Paddle/pull/76691), [#76698](https://github.com/PaddlePaddle/Paddle/pull/76698), [#76721](https://github.com/PaddlePaddle/Paddle/pull/76721), [#76750](https://github.com/PaddlePaddle/Paddle/pull/76750), [#76926](https://github.com/PaddlePaddle/Paddle/pull/76926)
* Reconciled behavior differences in floor_divide and masked_select for compatibility. [#75148](https://github.com/PaddlePaddle/Paddle/pull/75148)
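
The notes do not say which behavior differences were reconciled for floor_divide. One common way floor-division operators diverge between libraries is the rounding mode for negative operands, and the sketch below illustrates only that general point in plain Python. Treat it as an assumption about the kind of difference involved, not a description of the change in the linked PR; the helper names `floor_div` and `trunc_div` are hypothetical.

```python
import math


def floor_div(a, b):
    """Round the quotient toward negative infinity."""
    return math.floor(a / b)


def trunc_div(a, b):
    """Round the quotient toward zero."""
    return math.trunc(a / b)


pairs = [(7, 2), (-7, 2), (7, -2), (-7, -2)]
results = [(a, b, floor_div(a, b), trunc_div(a, b)) for a, b in pairs]

for a, b, floored, truncated in results:
    marker = "differs" if floored != truncated else "same"
    print(f"{a:>3} / {b:>2}: floor={floored:>2} trunc={truncated:>2} ({marker})")
```

The two modes agree whenever the exact quotient is non-negative and disagree by one when it is negative and non-integral, which is why code ported between libraries can see silent off-by-one changes.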

**Bug Fixes**
* Fixed place-related issues when constructing a paddle.Tensor. [#75017](https://github.com/PaddlePaddle/Paddle/pull/75017)
* Fixed broken type propagation caused by missing generic parameters in decorators. [#75162](https://github.com/PaddlePaddle/Paddle/pull/75162)
* Fixed a division issue in dynamic shape handling. [#75526](https://github.com/PaddlePaddle/Paddle/pull/75526)
* Fixed issues with device APIs and model.to(device=tensor.place). [#75308](https://github.com/PaddlePaddle/Paddle/pull/75308), [#75530](https://github.com/PaddlePaddle/Paddle/pull/75530), [#75867](https://github.com/PaddlePaddle/Paddle/pull/75867)
* Fixed how `Tensor.__eq__` and `Tensor.__ne__` handle unsupported types. [#76118](https://github.com/PaddlePaddle/Paddle/pull/76118)
* Fixed issues in to_sparse_coo and to_sparse_csr. [#76076](https://github.com/PaddlePaddle/Paddle/pull/76076)
* Fixed the position of the zero_infinity parameter in CTCLoss and the related documentation. [#76156](https://github.com/PaddlePaddle/Paddle/pull/76156), [#76188](https://github.com/PaddlePaddle/Paddle/pull/76188)
* Fixed the initialization of weight[padding_idx] in Embedding. [#76204](https://github.com/PaddlePaddle/Paddle/pull/76204)
* Fixed an XPU pin_memory issue. [#76147](https://github.com/PaddlePaddle/Paddle/pull/76147)
* Fixed a tensor printing issue. [#76380](https://github.com/PaddlePaddle/Paddle/pull/76380)
* Fixed a Windows compilation error. [#76322](https://github.com/PaddlePaddle/Paddle/pull/76322)
* Fixed passing kwargs to paddle.nn.Parameter. [#76476](https://github.com/PaddlePaddle/Paddle/pull/76476)
* Fixed issues in Layer.zero_grad, normalize, uniform_, and unfold. [#76494](https://github.com/PaddlePaddle/Paddle/pull/76494), [#76600](https://github.com/PaddlePaddle/Paddle/pull/76600)
* Fixed dtype-related issues in window functions. [#76623](https://github.com/PaddlePaddle/Paddle/pull/76623)
* Fixed the view gradient when out_grad is non-contiguous. [#76679](https://github.com/PaddlePaddle/Paddle/pull/76679)
* Fixed the .cuda issue on custom_device. [#76755](https://github.com/PaddlePaddle/Paddle/pull/76755)
* Fixed an error when running SDPA in distributed settings. [#76782](https://github.com/PaddlePaddle/Paddle/pull/76782)
* Fixed an issue in paddle.compat.equal. [#76658](https://github.com/PaddlePaddle/Paddle/pull/76658)
* Fixed the zero-head issue with GQA in MultiHeadAttention. [#76761](https://github.com/PaddlePaddle/Paddle/pull/76761)
* Fixed recompilation in Dy2St triggered by dictionary ordering. [#76944](https://github.com/PaddlePaddle/Paddle/pull/76944)
* Fixed attn_mask expansion and variable naming issues in SDPA. [#76959](https://github.com/PaddlePaddle/Paddle/pull/76959)
* Fixed issues related to Tensor.data. [#76818](https://github.com/PaddlePaddle/Paddle/pull/76818), [#76933](https://github.com/PaddlePaddle/Paddle/pull/76933)
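
To make the dictionary-ordering recompilation fix above more concrete, the sketch below shows how a compilation cache keyed on insertion order can miss for two dictionaries that are otherwise equal, and how an order-independent key avoids the extra compile. This is a conceptual illustration only, assuming a cache keyed on the dictionary's items. It does not reflect how Dy2St actually builds its cache keys, and `CompileCache`, `naive_key`, and `stable_key` are hypothetical names.

```python
def naive_key(options):
    """Key depends on insertion order: equal dicts can produce different keys."""
    return tuple(options.items())


def stable_key(options):
    """Key is sorted by name, so insertion order no longer matters."""
    return tuple(sorted(options.items()))


class CompileCache:
    """Toy cache that counts how often a 'compile' is triggered."""

    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.entries = {}
        self.compile_count = 0

    def get(self, options):
        key = self.key_fn(options)
        if key not in self.entries:
            # Cache miss: simulate an expensive compilation.
            self.compile_count += 1
            self.entries[key] = f"compiled#{self.compile_count}"
        return self.entries[key]


# Same contents, different insertion order.
first = {"training": True, "seq_len": 128}
second = {"seq_len": 128, "training": True}

naive = CompileCache(naive_key)
naive.get(first)
naive.get(second)

stable = CompileCache(stable_key)
stable.get(first)
stable.get(second)

print(naive.compile_count, stable.compile_count)
```

With the order-sensitive key the second lookup misses and compiles again, while the sorted key reuses the first result.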

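## 4. Domestic Hardware Adaptation
XPU operator capabilities have been systematically enhanced: MoE-related operators gain support for bool, bfloat16, complex64, and other data types, and key modules such as FlashAttention, DeepEP, and Profiler have been improved, significantly strengthening Kunlunxin support for large-model training. In addition, a Hygon math library backend has been introduced to further improve inference performance on Hygon DCU chips.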

**New Features**
* Systematically extended data type support (bool, bf16, complex64, and others) for key XPU operators, including assign, assign_value, concat, fill_any, multiply, and scatter_nd. [#75249](https://github.com/PaddlePaddle/Paddle/pull/75249), [#76893](https://github.com/PaddlePaddle/Paddle/pull/76893), [#76903](https://github.com/PaddlePaddle/Paddle/pull/76903), [#76904](https://github.com/PaddlePaddle/Paddle/pull/76904), [#76912](https://github.com/PaddlePaddle/Paddle/pull/76912)
* Added XPU support for the index_elementwise_get and masked_fill operators.
* Added XCCL support on the aarch64 architecture. [#75797](https://github.com/PaddlePaddle/Paddle/pull/75797)
* Added XPU DeepEP support. [#76284](https://github.com/PaddlePaddle/Paddle/pull/76284), [#76362](https://github.com/PaddlePaddle/Paddle/pull/76362), [#76594](https://github.com/PaddlePaddle/Paddle/pull/76594), [#76869](https://github.com/PaddlePaddle/Paddle/pull/76869)
* Added the moe_gate_dispatch operator on XPU. [#75230](https://github.com/PaddlePaddle/Paddle/pull/75230)
* Added mixvector support for custom devices. [#75182](https://github.com/PaddlePaddle/Paddle/pull/75182)

**Enhancements**
* The profiler now supports collecting XPU Nvtx and CUPTI events. [#76385](https://github.com/PaddlePaddle/Paddle/pull/76385)
* Refactored FlashAttention and added float16 support for fa_taccum. [#76737](https://github.com/PaddlePaddle/Paddle/pull/76737)
* Upgraded XHPC to version 20251014. [#75872](https://github.com/PaddlePaddle/Paddle/pull/75872)
* Upgraded the underlying implementation of the pool2d and pool2d_grad operators to xpudnn.
* CUDAExtension now supports Custom Device. [#76876](https://github.com/PaddlePaddle/Paddle/pull/76876)
* Upgraded asin to arcsin to support NumPy 1.x. [#76485](https://github.com/PaddlePaddle/Paddle/pull/76485)

**Performance Optimizations**
* Introduced a Hygon math library backend to improve inference performance on Hygon devices. [#76266](https://github.com/PaddlePaddle/Paddle/pull/76266)

**Bug Fixes**
* Fixed misuse of pinned memory on Custom Device. [#75593](https://github.com/PaddlePaddle/Paddle/pull/75593)
* Fixed issues in a range of operators, including arange, beam_search_decode, binomial_kernel, combine, dispatch, expand, fused_layernorm, mask_select, mp_allreduce_sum, multiclass_nms3, nonzero, psroi_pool_grad, quantize_linear, and top_p_sampling. [#75532](https://github.com/PaddlePaddle/Paddle/pull/75532), [#75938](https://github.com/PaddlePaddle/Paddle/pull/75938), [#76238](https://github.com/PaddlePaddle/Paddle/pull/76238), [#76487](https://github.com/PaddlePaddle/Paddle/pull/76487), [#76547](https://github.com/PaddlePaddle/Paddle/pull/76547), [#76548](https://github.com/PaddlePaddle/Paddle/pull/76548), [#76561](https://github.com/PaddlePaddle/Paddle/pull/76561), [#76651](https://github.com/PaddlePaddle/Paddle/pull/76651), [#76666](https://github.com/PaddlePaddle/Paddle/pull/76666), [#76690](https://github.com/PaddlePaddle/Paddle/pull/76690), [#76792](https://github.com/PaddlePaddle/Paddle/pull/76792), [#76901](https://github.com/PaddlePaddle/Paddle/pull/76901), [#76906](https://github.com/PaddlePaddle/Paddle/pull/76906), [#77025](https://github.com/PaddlePaddle/Paddle/pull/77025)
* Fixed compilation issues for iluvatar_gpu and metax_gpu. [#75969](https://github.com/PaddlePaddle/Paddle/pull/75969)
* Updated xhpc to fix a strided_copy parameter-check issue. [#76213](https://github.com/PaddlePaddle/Paddle/pull/76213)
* Fixed a build failure when the -DWITH_XPTI=ON compile option is enabled. [#76577](https://github.com/PaddlePaddle/Paddle/pull/76577)
* Fixed issues in operators on XPU, including AddGradKernel, SoftmaxWithCrossEntropy, and view. [#76631](https://github.com/PaddlePaddle/Paddle/pull/76631)

## 5. Contributors
[ADchampion3](https://github.com/ADchampion3), [ALGO1832](https://github.com/algorithm1832), [AlAuAu](https://github.com/AlAuAu), [Android zhang](https://github.com/zade23), [Ayakouji](https://github.com/aquagull), [Bvicii](https://github.com/scyyh11), [Chang Lu](https://github.com/AndSonder), [Chen Zhiyang](https://github.com/changeyoung98), [Difer](https://github.com/Difers), [Echo-Nie](https://github.com/Echo-Nie), [Eddie-Wang](https://github.com/Eddie-Wang1120), [Fang Chengjie](https://github.com/wanglezz), [Felix Schladt](https://github.com/FelixSchladt), [Frank Lee](https://github.com/Forest-Lee), [GoldPancake](https://github.com/Deleter-D), [Gu Shiwei](https://github.com/swgu98), [HU Shenwei](https://github.com/hushenwei2000), [Haco](https://github.com/xiaohajiayou), [Hammer](https://github.com/hd9568), [Haonan Luo](https://github.com/Limerances), [Haze188 灏喆](https://github.com/Hz188), [HydrogenSulfate](https://github.com/HydrogenSulfate), [Jia Ningyu](https://github.com/jianingyu-ustc), [Jianbang Yang](https://github.com/dynamicheart), [Jiaxin Sui](https://github.com/plusNew001), [Jingzong Liu](https://github.com/tjujingzong), [Kunbo Ding](https://github.com/KB-Ding), [LLSGYN](https://github.com/LLSGYN), [Leo Guo](https://github.com/ZibinGuo), [LiaoYFBH](https://github.com/LiaoYFBH), [LixinGuo](https://github.com/cs2be), [Lucas](https://github.com/cqulilujia), [Luckycheng222](https://github.com/Luckycheng222), [Ma Xiaolong](https://github.com/maxiaolong001), [MayYouBeProsperous](https://github.com/MayYouBeProsperous), [MingkunZhang](https://github.com/StareAtYou), [Nyakku Shigure](https://github.com/SigureMo), [Pan Zhichen](https://github.com/pzc2004), [Patrisam](https://github.com/Patrisam), [Qianyue He](https://github.com/Enigmatisms), [Ricardo-shuo-liu](https://github.com/Ricardo-shuo-liu), [Rockway](https://github.com/LingmaFuture), [Ruibiao Chen](https://github.com/From00), [Runming Xie](https://github.com/youge325), 
[Ryan](https://github.com/DrRyanHuang), [SUN Dong](https://github.com/DanielSun11), [SZTULDH](https://github.com/SZTULDH), [Shuhao Liang](https://github.com/lshpku), [SidusAntares](https://github.com/SidusAntares), [Sunday019](https://github.com/Sunday019), [Sunny-bot1](https://github.com/Sunny-bot1), [Tao Luo](https://github.com/wojtuss), [Tianyu Zheng](https://github.com/zty-king), [Tofu](https://github.com/WHoutstanding), [Wang Jiabao](https://github.com/SpongeBob0318), [Wenfei (Charles) Qi](https://github.com/Manfredss), [Wennie396](https://github.com/Wennie396), [Xiangrui Yu](https://github.com/xxyux), [XiangzheWang](https://github.com/Waynezee), [XiaoguangHu](https://github.com/XiaoguangHu01), [YUNSHEN XIE](https://github.com/XieYunshen), [Yami](https://github.com/Le-soleile), [Yiqun Liu](https://github.com/Xreki), [Yohanna](https://github.com/YuhanXu), [Yuan Xiaolan](https://github.com/rsmallblue), [YuanRisheng](https://github.com/YuanRisheng), [Yuang Liu](https://github.com/FeixLiu), [Yufei Liao](https://github.com/LiaoYFBH), [Yuntao Nie](https://github.com/GITD245), [Yuqiang Ge](https://github.com/YqGe585), [Yutian Rao](https://github.com/raoyutian), [Z784555](https://github.com/Z784555), [Zero Rains](https://github.com/zeroRains), [Zhan Rongrui](https://github.com/zrr1999), [Zhang Ting](https://github.com/zhangting2020), [Zhaowu Pan](https://github.com/A-nnonymous), [ZhenxingLi](https://github.com/Xing-lil), [Zhou Xin](https://github.com/LittleHeroZZZX), [ZhouDuan](https://github.com/1184319564), [ZhouMinhao98](https://github.com/ZhouMinhao98), [Zx](https://github.com/ZhangX-21), [baiyue](https://github.com/DongBaiYue), [baoqiwen](https://github.com/baoqiwen), [bigwhite37](https://github.com/bigwhite37), [blacksheep-Aristotle](https://github.com/blacksheep-Aristotle), [bukejiyu](https://github.com/bukejiyu), [chen](https://github.com/ckl117), [chen2016013](https://github.com/chen2016013), [co63oc](https://github.com/co63oc), 
[cyberslack_lee](https://github.com/enkilee), [cyy536](https://github.com/cyy536), [dakelong](https://github.com/dakelong), [ddchenhao66](https://github.com/ddchenhao66), [dongzezhao](https://github.com/D0m021ng), [fangfangssj](https://github.com/fangfangssj), [fanhaoxuee](https://github.com/ApricityXX), [feri](https://github.com/feixi21), [fxyfxy777](https://github.com/fxyfxy777), [gouzil](https://github.com/gouzil), [huangjiyi](https://github.com/huangjiyi), [ice](https://github.com/aztice), [jzhang533](https://github.com/jzhang533), [lijialin03](https://github.com/lijialin03), [lijin23](https://github.com/lj970926), [liufengwei0103](https://github.com/liufengwei0103), [liuruyan](https://github.com/liuruyan), [lzy](https://github.com/heavyrain-lzy), [megemini](https://github.com/megemini), [mikethegoblin](https://github.com/mikethegoblin), [ooo oo](https://github.com/ooooo-create), [paddle-xpu-bot](https://github.com/paddle-xpu-bot), [qjyyy77](https://github.com/qjyyy77), [qw86972190](https://github.com/qw86972190), [sneaxiy](https://github.com/sneaxiy), [starcrown001](https://github.com/starcrown001), [tianhaodongbd](https://github.com/tianhaodongbd), [tianlef](https://github.com/tianlef), [tianshuo78520a](https://github.com/tianshuo78520a), [umiswing](https://github.com/umiswing), [waliwali777](https://github.com/waliwali777), [wanghuancoder](https://github.com/wanghuancoder), [wanrui](https://github.com/WanRui37), [xiaoguoguo626807](https://github.com/xiaoguoguo626807), [xingmingyyj](https://github.com/xingmingyyj), [xinruiM](https://github.com/xinruiM), [xuanyuanminzheng](https://github.com/xuanyuanminzheng), [xxiu1](https://github.com/xxiu1), [yangjianfengo1](https://github.com/yangjianfengo1), [yongqiangma](https://github.com/yongqiangma), [yunruoxi](https://github.com/fsylmxx), [zccjjj](https://github.com/zccjjj), [zhangbo9674](https://github.com/zhangbo9674), [zhanghonggeng](https://github.com/zhanghonggeng), 
[zhangyikun02](https://github.com/zhangyk0314), [zhangyuqin1998](https://github.com/zhangyuqin1998), [zhengshengning](https://github.com/zhengshengning), [zhupengyang](https://github.com/zhupengyang), [zhwesky2010](https://github.com/zhwesky2010), [zoeczy](https://github.com/zoeczy), [zyfncg](https://github.com/zyfncg), [zzm](https://github.com/zhiminzhang0830), [周周周](https://github.com/zhoutianzi666), [苍天荒](https://github.com/cangtianhuang), [椿下湫咚](https://github.com/L-CXQD), [正在学习](https://github.com/cszdrg), [学习中的牛马](https://github.com/Dayuxiaoshui)
