PyTorch Network Model Porting and Training Guide · 2020-11-05 · FrameworkPTAdapter V100R020C10...

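FrameworkPTAdapter V100R020C10

PyTorch Network Model Porting and Training Guide

Document version: 01

Release date: 2021-01-25

Huawei Technologies Co., Ltd.

Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved.

No part of this document may be excerpted, reproduced, or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademark Notice

The Huawei logo and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and registered trademarks mentioned in this document are the property of their respective holders.

Notice

The products, services, and features you purchase are subject to the commercial contract and terms of Huawei. All or part of the products, services, and features described in this document may not be within the scope of your purchase or usage. Unless otherwise specified in the contract, Huawei makes no express or implied statement or warranty regarding the content of this document.

This document is updated from time to time because of product version upgrades or other reasons. Unless otherwise agreed, this document serves only as a usage guide, and the statements, information, and recommendations in it do not constitute a warranty of any kind, express or implied.

Document version 01 (2021-01-25) Copyright © Huawei Technologies Co., Ltd.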

Contents

1 Overview
2 Installing the Framework and the Mixed Precision Module
    2.1 Obtaining the Software Packages
    2.2 Installing the PyTorch Framework
    2.3 Installing the Mixed Precision Module
3 Constraints and Limitations
4 Key Points of Network Model Migration
    4.1 Interface Replacement
5 Distributed Training
    5.1 Overview
    5.2 Typical Scenarios
        5.2.1 Single Server
        5.2.2 Server Cluster
        5.2.3 Atlas 300T Training Card (Model 9000)
    5.3 Distributed Training with the Allreduce Architecture
    5.4 Using the Collective Communication Interfaces
        5.4.1 Collective Communication Initialization
        5.4.2 Collective Communication Interfaces
6 Special Topics
    6.1 Mixed Precision Module Apex
    6.2 Performance Optimization
        6.2.1 Overview
        6.2.2 Changing the CPU Performance Mode
        6.2.3 Installing the High-Performance pillow Library
7 Script Execution
    7.1 Environment Variable Configuration
8 ResNet50 Training Example Based on the imagenet Dataset
    8.1 Obtaining the Sample
    8.2 Network Migration
        8.2.1 Single-Device (1P) Training Modifications
        8.2.2 Distributed Training Modifications
    8.3 Script Execution
9 FAQ
    9.1 "pip3.7 install Pillow==5.3.0" Fails
    9.2 Installing "torch-*.whl" Reports That "torch 1.5.0xxxx" Does Not Match the Version Required by "torchvision"
A Revision History

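1 Overview

The current solution for adapting PyTorch to the Ascend AI Processor is an online adaptation solution.

Features and Advantages

The Ascend AI Processor accelerates computation at operator granularity (OP-based): AscendCL invokes one D-affinity operator, or a combination of several, in place of the original GPU implementation. Figure 1-1 shows the logical model.

Figure 1-1 Logical model

The online adaptation solution was chosen for the following reasons:

1. It inherits the dynamic graph features of the PyTorch framework to the greatest possible extent.

2. It inherits the way GPUs are used in PyTorch to the greatest possible extent, so that porting to an Ascend AI Processor device requires minimal changes to the development approach and allows maximum code reuse.

3. It inherits the native PyTorch architecture to the greatest possible extent and keeps the strengths of the framework, such as automatic differentiation, dynamic dispatch, debugging, profiling, the Storage sharing mechanism, and dynamic memory management on the device side.

4. It is highly extensible. Once the end-to-end flow works, supporting a new network type or structure only requires developing and implementing the related compute operators. Framework operators, backward graph construction, and the implementation mechanisms can be reused.

5. Its usage and style are consistent with GPU usage. With the online adaptation solution, users only need to specify the Ascend AI Processor as the device on the Python side and in device-related operations to develop, train, and debug networks on the Ascend AI Processor in PyTorch. Users do not need to know the low-level details of the Ascend AI Processor. This keeps the required modifications to a minimum and makes platform migration inexpensive.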



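2 Installing the Framework and the Mixed Precision Module

2.1 Obtaining the Software Packages

Prerequisites

● The CANN runtime environment has been installed. For details, see the CANN Software Installation Guide (Development & Operation Scenarios, Command Line): https://support.huaweicloud.com/instg-cli-cann/atlascli_03_0001.html
● torch-*.whl and the mixed precision module Apex can be installed and run only on Ascend devices.

Obtaining the Software Packages

Before installation, obtain the software packages listed in the following table.

Table 2-1 Software packages

Package type | Package name | Download link
Deep learning framework | x86: torch-1.5.0+ascend-cp37-cp37m-linux_x86_64.whl | https://ascend.huawei.com/#/software/ai-frameworks
Deep learning framework | ARM: torch-1.5.0+ascend-cp37-cp37m-linux_aarch64.whl | https://ascend.huawei.com/#/software/ai-frameworks
Mixed precision module | x86: apex-0.1+ascend-cp37-cp37m-linux_x86_64.whl | https://ascend.huawei.com/#/software/ai-frameworks
Mixed precision module | ARM: apex-0.1+ascend-cp37-cp37m-linux_aarch64.whl | https://ascend.huawei.com/#/software/ai-frameworks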

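2.2 Installing the PyTorch Framework

Installation Procedure

Step 1 Log in to the server as the root user or as a non-root user.

Step 2 Run the following commands in sequence to install the PyTorch dependencies.

If you install Python and its dependencies as a non-root user, append --user to every command in this step, for example: pip3.7 install pyyaml --user

    pip3.7 install pyyaml
    pip3.7 install wheel
    pip3.7 install Pillow==5.3.0

If an error is reported, see the FAQ for troubleshooting.

Step 3 Install torchvision. If you install as a non-root user, append --user to the command.

On an x86 server, run:

    pip3.7 install torchvision==0.6.0 --no-deps

On an ARM server, run:

    pip3.7 install torchvision --no-deps

Step 4 Install PyTorch.

1. Copy torch-*.whl to the target server (skip this if PyTorch was built on that server). The package is then installed with pip.

2. Go to the directory that contains torch-*.whl.

    cd <directory containing torch-*.whl>

3. Install torch-*.whl (the x86 package is used as an example).

– As the root user:

    pip3.7 install --upgrade torch-1.5.0+ascend-cp37-cp37m-linux_x86_64.whl

– As a non-root user:

    pip3.7 install --upgrade torch-1.5.0+ascend-cp37-cp37m-linux_x86_64.whl --user

----End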

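2.3 Installing the Mixed Precision Module

Installation Procedure

Step 1 Make sure that the PyTorch framework adapted to the Ascend AI Processor works properly in the runtime environment.

Step 2 Upload the mixed precision module package apex-0.1+ascend-cp37-cp37m-linux_{arch}.whl to the runtime environment.

Step 3 Install the mixed precision module.

1. Go to the directory that contains apex-0.1+ascend-cp37-cp37m-linux_{arch}.whl.

    cd <directory containing apex-0.1+ascend-cp37-cp37m-linux_{arch}.whl>

2. Install apex-0.1+ascend-cp37-cp37m-linux_{arch}.whl (the x86 package is used as an example).

– As the root user:

    pip3.7 install apex-0.1+ascend-cp37-cp37m-linux_x86_64.whl

– As a non-root user:

    pip3.7 install apex-0.1+ascend-cp37-cp37m-linux_x86_64.whl --user

----End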



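3 Constraints and Limitations

1. Operators do not support unknown-shape inference in the infershape phase.
2. Operators that perform cube computation support only fp16.
3. AICPU is not supported.
4. Inputs of type inf/nan are not supported.
5. int64 is not supported.
6. Dimensions cannot be reduced when a format with more than 4 dimensions is used.
7. The current version of Apex is implemented in Python and does not support CUDA-like optimizations.
8. Collective communication constraints:

– In data parallel mode, the graph executed on every device is the same.
– Devices can be allocated only in units of 1, 2, 4, or 8 (1/2/4/8P).
– Only the int8, int32, float16, and float32 data types are supported.
– Server NIC names must start with eth.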
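The collective communication constraints in item 8 lend themselves to a quick pre-flight check. The following is a minimal sketch in plain Python, not part of the product: the function name check_collective_config and its parameters are hypothetical, and the device count is interpreted as the per-server allocation.

```python
# Hypothetical pre-flight check for the collective communication constraints.
SUPPORTED_DEVICE_COUNTS = {1, 2, 4, 8}
SUPPORTED_DTYPES = {"int8", "int32", "float16", "float32"}


def check_collective_config(num_devices, dtype, nic_name):
    """Return a list of constraint violations (empty if the config is fine)."""
    problems = []
    if num_devices not in SUPPORTED_DEVICE_COUNTS:
        problems.append("device count %d is not one of 1/2/4/8" % num_devices)
    if dtype not in SUPPORTED_DTYPES:
        problems.append("data type %s is not supported" % dtype)
    if not nic_name.startswith("eth"):
        problems.append("NIC name %s does not start with eth" % nic_name)
    return problems


print(check_collective_config(4, "float16", "eth0"))
print(check_collective_config(3, "int64", "enp3s0"))
```

The first call reports no violations; the second reports one violation per constraint.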



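4 Key Points of Network Model Migration

4.1 Interface Replacement

To let the Ascend AI Processor use the capabilities of the PyTorch framework, the native PyTorch framework is adapted at the device level. From the outside, this means that interfaces related to cpu and cuda must be switched.

When migrating a network, replace the device-related interfaces with their npu counterparts. Table 4-1 lists the device-related interfaces adapted so far.

Table 4-1 Interface replacement

Original PyTorch interface | Interface after adaptation to the Ascend chip | Description
torch.cuda.is_available() | torch.npu.is_available() | Checks whether a device is available in the current environment (does not guarantee the final result).
torch.cuda.current_device() | torch.npu.current_device() | Gets the device currently in use.
torch.cuda.device_count() | torch.npu.device_count() | Gets the number of devices in the current environment.
torch.cuda.set_device() | torch.npu.set_device() | Sets the device currently in use.
torch.cuda.synchronize() | torch.npu.synchronize() | Waits synchronously for events to complete.
torch.cuda.device | torch.npu.device | Creates a device class that can perform device-related operations.
torch.cuda.Stream(device) | torch.npu.Stream(device) | Creates a stream object.
torch.cuda.stream(Stream) | torch.npu.stream(Stream) | Mostly used to limit a scope.
torch.cuda.current_stream() | torch.npu.current_stream() | Gets the current stream.
torch.cuda.default_stream() | torch.npu.default_stream() | Gets the default stream.
torch.tensor([1,2,3]).is_cuda | torch.tensor([1,2,3]).is_npu | Checks whether a tensor is in the cuda/npu device format.
torch.tensor([1,2,3]).cuda() | torch.tensor([1,2,3]).npu() | Converts a tensor to the cuda/npu format.
torch.tensor([1,2,3]).to("cuda") | torch.tensor([1,2,3]).to('npu') | Converts a tensor to the cuda/npu format.
device = torch.device("cuda:0") | device = torch.device("npu:0") | Specifies a device.
torch.autograd.profiler.profile(use_cuda=True) | torch.autograd.profiler.profile(use_npu=True) | Specifies that cuda/npu is used while the profiler runs.
torch.cuda.Event() | torch.npu.Event() | Returns an event on a device.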
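Because the replacements in Table 4-1 are regular, a first pass over a script can be done mechanically. The following is a rough sketch using only the Python standard library; the helper port_to_npu and its rule list are hypothetical and not provided by the framework. It rewrites text blindly (including comments and strings), so the result still has to be reviewed by hand.

```python
import re

# (pattern, replacement) pairs derived from Table 4-1. Hypothetical helper,
# intended only as a first pass before manual review.
_RULES = [
    (re.compile(r"\btorch\.cuda\."), "torch.npu."),       # torch.cuda.xxx -> torch.npu.xxx
    (re.compile(r"\.is_cuda\b"), ".is_npu"),               # tensor.is_cuda -> tensor.is_npu
    (re.compile(r"\.cuda\("), ".npu("),                    # tensor.cuda()  -> tensor.npu()
    (re.compile(r"\buse_cuda\b"), "use_npu"),              # profiler keyword argument
    (re.compile(r"""(["'])cuda(:\d+)?\1"""), r"\1npu\2\1"),  # "cuda" / "cuda:0" literals
]


def port_to_npu(source):
    """Apply every Table 4-1 rule to the given source text."""
    for pattern, replacement in _RULES:
        source = pattern.sub(replacement, source)
    return source


print(port_to_npu('device = torch.device("cuda:0")'))
print(port_to_npu("if torch.cuda.is_available(): x = x.cuda()"))
```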



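5 Distributed Training

5.1 Overview

Large-scale AI training clusters usually train in data parallel mode. In data parallelism, every device uses the same model but different training samples, and the gradients computed on each device must be aggregated before the parameters are updated.

Figure 5-1 Data-parallel training

Classified by how gradients are aggregated, the mainstream data-parallel implementations are the PS-workers architecture and AllReduce collective communication. The Ascend platform supports both. For detailed usage, see section 5.3.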

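5.2 Typical Scenarios

5.2.1 Single Server

In the single-server scenario, one training server (Server) performs the training. Each Server contains eight chips (Ascend AI Processors). The number of chips that participate in collective communication must be 1, 2, 4, or 8. Devices 0-3 and devices 4-7 each form a separate network. When training with 2 or 4 devices, a device group cannot span the two networks.

Figure 5-2 Single-server training

Note

When PyTorch initializes HCCL through automatic topology detection, a 2P group can consist only of devices 0 and 1, 2 and 3, 4 and 5, or 6 and 7. For 4P training, only devices 0-3 or devices 4-7 can form a 4P group.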
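The grouping rule in the note can be written down as a small lookup. The sketch below is illustrative only: VALID_GROUPS and is_valid_group are hypothetical names, and allowing any single device for 1P is an assumption, since the text only lists the 2P and 4P groups explicitly.

```python
# Device groups allowed on one 8-device Server under automatic topology
# detection. The 1P and 8P entries are assumptions; 2P and 4P follow the note.
VALID_GROUPS = {
    1: [(d,) for d in range(8)],
    2: [(0, 1), (2, 3), (4, 5), (6, 7)],
    4: [(0, 1, 2, 3), (4, 5, 6, 7)],
    8: [tuple(range(8))],
}


def is_valid_group(device_ids):
    """Return True if the given device IDs form an allowed 1/2/4/8P group."""
    ids = tuple(sorted(device_ids))
    return ids in VALID_GROUPS.get(len(ids), [])


print(is_valid_group([0, 1]))        # a listed 2P pair
print(is_valid_group([1, 2]))        # not a listed pair
print(is_valid_group([2, 3, 4, 5]))  # spans both networks
```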

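5.2.2 Server Cluster

Typical Cluster Networking

In the Server cluster scenario, a cluster management master node and a group of training servers (Servers) form a training server cluster. Up to 128 Servers are currently supported. Each Server contains eight chips (Ascend AI Processors). In this scenario, the number of chips that participate in collective communication is 8*n, where n is the number of Servers participating in training. Cluster performance is best when n is a power of 2, so this configuration is recommended.

Figure 5-3 Cluster training

Note

The cluster management master node manages the cluster and the devices in it, and also manages distributed jobs across the whole cluster.

Distributed Training Execution Flow

A training job is delivered by the cluster management master node to the training servers. The job agent on each server starts as many training processes as the number of devices specified by the App. Each process corresponds to one Ascend AI Processor.

NIC Port Trimming Scenario

Normally, each Server uses eight direct-out network ports for collective communication between Servers. In some cases, each Server uses only 1, 2, or 4 network ports. In the port trimming scenario:

● Each Server exposes 1, 2, or 4 network ports from devices at the same positions to form the training cluster. Recommended port selection:

– 1 port: performance is the same whichever port is selected.

– 2 ports: performance is best with [0, 5], [1, 4], [2, 7], or [3, 6].

– 4 ports: performance is best with [0, 2, 5, 7] or [1, 3, 4, 6].

● broadcast/allreduce/reduce_scatter/allgather across the whole training cluster are implemented through intra-node and inter-node collective communication operations. Data between nodes can be transmitted only through the 1, 2, or 4 available network ports.

● The number of chips that participate in collective communication must be 8*n, where n is the number of Servers participating in training. 1*n, 2*n, and 4*n are not supported.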
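The sizing rules above (8 chips per Server, at most 128 Servers, powers of 2 preferred, and the recommended port sets) can be captured in a few lines. This is a sketch with hypothetical names (cluster_chip_count, is_power_of_two, RECOMMENDED_PORTS), not an interface of the product.

```python
MAX_SERVERS = 128
CHIPS_PER_SERVER = 8

# Recommended device positions for the trimmed ports, as listed above.
RECOMMENDED_PORTS = {
    2: [[0, 5], [1, 4], [2, 7], [3, 6]],
    4: [[0, 2, 5, 7], [1, 3, 4, 6]],
}


def cluster_chip_count(num_servers):
    """Number of chips in collective communication for n Servers (8*n)."""
    if not 1 <= num_servers <= MAX_SERVERS:
        raise ValueError("the number of Servers must be between 1 and 128")
    return CHIPS_PER_SERVER * num_servers


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...: the Server counts with the best performance."""
    return n > 0 and n & (n - 1) == 0


for servers in (4, 12, 16):
    print(servers, cluster_chip_count(servers), is_power_of_two(servers))
```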



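5.2.3 Atlas 300T Training Card (Model 9000)

The training card currently supports single-server single-card training and multi-server multi-card distributed training. Each training card contains one Ascend AI Processor.

Typical Networking

Multi-server distributed training can use the 100G network port on each training card for transmission between Servers, and uses the Ring + Halving-doubling algorithm for collective communication.

Figure 5-4 Networking diagram

Precautions

1. Every Server must have the same number of training cards.
2. The NIC IP addresses of all training cards in the network must be on the same network segment.
3. Only allreduce and broadcast are currently supported.
4. Before training, set the communication mode between cards through the environment variables HCCL_INTRA_PCIE_ENABLE and HCCL_INTRA_ROCE_ENABLE. The PCIe ring is used by default; the RoCE ring is recommended.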

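5.3 Distributed Training with the Allreduce Architecture

Overview

In the Allreduce architecture, the Devices that participate in training form a ring, as shown in Figure 5-5, and no central node aggregates the computed gradients. The AllReduce algorithm places the participating Devices in a logical ring. Each Device receives data from its upstream Device and sends data to its downstream Device, which makes full use of the upstream and downstream bandwidth of every Device.

Figure 5-5 Allreduce architecture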
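To make the ring data flow concrete, the following pure-Python simulation runs the textbook ring allreduce (a reduce-scatter phase followed by an allgather phase) on plain lists. It is an illustration under stated assumptions, not a description of how HCCL is implemented; the function name ring_allreduce and the chunking scheme are chosen for this sketch.

```python
def ring_allreduce(buffers):
    """Simulate a sum allreduce over a logical ring.

    buffers[r] is the list of numbers held by device r; all lists have the
    same length. Returns the per-device lists after the allreduce.
    """
    n = len(buffers)
    length = len(buffers[0])
    data = [list(b) for b in buffers]  # work on copies
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [(c * length) // n for c in range(n + 1)]

    def chunk(c):
        return range(bounds[c % n], bounds[c % n + 1])

    # Phase 1, reduce-scatter: after n - 1 steps device r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot what every device sends before anyone receives.
        sent = [[data[r][i] for i in chunk(r - step)] for r in range(n)]
        for r in range(n):
            downstream = (r + 1) % n
            for i, value in zip(chunk(r - step), sent[r]):
                data[downstream][i] += value

    # Phase 2, allgather: the completed chunks travel once around the ring.
    for step in range(n - 1):
        sent = [[data[r][i] for i in chunk(r + 1 - step)] for r in range(n)]
        for r in range(n):
            downstream = (r + 1) % n
            for i, value in zip(chunk(r + 1 - step), sent[r]):
                data[downstream][i] = value
    return data


gradients = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
print(ring_allreduce(gradients))
```

In each step every device sends one chunk downstream and receives one chunk from upstream, which is why both directions of every link stay busy.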

Using DistributedDataParallel for Distributed Training

PyTorch performs distributed training through DistributedDataParallel: init_process_group is executed in the model initialization phase, and the model is then wrapped as a DistributedDataParallel model.

PyTorch distributed training code example (some code omitted):

    import torch
    import torch.nn as nn
    import torch.nn.parallel
    import torch.backends.cudnn as cudnn
    import torch.distributed as dist
    import torch.optim
    import torch.multiprocessing as mp
    import torch.utils.data
    import torch.utils.data.distributed
    import torchvision.transforms as transforms
    import torchvision.datasets as datasets
    import torchvision.models as models

    def main():
        # parser, model, train_dataset, train_sampler, criterion, optimizer,
        # ngpus_per_node, and lr_scheduler are defined in the omitted code.
        args = parser.parse_args()
        # Initialize the process group before wrapping the model.
        dist.init_process_group(backend=args.dist_backend, world_size=args.world_size, rank=args.rank)
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
        train_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
            num_workers=args.workers, pin_memory=True, sampler=train_sampler)
        for epoch in range(args.start_epoch, args.epochs):
            acc1 = train(train_loader, model, criterion, optimizer, epoch, args, ngpus_per_node, lr_scheduler)

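5.4 Using the Collective Communication Interfaces

5.4.1 Collective Communication Initialization

In PyTorch, collective communication is initialized through the torch.nn.parallel.DistributedDataParallel class. As the following code shows, first initialize the process group, setting the backend to hccl together with world_size and the other parameters, and then instantiate the model as a torch.nn.parallel.DistributedDataParallel model.

Example:

    >>> dist.init_process_group(backend=args.dist_backend, world_size=args.world_size, rank=args.rank)
    >>> model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])

5.4.2 Collective Communication Interfaces

Call the operator prototype interfaces on the tensors that require collective communication.

allreduce

allreduce provides the allreduce collective communication function within a group. It performs a reduce operation on the tensors of the same name on all nodes. The reduce operation is specified by the reduction parameter.

broadcast

broadcast provides the broadcast collective communication function within a group. It broadcasts the data of the root node to the other ranks.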
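The semantics of the two interfaces can be stated as a reference model over plain lists, one list per rank. The sketch below only illustrates what the results look like; the function signatures and the reduction names ("sum", "prod", "max", "min") are assumptions of this example and not the operator prototypes themselves.

```python
import operator
from functools import reduce

# Reduction names are illustrative; the real set is defined by the operator.
REDUCTIONS = {"sum": operator.add, "prod": operator.mul, "max": max, "min": min}


def allreduce(per_rank, reduction="sum"):
    """Every rank ends up with the element-wise reduction over all ranks."""
    op = REDUCTIONS[reduction]
    combined = [reduce(op, column) for column in zip(*per_rank)]
    return [list(combined) for _ in per_rank]


def broadcast(per_rank, root=0):
    """Every rank ends up with a copy of the root rank's data."""
    return [list(per_rank[root]) for _ in per_rank]


ranks = [[1, 2], [3, 4]]
print(allreduce(ranks))            # sum over ranks
print(allreduce(ranks, "max"))     # element-wise maximum
print(broadcast(ranks, root=1))    # rank 1's data everywhere
```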



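6 Special Topics

6.1 Mixed Precision Module Apex

Overview

Because of the architecture of the NPU, training involves mixed precision, that is, scenarios that mix the float16 and float32 data types. Using float16 instead of float32 has the following benefits:

● Intermediate variables occupy less memory, which saves memory.
● Because less memory is used, data transfer time is roughly halved as well.
● float16 compute units deliver higher compute performance.

However, mixed precision training is limited by the range and precision that float16 can represent. Simply converting float32 to float16 affects training convergence. To use float16 to accelerate part of the computation while still ensuring convergence, the mixed precision module Apex is used. Apex is a comprehensive optimization library that combines performance optimization with convergence accuracy.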
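The range limitation can be observed without any NPU or GPU: the struct module of the Python standard library can round a value to IEEE 754 half precision. The sketch below shows a small gradient underflowing to zero in float16 and surviving once it is multiplied by a loss scale. The values 1e-8 and 1024 are chosen for illustration only, and to_float16 is a helper defined for this example.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # a small gradient that float32 represents without trouble
loss_scale = 1024.0  # illustrative static loss scale

unscaled = to_float16(grad)             # too small for float16: becomes 0.0
scaled = to_float16(grad * loss_scale)  # representable as a float16 subnormal
recovered = scaled / loss_scale         # unscale again in higher precision

print(unscaled, scaled, recovered)
```

Scaling the loss scales every gradient by the same factor, so the update can be unscaled afterwards in float32 with only a small rounding error.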

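Supported Features

Table 6-1 describes the functions and optimizations of the mixed precision module.

Table 6-1 Functions of the mixed precision module

Function | Description
O1 configuration mode | Conv, Matmul, and similar operators compute in float16; others, such as Softmax and BN, use float32.
O2 configuration mode | Except for BN, which uses float32, almost everything uses float16.
Static loss scale | Sets the loss scale statically to ensure that mixed precision training converges.
Dynamic loss scale | Computes the loss scale value dynamically and checks whether overflow occurred.

Note

The current version is implemented mainly in Python and does not support AscendCL or CUDA optimization.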
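The dynamic loss scale row of Table 6-1 can be illustrated with a small state machine: shrink the scale and skip the step when gradients overflow, and grow it again after a run of clean steps. The class below is a generic sketch of that idea. The class name, the factor of 2, and the window length are assumptions of this example and are not taken from the Apex source.

```python
import math


class DynamicLossScaler:
    """Generic dynamic loss scaling sketch (parameters are illustrative)."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, window=2000):
        self.scale = init_scale
        self.factor = factor
        self.window = window
        self.good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Overflow: reduce the scale and skip this step.
            self.scale /= self.factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps == self.window:
            # A full window without overflow: try a larger scale.
            self.scale *= self.factor
            self.good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=1024.0, window=2)
for grads in ([float("inf")], [0.1], [0.2]):
    applied = scaler.update(grads)
    print(applied, scaler.scale)
```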

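Integrating the Mixed Precision Module into a PyTorch Model

Step 1 To use the apex mixed precision module, first import amp from the apex library:

    from apex import amp

Step 2 After importing amp, initialize it so that it can make the necessary changes to the model, the optimizer, and PyTorch internal functions:

    model, optimizer = amp.initialize(model, optimizer)

Step 3 Mark where backpropagation (.backward()) takes place, so that Amp can apply loss scaling and clear the per-iteration state.

Original code:

    loss = criterion(…)
    loss.backward()
    optimizer.step()

Code modified to support loss scaling:

    loss = criterion(…)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

----End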

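6.2 Performance Optimization

6.2.1 Overview

When PyTorch models are migrated and trained on an x86 server, some network models process too few images per second (fps) and do not meet the performance target. In this case, apply the following optimizations to the server:

● Change the CPU performance mode.

● Install the high-performance pillow library.

6.2.2 Changing the CPU Performance Mode

Setting the Power Policy to High-Performance Mode

To improve network performance, set the power policy to high-performance mode in the BIOS of the x86 server as follows.

Step 1 Log in to the iBMC web interface, start the virtual console, and select the HTML5 integrated remote console for remote control, as shown in Figure 6-1.

Figure 6-1 Remote login console

Step 2 On the toolbar of the virtual console, click the boot option tool. The boot option configuration page is displayed, as shown in Figure 6-2.

Figure 6-2 Boot option tool

Step 3 On the boot option configuration page, select "BIOS Setup", and then click the restart tool on the toolbar of the virtual console to restart the server.

Step 4 After the system restarts, the BIOS configuration screen is displayed. Choose "Advanced" > "Socket Configuration", as shown in Figure 6-3. For details about the BIOS, see https://support.huawei.com/enterprise/en/doc/EDOC1000163372/e90a0096/introduction-to-bios-v3xx-and-earlier-versions

Figure 6-3 Socket Configuration

Step 5 Go to Advanced Power Mgmt. Configuration and set Power Policy to Performance, as shown in Figure 6-4.

Figure 6-4 Setting the power policy

Step 6 Press "F10" to save the configuration and restart the server.

----End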



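Setting the CPU to performance Mode

Perform the following operations as the root user.

Step 1 Run the following command to check the current CPU mode.

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

The command prints the current CPU mode. See Table 6-2.

Table 6-2 CPU modes

Governor | Description
performance | Runs at the maximum frequency.
powersave | Runs at the minimum frequency.
userspace | Runs at the frequency specified by the user.
ondemand | Adjusts the CPU frequency quickly and dynamically on demand. As soon as a CPU-intensive task appears, the CPU jumps to the maximum frequency; the frequency is lowered as idle time increases.
conservative | Adjusts the CPU frequency dynamically on demand, but more gradually than ondemand.
schedutil | Adjusts the CPU frequency based on the scheduler.

Step 2 Install the tool with the following commands.

ubuntu/debian:

    apt-get install linux-tools-$(uname -r)

centos/bclinux/euler:

    yum install kernel-tools -y
    systemctl daemon-reload
    systemctl enable cpupower
    systemctl start cpupower

Step 3 Set the CPU to performance mode.

    cpupower frequency-set -g performance

Step 4 Repeat Step 1 to check whether the mode has changed.

----End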
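Step 1 checks only cpu0. To verify every core after the change, the same sysfs file can be read for all CPUs. The following is a small sketch using the Python standard library; the function names are chosen for this example, and it assumes the standard Linux cpufreq sysfs layout used in Step 1.

```python
import glob
import os


def read_governors(base="/sys/devices/system/cpu"):
    """Map each CPU directory name (cpu0, cpu1, ...) to its current governor."""
    governors = {}
    pattern = os.path.join(base, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]
        with open(path) as f:
            governors[cpu] = f.read().strip()
    return governors


def not_in_performance_mode(governors):
    """Return the CPUs whose governor is anything other than performance."""
    return sorted(cpu for cpu, gov in governors.items() if gov != "performance")


if __name__ == "__main__":
    # An empty list means every CPU that exposes cpufreq is in performance mode.
    print(not_in_performance_mode(read_governors()))
```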

6.2.3 Installing the High-Performance pillow Library

Step 1 Install the dependencies of the high-performance pillow library.

ubuntu/debian:

    apt-get install libtiff5-dev libjpeg8-dev libopenjp2-7-dev zlib1g-dev libfreetype6-dev liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python3-tk libharfbuzz-dev libfribidi-dev libxcb1-dev

centos/bclinux/euler:

    yum install libtiff-devel libjpeg-devel openjpeg2-devel zlib-devel freetype-devel lcms2-devel libwebp-devel tcl-devel tk-devel harfbuzz-devel fribidi-devel libraqm-devel libimagequant-devel libxcb-devel

Step 2 Install the high-performance pillow library.

1. Uninstall the native pillow.

    pip3.7 uninstall -y pillow

2. Install the SSE4 version of pillow-simd. The following command is for the root user; as a non-root user, append --user to the command.

    pip3.7 install pillow-simd

Note

If the CPU supports the AVX2 instruction set, you can install the AVX2 version of pillow-simd instead:

    CC="cc -mavx2" pip3.7 install -U --force-reinstall pillow-simd

Step 3 Modify the torchvision code to work around the missing PILLOW_VERSION in pillow-simd.

Change line 5 of /usr/local/python3.7.5/lib/python3.7/site-packages/torchvision/transforms/functional.py as follows:

    try:
        from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
    except ImportError:
        # pillow-simd does not export PILLOW_VERSION, so define it here.
        from PIL import Image, ImageOps, ImageEnhance
        PILLOW_VERSION = "7.0.0"

----End
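Whether the AVX2 build mentioned in the note of Step 2 applies can be read from /proc/cpuinfo on an x86 Linux server. The sketch below parses the flags line and prints the matching command from Step 2. The helper names are hypothetical, and the script only prints a suggestion; it does not install anything.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def pillow_simd_install_command(flags):
    """Pick the pillow-simd install command from Step 2 based on AVX2 support."""
    if "avx2" in flags:
        return 'CC="cc -mavx2" pip3.7 install -U --force-reinstall pillow-simd'
    return "pip3.7 install pillow-simd"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not a Linux system: fall back to the default command
    print(pillow_simd_install_command(cpu_flags(text)))
```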



7 Script Execution

7.1 Environment Variable Configuration

Because training involves a number of startup parameters, it is recommended that you create a bash startup script and upload it to the runtime environment. Training can then be started by running bash run_npu.sh. The startup script configures the environment variables that the training process depends on and launches the training script. A sample script follows.

Note

This example assumes that the fwkacllib/tfplugin/opp packages are installed in /home/HwHiAiUser/Ascend and that the driver package is installed in /usr/local/Ascend.

    # Environment variables that the training process depends on
    export LD_LIBRARY_PATH=/usr/local/:/usr/local/python3.7.5/lib/:/usr/local/openblas/lib:/usr/local/lib/:/usr/lib64/:/usr/lib/:/usr/local/Ascend/nnae/latest/fwkacllib/lib64/:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/:/usr/lib/aarch64-linux-gnu:$LD_LIBRARY_PATH
    export PATH=$PATH:/usr/local/Ascend/nnae/latest/fwkacllib/ccec_compiler/bin/:/usr/local/Ascend/nnae/latest/toolkit/tools/ide_daemon/bin/
    export ASCEND_OPP_PATH=/usr/local/Ascend/nnae/latest/opp/
    export OPTION_EXEC_EXTERN_PLUGIN_PATH=/usr/local/Ascend/nnae/latest/fwkacllib/lib64/plugin/opskernel/libfe.so:/usr/local/Ascend/nnae/latest/fwkacllib/lib64/plugin/opskernel/libaicpu_engine.so:/usr/local/Ascend/nnae/latest/fwkacllib/lib64/plugin/opskernel/libge_local_engine.so
    export PYTHONPATH=/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/auto_tune.egg/auto_tune:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/schedule_search.egg:$PYTHONPATH

    # Launch the training script
    python3.7 /home/test/xxx.py \

Table 7-1 describes the environment variables that the training process depends on.

Table 7-1 Environment variables

Item | Description | Optional/Required
LD_LIBRARY_PATH | Search path for dynamic libraries. Configure it as in the example above. | Required
PYTHONPATH | Python search path. Configure it as in the example above. | Required
PATH | Search path for executables. Configure it as in the example above. | Required
ASCEND_OPP_PATH | Root directory of the operators. Configure it as in the example above. | Required
OPTION_EXEC_EXTERN_PLUGIN_PATH | Path of the operator information library. | Required
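Since all five variables in Table 7-1 are required, a training script can fail fast when one of them is missing. The following is a minimal sketch; the function name missing_variables is hypothetical, and the check only tests that a variable is set and non-empty, not that its paths exist.

```python
import os

# The required variables from Table 7-1.
REQUIRED_VARS = [
    "LD_LIBRARY_PATH",
    "PYTHONPATH",
    "PATH",
    "ASCEND_OPP_PATH",
    "OPTION_EXEC_EXTERN_PLUGIN_PATH",
]


def missing_variables(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```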



8 ResNet50 Training Example Based on the imagenet Dataset

8.1 Obtaining the Sample

Obtaining the Sample

1. This sample adapts the Imagenet training example provided on the PyTorch website so that it runs on the Ascend 910 AI Processor. The sample is available at https://github.com/pytorch/examples/tree/master/imagenet.

2. The Resnet50 model follows the model on the PyTorch website, https://pytorch.org/hub/pytorch_vision_resnet/. It can be used in either of the following ways.

a. Call the corresponding interface directly, for example:

    import torchvision.models as models
    model = models.resnet50()

Note

Resnet50 is a built-in PyTorch model. To learn about other built-in models, visit the PyTorch website at https://pytorch.org/.

b. Set the arch parameter to resnet50 when running the script, as shown below. This sample uses this approach. See section 8.3.

    --arch resnet50

Directory Structure

The main files are organized as follows:

├──main.py

8.2 Network Migration

8.2.1 Single-Device (1P) Training Modifications

1. Add the following import to main.py so that the PyTorch-based model can be trained on the Ascend 910 AI Processor:

    import torch.npu

2. After the imports in main.py, add a parameter that specifies the Ascend 910 AI Processor used for training:

    CALCULATE_DEVICE = "npu:1"

3. Modify the parameters and the conditional branches so that training runs only on the Ascend 910 AI Processor. Code location: the main_worker() function in main.py (the modified parts are delimited by the npu modify begin/end comments):

    def main_worker(gpu, ngpus_per_node, args):
        global best_acc1
        # The original code trains on a GPU:
        # args.gpu = gpu
        ############## npu modify begin #############
        args.gpu = None
        ############## npu modify end #############
        if args.gpu is not None:
            print("Use GPU: {} for training".format(args.gpu))

        if args.distributed:
            if args.dist_url == "env://" and args.rank == -1:
                args.rank = int(os.environ["RANK"])
            if args.multiprocessing_distributed:
                # For multiprocessing distributed training, rank needs to be the
                # global rank among all the processes
                args.rank = args.rank * ngpus_per_node + gpu
            dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                    world_size=args.world_size, rank=args.rank)
        # create model
        if args.pretrained:
            print("=> using pre-trained model '{}'".format(args.arch))
            model = models.__dict__[args.arch](pretrained=True)
        else:
            print("=> creating model '{}'".format(args.arch))
            model = models.__dict__[args.arch]()
        # The original code checks whether training runs on a GPU:
        # if not torch.cuda.is_available():
        #     print('using CPU, this will be slow')
        # elif args.distributed:
        ############## npu modify begin #############
        # After migration, only check whether distributed training is used;
        # the GPU check is removed.
        if args.distributed:
        ############## npu modify end #############
            # For multiprocessing distributed, DistributedDataParallel constructor
            # should always set the single device scope, otherwise,
            # DistributedDataParallel will use all available devices.
            if args.gpu is not None:
                ......

4. Move the model and the loss function to the Ascend 910 AI Processor. Code location: the main_worker() function in main.py (the modified parts are delimited by the npu modify begin/end comments):

        elif args.gpu is not None:
            torch.cuda.set_device(args.gpu)
            model = model.cuda(args.gpu)
        else:
            # DataParallel will divide and allocate batch_size to all available GPUs
            if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
                model.features = torch.nn.DataParallel(model.features)
                model.cuda()
            else:
                # The original code uses torch.nn.DataParallel() to train on multiple GPUs:
                # model = torch.nn.DataParallel(model).cuda()
                ############## npu modify begin #############
                # Move the model to the NPU for training.
                model = model.to(CALCULATE_DEVICE)
                ############## npu modify end #############

        # The original code computes the loss function on the GPU:
        # # define loss function (criterion) and optimizer
        # criterion = nn.CrossEntropyLoss().cuda(args.gpu)
        ############## npu modify begin #############
        # Move the loss function to the NPU.
        criterion = nn.CrossEntropyLoss().to(CALCULATE_DEVICE)
        ############## npu modify end #############

5. Convert the dataset labels (target) to int32 to avoid an operator error, and move the data to the Ascend 910 AI Processor.

– Code location: the train() function in main.py (the modified parts are delimited by the npu modify begin/end comments):

        for i, (images, target) in enumerate(train_loader):
            # measure data loading time
            data_time.update(time.time() - end)

            if args.gpu is not None:
                images = images.cuda(args.gpu, non_blocking=True)
            # The original code loads the training data onto the GPU:
            # if torch.cuda.is_available():
            #     target = target.cuda(args.gpu, non_blocking=True)
            ############## npu modify begin #############
            # Move the data to the NPU and change the data type of target.
            if 'npu' in CALCULATE_DEVICE:
                target = target.to(torch.int32)
            images, target = images.to(CALCULATE_DEVICE, non_blocking=True), target.to(CALCULATE_DEVICE, non_blocking=True)
            ############## npu modify end #############

– Code location: the validate() function in main.py (the modified parts are delimited by the npu modify begin/end comments):

        with torch.no_grad():
            end = time.time()
            for i, (images, target) in enumerate(val_loader):
                if args.gpu is not None:
                    images = images.cuda(args.gpu, non_blocking=True)
                # The original code loads the validation data onto the GPU:
                # if torch.cuda.is_available():
                #     target = target.cuda(args.gpu, non_blocking=True)
                ############## npu modify begin #############
                # Move the data to the NPU and change the data type of target.
                if 'npu' in CALCULATE_DEVICE:
                    target = target.to(torch.int32)
                images, target = images.to(CALCULATE_DEVICE, non_blocking=True), target.to(CALCULATE_DEVICE, non_blocking=True)
                ############## npu modify end #############

6. Set the device currently in use. Code location: the main entry point in main.py (the modified parts are delimited by the npu modify begin/end comments):

    if __name__ == '__main__':
        ############## npu modify begin #############
        if 'npu' in CALCULATE_DEVICE:
            torch.npu.set_device(CALCULATE_DEVICE)
        ############## npu modify end #############
        main()

8.2.2 Distributed Training Modifications

1. Add the following imports to main.py so that the PyTorch-based model can be trained on the Ascend 910 AI Processor with mixed precision.

    import torch.npu
    from apex import amp

2. Add the following arguments, which specify the Ascend 910 AI Processors that participate in training and the parameters required for mixed precision training.

    parser.add_argument('--device', default='npu', type=str, help='npu or gpu')
    parser.add_argument('--addr', default='10.136.181.115', type=str, help='master addr')
    parser.add_argument('--device-list', default='0,1,2,3,4,5,6,7', type=str, help='device id list')
    parser.add_argument('--amp', default=False, action='store_true', help='use amp to train the model')
    parser.add_argument('--loss-scale', default=1024., type=float, help='static loss scale used by amp')
    parser.add_argument('--opt-level', default='O2', type=str, help='amp optimization level, for example O1 or O2')

3. Create a function that maps process IDs to device IDs so that training runs on the specified devices. Add the following function to main.py.

    def device_id_to_process_device_map(device_list):
        devices = device_list.split(",")
        devices = [int(x) for x in devices]
        devices.sort()

        process_device_map = dict()
        for process_id, device_id in enumerate(devices):
            process_device_map[process_id] = device_id

        return process_device_map

4. Specify the IP address and port of the training server.

Code location: the main() function in main.py (the modified parts are delimited by the npu modify begin/end comments).

    def main():
        args = parser.parse_args()
        ############## npu modify begin #############
        os.environ['MASTER_ADDR'] = args.addr
        os.environ['MASTER_PORT'] = '29688'
        ############## npu modify end #############

5. Create the process-to-device mapping and get the number of Ascend 910 AI Processors on a single node.

Code location: the main() function in main.py (the modified parts are delimited by the npu modify begin/end comments).

        args.distributed = args.world_size > 1 or args.multiprocessing_distributed
        ############## npu modify begin #############
        args.process_device_map = device_id_to_process_device_map(args.device_list)
        if args.device == 'npu':
            ngpus_per_node = len(args.process_device_map)
        else:
            ngpus_per_node = torch.cuda.device_count()
        ############## npu modify end #############
        # Original code:
        # ngpus_per_node = torch.cuda.device_count()

6. Get the ID of the Ascend 910 AI Processor that corresponds to the process ID, and train on that processor.

Code location: main_worker() in main.py (the modified parts are delimited by the npu modify begin/end comments).

    def main_worker(gpu, ngpus_per_node, args):
        global best_acc1
        ############## npu modify begin #############
        args.gpu = args.process_device_map[gpu]
        ############## npu modify end #############
        # Original code:
        # args.gpu = gpu

7. Initialize the process group and disable the initialization method.

Code location: main_worker() in main.py (the modified parts are delimited by the npu modify begin/end comments).

        ############## npu modify begin #############
        if args.device == 'npu':
            dist.init_process_group(backend=args.dist_backend, #init_method=args.dist_url,
                                    world_size=args.world_size, rank=args.rank)
        else:
            dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                    world_size=args.world_size, rank=args.rank)
        ############## npu modify end #############
        # Original code:
        # dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
        #                         world_size=args.world_size, rank=args.rank)

8. Distributed training is used, the mixed precision module must be introduced, and the model must be moved to the Ascend AI Processor. Therefore, disable the original code that checks whether training is distributed and whether the model is trained on a GPU.

Code location: main_worker() in main.py (the modified parts are delimited by the npu modify begin/end comments).

        # create model
        if args.pretrained:
            print("=> using pre-trained model '{}'".format(args.arch))
            model = models.__dict__[args.arch](pretrained=True)
        else:
            print("=> creating model '{}'".format(args.arch))
            model = models.__dict__[args.arch]()
        ############## npu modify begin #############
        # Add the following code.
        # Select the Ascend AI Processor as the training device.
        loc = 'npu:{}'.format(args.gpu)
        torch.npu.set_device(loc)
        # Compute the per-process batch_size and workers.
        args.batch_size = int(args.batch_size / ngpus_per_node)
        args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
        ############## npu modify end #############
        # The original code below must be disabled (commented out here).
        # if not torch.cuda.is_available():
        #     print('using CPU, this will be slow')
        # elif args.distributed:
        #     # For multiprocessing distributed, DistributedDataParallel constructor
        #     # should always set the single device scope, otherwise,
        #     # DistributedDataParallel will use all available devices.
        #     if args.gpu is not None:
        #         torch.cuda.set_device(args.gpu)
        #         model.cuda(args.gpu)
        #         # When using a single GPU per process and per
        #         # DistributedDataParallel, we need to divide the batch size
        #         # ourselves based on the total number of GPUs we have
        #         args.batch_size = int(args.batch_size / ngpus_per_node)
        #         args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
        #         model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
        #     else:
        #         model.cuda()
        #         # DistributedDataParallel will divide and allocate batch_size to all
        #         # available GPUs if device_ids are not set
        #         model = torch.nn.parallel.DistributedDataParallel(model)
        # elif args.gpu is not None:
        #     torch.cuda.set_device(args.gpu)
        #     model = model.cuda(args.gpu)
        # else:
        #     # DataParallel will divide and allocate batch_size to all available GPUs
        #     if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
        #         model.features = torch.nn.DataParallel(model.features)
        #         model.cuda()
        #     else:
        #         model = torch.nn.DataParallel(model).cuda()
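The arithmetic in items 5 to 8 (the process-to-device mapping, plus the per-process batch_size and workers) can be tried out in isolation. The sketch below repeats those formulas in a standalone helper; the function name per_process_settings and the sample values (a global batch size of 256 and 32 workers) are chosen for illustration only.

```python
def per_process_settings(batch_size, workers, device_list):
    """Apply the mapping and the per-process formulas from items 5 to 8."""
    devices = sorted(int(x) for x in device_list.split(","))
    n = len(devices)
    return {
        # Process 0 gets the smallest device ID, process 1 the next, and so on.
        "process_device_map": dict(enumerate(devices)),
        # The global batch size is split evenly across the processes.
        "batch_size": int(batch_size / n),
        # The data loading workers are split as well, rounding up.
        "workers": int((workers + n - 1) / n),
    }


settings = per_process_settings(256, 32, "0,1,2,3,4,5,6,7")
print(settings["batch_size"], settings["workers"])
print(per_process_settings(256, 10, "3,1"))
```

With eight devices, each process trains with a batch size of 32 and 4 workers. With the device list "3,1", process 0 is mapped to device 1 and process 1 to device 3.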

9. Disable the original loss function, optimizer, and checkpoint resume code. These parts are combined with mixed precision training later.

Code location: main_worker() in main.py (the modified parts are delimited by the npu modify begin/end comments).

        # The original code is disabled (commented out here).
        # # define loss function (criterion) and optimizer
        # criterion = nn.CrossEntropyLoss().cuda(args.gpu)
        #
        # optimizer = torch.optim.SGD(model.parameters(), args.lr,
        #                             momentum=args.momentum,
        #                             weight_decay=args.weight_decay)
        #
        # # optionally resume from a checkpoint
        # if args.resume:
        #     if os.path.isfile(args.resume):
        #         print("=> loading checkpoint '{}'".format(args.resume))
        #         if args.gpu is None:
        #             checkpoint = torch.load(args.resume)
        #         else:
        #             # Map model to be loaded to specified single gpu.
        #             loc = 'cuda:{}'.format(args.gpu)
        #             checkpoint = torch.load(args.resume, map_location=loc)
        #         args.start_epoch = checkpoint['epoch']
        #         best_acc1 = checkpoint['best_acc1']
        #         if args.gpu is not None:
        #             # best_acc1 may be from a checkpoint from a different GPU
        #             best_acc1 = best_acc1.to(args.gpu)
        #         model.load_state_dict(checkpoint['state_dict'])
        #         optimizer.load_state_dict(checkpoint['optimizer'])
        #         print("=> loaded checkpoint '{}' (epoch {})"
        #               .format(args.resume, checkpoint['epoch']))
        #     else:
        #         print("=> no checkpoint found at '{}'".format(args.resume))
        #
        # cudnn.benchmark = True

10. The data loader combines a dataset with a sampler and can process the dataset with multiple workers. Because training runs on the Ascend AI Processor, pin_memory must be set to False. Because only fixed-shape training is currently supported and the number of samples left at the end of the data stream may be smaller than the batch size, drop_last must be set to True. In addition, shuffle must be set to True for the validation dataset.

Code location: main_worker() in main.py (the modified parts are delimited by the npu modify begin/end comments).

        ############## npu modify begin #############
        train_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
            num_workers=args.workers, pin_memory=False, sampler=train_sampler, drop_last=True)

        val_loader = torch.utils.data.DataLoader(
            datasets.ImageFolder(valdir, transforms.Compose([
                transforms.Resize(256),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                normalize,
            ])),
            batch_size=args.batch_size, shuffle=True,
            num_workers=args.workers, pin_memory=False, drop_last=True)
        ############## npu modify end #############
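The effect of drop_last on batch shapes is simple arithmetic. The sketch below, with a hypothetical helper batches_per_epoch and illustrative numbers (1000 samples, batch size 32), shows that without drop_last the final batch is smaller than the others, which is exactly the variable shape that fixed-shape training cannot handle.

```python
def batches_per_epoch(num_samples, batch_size, drop_last):
    """Return (number of batches, size of the last batch) for one epoch."""
    full, remainder = divmod(num_samples, batch_size)
    if remainder and not drop_last:
        # The leftover samples form one extra, smaller batch.
        return full + 1, remainder
    # Either the data divides evenly or the leftover samples are dropped.
    return full, batch_size if full else 0


print(batches_per_epoch(1000, 32, drop_last=False))
print(batches_per_epoch(1000, 32, drop_last=True))
```

With 1000 samples and a batch size of 32, drop_last=False yields 32 batches whose last batch holds only 8 samples, while drop_last=True yields 31 batches of exactly 32 samples.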

11. Build the loss function and the optimizer, and move the model and the loss function to the Ascend AI Processor. Wrap the optimizer and the model with the mixed precision module to support mixed precision training, and combine the checkpoint-resume logic with the mixed precision module so that resumed training also supports mixed precision.

Code location: main_worker() in main.py, after the validation data is loaded (modified parts are shown in bold).

    val_loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(valdir, transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ])),
        batch_size=args.batch_size, shuffle=True,
        num_workers=args.workers, pin_memory=False, drop_last=True)

    ############## npu modify begin #############
    model = model.to(loc)
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().to(loc)
    optimizer = torch.optim.SGD(model.parameters(), args.lr,
                                momentum=args.momentum,
                                weight_decay=args.weight_decay)

    if args.amp:
        model, optimizer = amp.initialize(model, optimizer,
                                          opt_level=args.opt_level,
                                          loss_scale=args.loss_scale)

    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu],
                                                      broadcast_buffers=False)

    # optionally resume from a checkpoint
    if args.resume:
        if os.path.isfile(args.resume):
            print("=> loading checkpoint '{}'".format(args.resume))
            checkpoint = torch.load(args.resume, map_location=loc)
            args.start_epoch = checkpoint['epoch']
            best_acc1 = checkpoint['best_acc1']
            model.load_state_dict(checkpoint['state_dict'])
            optimizer.load_state_dict(checkpoint['optimizer'])
            if args.amp:
                amp.load_state_dict(checkpoint['amp'])
            print("=> loaded checkpoint '{}' (epoch {})"
                  .format(args.resume, checkpoint['epoch']))
        else:
            print("=> no checkpoint found at '{}'".format(args.resume))

    cudnn.benchmark = True
    ############## npu modify end #############

12. Checkpoint saving must also take mixed precision training into account. Modify the code as follows.

Code location: main_worker() in main.py (modified parts are shown in bold).

    # remember best acc@1 and save checkpoint
    is_best = acc1 > best_acc1
    best_acc1 = max(acc1, best_acc1)

    if not args.multiprocessing_distributed or (args.multiprocessing_distributed
            and args.rank % ngpus_per_node == 0):
        ############## npu modify begin #############
        if args.amp:
            save_checkpoint({
                'epoch': epoch + 1,
                'arch': args.arch,
                'state_dict': model.state_dict(),
                'best_acc1': best_acc1,
                'optimizer' : optimizer.state_dict(),
                'amp': amp.state_dict(),
            }, is_best)
        else:
            save_checkpoint({
                'epoch': epoch + 1,
                'arch': args.arch,
                'state_dict': model.state_dict(),
                'best_acc1': best_acc1,
                'optimizer' : optimizer.state_dict(),
            }, is_best)
        ############## npu modify end #############
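The two branches above differ only in the 'amp' entry, so the dictionary could also be assembled in one place. The following is a minimal sketch under that assumption; build_checkpoint is a hypothetical helper that does not appear in the sample script, and plain dictionaries stand in for the real state dictionaries so that the sketch runs without PyTorch or Apex.

```python
def build_checkpoint(epoch, arch, model_state, best_acc1, optimizer_state, amp_state=None):
    """Assemble the checkpoint dictionary; 'amp' is added only when provided."""
    checkpoint = {
        'epoch': epoch + 1,
        'arch': arch,
        'state_dict': model_state,
        'best_acc1': best_acc1,
        'optimizer': optimizer_state,
    }
    if amp_state is not None:
        checkpoint['amp'] = amp_state
    return checkpoint


# Placeholder dictionaries replace model.state_dict(), optimizer.state_dict()
# and amp.state_dict(); the values are illustrative only.
with_amp = build_checkpoint(0, 'resnet50', {'w': 1}, 76.1, {'lr': 0.1},
                            amp_state={'loss_scaler0': {}})
without_amp = build_checkpoint(0, 'resnet50', {'w': 1}, 76.1, {'lr': 0.1})
print(sorted(with_amp))
print(sorted(without_amp))
```

In main_worker() the helper would be called with amp.state_dict() if args.amp is set and None otherwise, and its result passed to save_checkpoint together with is_best.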

13. During training, the input data must be moved to the Ascend AI Processor. Modify the code as follows:

Code location: train() in main.py (modified parts are shown in bold).

    for i, (images, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)
        ############## npu modify begin #############
        loc = 'npu:{}'.format(args.gpu)
        target = target.to(torch.int32)
        images, target = images.to(loc, non_blocking=False), target.to(loc, non_blocking=False)
        ############## npu modify end #############
        # The original model code is as follows:
        # if args.gpu is not None:
        #     images = images.cuda(args.gpu, non_blocking=True)
        # if torch.cuda.is_available():
        #     target = target.cuda(args.gpu, non_blocking=True)

14. Mark the place where backpropagation (.backward()) occurs so that the mixed precision module can apply loss scaling and clear its per-iteration state. The code is as follows:

Code location: train() in main.py (modified parts are shown in bold).

    optimizer.zero_grad()
    ############## npu modify begin #############
    if args.amp:
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()
    # The original code is the commented-out line below:
    # loss.backward()
    ############## npu modify end #############
    optimizer.step()
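Why loss scaling helps can be shown with a small numeric experiment. Multiplying the loss by a factor multiplies every gradient by the same factor (chain rule), which lifts very small gradients into the range that float16 can represent. The sketch below uses only the Python standard library to round values to half precision; the gradient value and the scale of 1024 are illustrative assumptions, not values taken from the mixed precision module.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', value))[0]


grad = 1e-8       # a gradient too small to be represented in float16
scale = 1024.0    # illustrative loss scale, not a recommended setting

unscaled = to_fp16(grad)           # underflows to 0.0
scaled = to_fp16(grad * scale)     # survives as a float16 subnormal
recovered = scaled / scale         # unscaling is done in higher precision

print(unscaled, scaled, recovered)
```

Without scaling, the gradient is lost entirely; with scaling, it is recovered to within about one percent after dividing by the scale again.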

15. During validation, the validation data must be moved to the Ascend AI Processor. Modify the code as follows:

Code location: validate() in main.py (modified parts are shown in bold).

    with torch.no_grad():
        end = time.time()
        for i, (images, target) in enumerate(val_loader):
            ############## npu modify begin #############
            loc = 'npu:{}'.format(args.gpu)
            target = target.to(torch.int32)
            images, target = images.to(loc, non_blocking=False), target.to(loc, non_blocking=False)
            ############## npu modify end #############
            # The original model code is the commented-out part below:
            # if args.gpu is not None:
            #     images = images.cuda(args.gpu, non_blocking=True)
            # if torch.cuda.is_available():
            #     target = target.cuda(args.gpu, non_blocking=True)
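train() and validate() apply the same three operations to every batch, so they could share one function. The following is a minimal sketch of that idea; move_batch is a hypothetical helper that is not part of the sample script, and it relies only on the .to() method that PyTorch tensors provide.

```python
def move_batch(images, target, device_index, label_dtype):
    """Cast labels to label_dtype, then copy both tensors to 'npu:<device_index>'."""
    loc = 'npu:{}'.format(device_index)
    target = target.to(label_dtype)
    # Blocking copies, matching the modified code in train() and validate().
    return images.to(loc, non_blocking=False), target.to(loc, non_blocking=False)
```

Inside either loop the call would be images, target = move_batch(images, target, args.gpu, torch.int32).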

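8.3 Script Execution

Preparing the Dataset

Prepare the dataset and upload it to a directory in the operating environment, for example: /home/data/resnet50/imagenet

Configuring Environment Variables

Configure the environment variables as described in section 7.1, Environment Variable Configuration.

Running the Commands

For example:

Single device: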

python main.py /home/data/resnet50/imagenet --batch-size 128 --lr 0.1 --epochs 90 --arch resnet50 --world-size 1 --rank 0 --workers 40 --momentum 0.9 --weight-decay 1e-4

Distributed:

python main.py /home/data/resnet50/imagenet --addr='10.174.216.194' --seed 49 --workers 160 --lr 0.8 --print-freq 1 --arch resnet50 --dist-url 'tcp://127.0.0.1:50000' --dist-backend 'hccl' --multiprocessing-distributed --world-size 1 --batch-size 2048 --epochs 90 --rank 0 --device-list '0' --amp

Note

Set dist-backend to hccl to enable distributed training on Ascend AI devices.
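To make the role of the flags easier to follow, the sketch below parses a subset of the distributed command with argparse. It is not the parser from main.py: the types and defaults are assumptions, and treating --device-list as a comma-separated list of device IDs is also an assumption made for illustration.

```python
import argparse


def build_parser():
    """Parse a subset of the flags used above; defaults and types are assumptions."""
    parser = argparse.ArgumentParser(description='ResNet50 training launcher (sketch)')
    parser.add_argument('data', help='path to the dataset')
    parser.add_argument('--arch', default='resnet50')
    parser.add_argument('--batch-size', type=int, default=128)
    parser.add_argument('--lr', type=float, default=0.1)
    parser.add_argument('--epochs', type=int, default=90)
    parser.add_argument('--world-size', type=int, default=1)
    parser.add_argument('--rank', type=int, default=0)
    parser.add_argument('--dist-url', default='tcp://127.0.0.1:50000')
    parser.add_argument('--dist-backend', default='hccl')
    parser.add_argument('--multiprocessing-distributed', action='store_true')
    parser.add_argument('--device-list', default='0')
    parser.add_argument('--amp', action='store_true')
    return parser


args = build_parser().parse_args([
    '/home/data/resnet50/imagenet', '--lr', '0.8', '--batch-size', '2048',
    '--dist-backend', 'hccl', '--multiprocessing-distributed',
    '--device-list', '0', '--amp',
])
# Assumption: --device-list is a comma-separated list of device IDs.
device_ids = [int(d) for d in args.device_list.split(',')]
print(args.dist_backend, args.batch_size, device_ids, args.amp)
```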



9 FAQ

9.1 pip3.7 install Pillow==5.3.0 Fails

Symptom

Running pip3.7 install pillow==5.3.0 fails.

Possible Cause

Required dependencies are missing, such as libjpeg, python-devel, zlib-devel, and libjpeg-turbo-devel.

Solution

Install the dependencies with the following commands:

● CentOS/EulerOS/Tlinux/BClinux/Suse
  yum install libjpeg python-devel zlib-devel libjpeg-turbo-devel

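● Ubuntu/Debian/UOS
  apt-get install libjpeg python-devel zlib-devel libjpeg-turbo-devel
  These are RPM-style package names. If apt-get cannot locate them, use the Debian equivalents, which typically end in -dev (for example libjpeg-dev, zlib1g-dev, and python3-dev).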

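9.2 Installing "torch-*.whl" Reports That "torch 1.5.0xxxx" Does Not Match the Version Required by "torchvision"

Symptom

When "torch-*.whl" is installed, the following message is displayed: "ERROR: torchvision 0.6.0 has requirement torch==1.5.0, but you'll have torch 1.5.0a0+1977093 which is incompatible".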

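Possible Cause

When torch is installed, pip automatically checks the dependency requirements declared by the installed torchvision. The torchvision version in the environment is 0.6.0, and the check finds that the version of the torch-*.whl being installed does not match the required 1.5.0, so an error is reported. The installation nevertheless succeeds.

Solution

The message has no effect on actual results. No action is required.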
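The mismatch comes from the pre-release tag in the installed version string: 1.5.0a0+1977093 is alpha 0 of 1.5.0 with a local build label, which is not the same version as the final release 1.5.0. The sketch below illustrates this with a deliberately simplified parser; split_version and satisfies_exact are hypothetical helpers covering only a small subset of the Python version rules, not the logic pip itself uses.

```python
import re

# Simplified pattern: release digits, optional pre-release tag, optional local label.
_VERSION = re.compile(r'^(\d+(?:\.\d+)*)((?:a|b|rc)\d+)?(?:\+([A-Za-z0-9.]+))?$')


def split_version(version):
    """Split a version string into (release, pre-release, local) parts."""
    match = _VERSION.match(version)
    if match is None:
        raise ValueError('unsupported version string: {!r}'.format(version))
    release = tuple(int(part) for part in match.group(1).split('.'))
    return release, match.group(2), match.group(3)


def satisfies_exact(version, required):
    """Mimic '==required': the local label is ignored, release and pre-release must match."""
    release, pre, _local = split_version(version)
    req_release, req_pre, _ = split_version(required)
    return (release, pre) == (req_release, req_pre)


print(split_version('1.5.0a0+1977093'))
print(satisfies_exact('1.5.0a0+1977093', '1.5.0'))
print(satisfies_exact('1.5.0', '1.5.0'))
```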

A Revision History

Release Date    Description
2020-10-15      First official release.

