Deployment Practice Guide for a Multi-Batch ResNet18 Model with Pyramid Input

The overall PTQ workflow of the Horizon OpenExplorer toolchain consists of several stages: model optimization, model calibration, conversion to a fixed-point model, model compilation, and on-board deployment. Using a multi-batch classification model with Pyramid input based on the public ResNet18 (targeting the S100 computing platform) as an example, this section demonstrates the deployment practice step by step for your reference.

Preparing the Floating-Point Model

First, prepare the ResNet18 floating-point model. Here we use torchvision to export the required floating-point model.

prepare_model.py
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)
input_shape = (1, 3, 224, 224)
input_data = torch.randn(input_shape)
output_path = "resnet18.onnx"
torch.onnx.export(model,
                  input_data,
                  output_path,
                  input_names=["input"],
                  output_names=["output"],
                  opset_version=10)
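Optionally, before moving on, you can sanity-check the exported ONNX model with onnxruntime. This is a minimal sketch, assuming onnxruntime is installed; it is not part of the toolchain flow itself:

import numpy as np
import onnxruntime as ort

# Run the exported model once on random data and check the output shape
session = ort.InferenceSession("resnet18.onnx",
                               providers=["CPUExecutionProvider"])
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": dummy})
print(outputs[0].shape)  # Expected: (1, 1000)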

Preparing the Calibration Dataset

For details about the public ResNet18 model, refer to the description of ResNet18 in the PyTorch documentation. As documented there, the data preprocessing pipeline of the ResNet18 model is:

  1. Resize the short side of the image to 256.
  2. Center-crop the image to 224x224.
  3. Normalize the data, with mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225].

A data preprocessing code example is shown below:

data_preprocess.py
import os

import numpy as np
import PIL
from PIL import Image

ori_dataset_dir = "./calibration_data/imagenet"
calibration_dir = "./calibration_data_rgb"


def resize_transformer(image_data: np.array, short_size: int):
    image = Image.fromarray(image_data.astype('uint8'), 'RGB')
    # PIL size is (width, height)
    w, h = image.size
    if (w <= h and w == short_size) or (h <= w and h == short_size):
        return np.array(image)
    if w < h:
        # I.e., the width of the image is the short side
        resize_size = (short_size, int(short_size * h / w))
    else:
        # I.e., the height of the image is the short side
        resize_size = (int(short_size * w / h), short_size)
    # Resize the image
    data = np.array(image.resize(resize_size, Image.BILINEAR))
    return data


def center_crop_transformer(image_data: np.array, crop_size: int):
    image = Image.fromarray(image_data.astype('uint8'), 'RGB')
    image_width, image_height = image.size
    crop_height, crop_width = (crop_size, crop_size)
    crop_top = int(round((image_height - crop_height) / 2.))
    crop_left = int(round((image_width - crop_width) / 2.))
    image_data = image.crop(
        (crop_left, crop_top, crop_left + crop_width, crop_top + crop_height))
    return np.array(image_data).astype(np.float32)


os.makedirs(calibration_dir, exist_ok=True)
for image_name in os.listdir(ori_dataset_dir):
    image_path = os.path.join(ori_dataset_dir, image_name)
    # Load the image with PIL
    pil_image_data = PIL.Image.open(image_path).convert('RGB')
    image_data = np.array(pil_image_data).astype(np.uint8)
    # Resize the image
    image_data = resize_transformer(image_data, 256)
    # Crop the image
    image_data = center_crop_transformer(image_data, 224)
    # Adjust the data range from [0, 255] to [0, 1]
    image_data = image_data * (1 / 255)
    # Normalization: (data - mean) / std
    mean = [0.485, 0.456, 0.406]
    image_data = image_data - mean
    std = [0.229, 0.224, 0.225]
    image_data = image_data / std
    # Convert format from HWC to CHW
    image_data = np.transpose(image_data, (2, 0, 1)).astype(np.float32)
    # Convert format from CHW to NCHW
    image_data = image_data[np.newaxis, :]
    # Save the npy file (strip the ".JPEG" suffix)
    cali_file_path = os.path.join(calibration_dir, image_name[:-5] + ".npy")
    np.save(cali_file_path, image_data)

To support PTQ model calibration, we need a small dataset taken from the ImageNet dataset; here we use the first 100 images as an example:

./imagenet
├── ILSVRC2012_val_00000001.JPEG
├── ILSVRC2012_val_00000002.JPEG
├── ILSVRC2012_val_00000003.JPEG
├── ......
├── ILSVRC2012_val_00000099.JPEG
└── ILSVRC2012_val_00000100.JPEG

The calibration dataset directory generated by the preprocessing code above is structured as follows:

./calibration_data_rgb
├── ILSVRC2012_val_00000001.npy
├── ILSVRC2012_val_00000002.npy
├── ILSVRC2012_val_00000003.npy
├── ......
├── ILSVRC2012_val_00000099.npy
└── ILSVRC2012_val_00000100.npy
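To confirm the generated files are valid calibration inputs, you can run a quick sanity check. This is a minimal sketch assuming the directory layout above:

import numpy as np

data = np.load("./calibration_data_rgb/ILSVRC2012_val_00000001.npy")
# One image per file, NCHW layout, normalized float32 data
assert data.shape == (1, 3, 224, 224)
assert data.dtype == np.float32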

Generating the On-Board Model

The PTQ conversion pipeline supports two ways to quantize and compile a model into an on-board model: the command-line tools and the PTQ APIs. Both are described below.

Command-Line Tool Approach

To use the command-line tools, you only need to install horizon_tc_ui (preinstalled in the Docker environment) and create a yaml file matching your model configuration. Here we show and explain the yaml file (config.yaml) for the multi-batch ResNet18 model with Pyramid input.

config.yaml
model_parameters:
  onnx_model: 'resnet18.onnx'
  march: "nash-e"
  working_dir: 'model_output'
  output_model_file_prefix: 'resnet18_224x224_nv12'
input_parameters:
  input_name: ''
  input_shape: ''
  input_type_rt: 'nv12'
  input_type_train: 'rgb'
  input_layout_train: 'NCHW'
  # Formula with [0.485 * 255, 0.456 * 255, 0.406 * 255]
  mean_value: "123.675 116.28 103.53"
  # Formula with [1 / (0.229*255), 1 / (0.224*255), 1 / (0.225*255)]
  scale_value: "0.01712475 0.017507 0.01742919"
  input_batch: 8
  separate_batch: True
calibration_parameters:
  cal_data_dir: './calibration_data_rgb'
compiler_parameters:
  optimize_level: 'O2'
Note

Here input_name and input_shape are left empty because, for a single-input model without dynamic shapes, the tool fills in both parameters automatically (i.e., it parses the ONNX model internally and retrieves the input name and shape).
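For reference, the mean_value and scale_value entries fold the [0, 1]-domain normalization into the [0, 255] data domain used by the board-side NV12 input. A minimal sketch of the arithmetic, matching the comments in config.yaml:

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
# Move the mean into the [0, 255] domain, and fold the 1/255 scaling into std
mean_value = [m * 255 for m in mean]        # [123.675, 116.28, 103.53]
scale_value = [1 / (s * 255) for s in std]  # [0.01712475, 0.017507, 0.01742919]
print(mean_value, scale_value)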

Once the yaml file is ready, simply run the hb_compile tool. The command and its key log output are shown below:

[horizon@xxx xxx]$ hb_compile -c config.yaml
INFO Start hb_compile...
INFO Start verifying yaml
INFO End verifying yaml
INFO Start to Horizon NN Model Convert.
INFO Start to prepare the onnx model.
INFO End to prepare the onnx model.
INFO Start to optimize the onnx model.
INFO End to optimize the onnx model.
INFO Start to calibrate the model.
INFO End to calibrate the model.
INFO Start to precompile the model.
INFO End to precompile the model.
INFO End to Horizon NN Model Convert.
INFO Successful covert model: /xxx/resnet18_224x224_nv12_quantized_model.bc
[==================================================]100%
INFO ############# Model input/output info #############
INFO NAME      TYPE    SHAPE             DATA_TYPE
INFO --------  ------  ----------------  ---------
INFO input_y   input   [1, 224, 224, 1]  UINT8
INFO input_uv  input   [1, 112, 112, 2]  UINT8
INFO output    output  [1, 1000]         FLOAT32
INFO The hb_compile completes running

After the command finishes, the directory configured by the yaml working_dir parameter (model_output) contains the intermediate models of each stage, the final on-board model, and the model information files, as shown below. resnet18_224x224_nv12.hbm is the model file that can run inference on the board:

./model_output
├── ...
├── resnet18_224x224_nv12_calibrated_model.onnx
├── resnet18_224x224_nv12.hbm
├── resnet18_224x224_nv12_optimized_float_model.onnx
├── resnet18_224x224_nv12_original_float_model.onnx
├── resnet18_224x224_nv12_ptq_model.onnx
└── resnet18_224x224_nv12_quantized_model.bc

PTQ API Approach

The command-line tools trade some flexibility for ease of use. When you need more flexibility, you can quantize and compile the model through the PTQ APIs instead. The following walks through generating an on-board model with the APIs.

Attention

Note that because some interfaces have many parameters, the examples below only configure the parameters necessary for this end-to-end practice. For the full parameter list of each interface, refer to the HMCT API Reference and the HBDK Tool API Reference.

Model Optimization and Calibration

First, apply graph optimization and calibration quantization to the floating-point model. This step uses the HMCT API; a concrete example:

calibration.py
import os
import logging

import numpy as np
from hmct.api import build_model

logging.basicConfig(level=logging.INFO)

march = "nash"
onnx_path = "./resnet18.onnx"
cali_data_dir = "./calibration_data_rgb"
model_name = "resnet18_224x224_nv12"
working_dir = "./model_output/"

cali_data = []
for cali_data_name in os.listdir(cali_data_dir):
    data_path = os.path.join(cali_data_dir, cali_data_name)
    cali_data.append(np.load(data_path))

ptq_params = {
    'cali_dict': {
        'calibration_data': {
            'input': cali_data
        }
    },
    'input_dict': {
        'input': {
            'input_batch': 8
        }
    },
    'debug_methods': [],
    'output_nodes': []
}

if not os.path.exists(working_dir):
    os.mkdir(working_dir)
build_model(onnx_file=onnx_path,
            march=march,
            name_prefix=working_dir + model_name,
            **ptq_params)

After build_model runs successfully, the working_dir directory contains the ONNX models of each stage, structured as follows:

./model_output
├── resnet18_224x224_nv12_calibrated_model.onnx
├── resnet18_224x224_nv12_optimized_float_model.onnx
├── resnet18_224x224_nv12_original_float_model.onnx
├── resnet18_224x224_nv12_ptq_model.onnx
└── resnet18_224x224_nv12_quant_info.json

Here, the *ptq_model.onnx file is the ONNX model that has been through graph optimization and calibration. For details about the intermediate ONNX models, refer to the chapter Post-Training Quantization (PTQ) - PTQ Conversion Steps - Model Quantization and Compilation - Interpreting Conversion Artifacts.
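If you want to quickly inspect the inputs of the generated *_ptq_model.onnx, a minimal sketch using the standard onnx package follows. Note the model may contain toolchain-specific operators, so it is not meant to run in a generic ONNX runtime:

import onnx

model = onnx.load("./model_output/resnet18_224x224_nv12_ptq_model.onnx")
# Print each graph input's name and static shape
for inp in model.graph.input:
    dims = [d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)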

Fixed-Point Conversion and Compilation

Next, convert the PTQ model to a fixed-point model and compile it. This step goes through the compiler's APIs; example:

compile.py
import os

import onnx
from hbdk4.compiler.onnx import export
from hbdk4.compiler import convert, compile

input_batch = 8
march = "nash-e"
working_dir = "./model_output/"
model_name = "resnet18_224x224_nv12"
ptq_onnx_path = "./model_output/resnet18_224x224_nv12_ptq_model.onnx"

if not os.path.exists(working_dir):
    os.mkdir(working_dir)

# Load the onnx model
ptq_onnx = onnx.load(ptq_onnx_path)
# Convert the onnx model to an hbir model
ptq_model = export(proto=ptq_onnx, name=model_name)

func = ptq_model.functions[0]
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
# Split the batch-8 input into separate per-batch inputs
func.inputs[0].insert_split(dim=0)
for i in range(input_batch - 1, -1, -1):
    # Convert format from NCHW to NHWC
    func.inputs[i].insert_transpose([0, 3, 1, 2])
    # Insert a node for color conversion and normalization
    func.inputs[i].insert_image_preprocess(mode="yuvbt601full2rgb",
                                           divisor=255,
                                           mean=mean,
                                           std=std,
                                           is_signed=True)
    # Insert a node for conversion from nv12 to yuv444
    func.inputs[i].insert_image_convert(mode="nv12")

# Convert the model type from float to int
quantized_model = convert(m=ptq_model, march=march)
compile(m=quantized_model,
        path=working_dir + model_name + ".hbm",
        march=march,
        opt=0,
        progress_bar=True)

After compilation, the working_dir directory holds the intermediate files and the final model that can run on the board, structured as follows:

./model_output
├── resnet18_224x224_nv12_calibrated_model.onnx
├── resnet18_224x224_nv12.hbm
├── resnet18_224x224_nv12_optimized_float_model.onnx
├── resnet18_224x224_nv12_original_float_model.onnx
├── resnet18_224x224_nv12_ptq_model.onnx
└── resnet18_224x224_nv12_quant_info.json

Visualization

After generating the hbm model, you can inspect it with the hb_model_info and hrt_model_exec tools. Reference commands:

  • hb_model_info tool
hb_model_info -v resnet18_224x224_nv12.hbm
  • hrt_model_exec tool
hrt_model_exec model_info --model_file resnet18_224x224_nv12.hbm

Building the On-Board Example

  1. Prepare the dependency libraries required by the on-board example

To build the on-board example as quickly as possible, we recommend directly using the contents of the samples/ucp_tutorial/deps_aarch64 directory in the OE package as the dependency libraries. The key header files and shared libraries the on-board example depends on are located as follows:

./deps_aarch64
├── ......
└── ucp
    ├── include
    │   └── hobot
    │       ├── dnn
    │       │   ├── hb_dnn.h
    │       │   ├── hb_dnn_status.h
    │       │   └── hb_dnn_v1.h
    │       ├── ......
    │       ├── hb_sys.h
    │       ├── hb_ucp.h
    │       ├── hb_ucp_status.h
    │       └── hb_ucp_sys.h
    └── lib
        ├── ......
        ├── libdnn.so
        └── libhbucp.so
  2. Develop the on-board example

The example below uses binary input files and the on-board model to run one on-board inference and retrieve the TOP1 classification result.

main.cc
#include <cstring>
#include <fstream>
#include <iostream>
#include <vector>

#include "hobot/dnn/hb_dnn.h"
#include "hobot/hb_ucp.h"
#include "hobot/hb_ucp_sys.h"

#define ALIGN(value, alignment) (((value) + ((alignment)-1)) & ~((alignment)-1))
#define ALIGN_32(value) ALIGN(value, 32)

const char *hbm_path = "resnet18_224x224_nv12.hbm";
std::string data_y_path = "ILSVRC2012_val_00000001_y.bin";
std::string data_uv_path = "ILSVRC2012_val_00000001_uv.bin";
int input_batch = 8;

// Read a binary input file
int read_binary_file(std::string file_path, char **bin, int *length) {
  std::ifstream ifs(file_path, std::ios::in | std::ios::binary);
  ifs.seekg(0, std::ios::end);
  *length = ifs.tellg();
  ifs.seekg(0, std::ios::beg);
  *bin = new char[sizeof(char) * (*length)];
  ifs.read(*bin, *length);
  ifs.close();
  return 0;
}

// Prepare input and output tensors
int prepare_tensor(hbDNNTensor *input_tensor, hbDNNTensor *output_tensor,
                   hbDNNHandle_t dnn_handle);

int main() {
  // Get the model handle
  hbDNNPackedHandle_t packed_dnn_handle;
  hbDNNHandle_t dnn_handle;
  hbDNNInitializeFromFiles(&packed_dnn_handle, &hbm_path, 1);
  const char **model_name_list;
  int model_count = 0;
  hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle);
  hbDNNGetModelHandle(&dnn_handle, packed_dnn_handle, model_name_list[0]);

  // Prepare input and output tensors
  std::vector<hbDNNTensor> input_tensors;
  std::vector<hbDNNTensor> output_tensors;
  int input_count = 0;
  int output_count = 0;
  hbDNNGetInputCount(&input_count, dnn_handle);
  hbDNNGetOutputCount(&output_count, dnn_handle);
  input_tensors.resize(input_count);
  output_tensors.resize(output_count);
  // Initialize the tensors and allocate their memory
  prepare_tensor(input_tensors.data(), output_tensors.data(), dnn_handle);

  // Copy the binary input data to the input tensors
  int32_t data_length = 0;
  char *y_data = nullptr;
  read_binary_file(data_y_path, &y_data, &data_length);
  char *uv_data = nullptr;
  read_binary_file(data_uv_path, &uv_data, &data_length);
  for (auto i = 0; i < input_batch; i++) {
    memcpy(reinterpret_cast<char *>(input_tensors[i * 2].sysMem.virAddr),
           y_data, input_tensors[i * 2].sysMem.memSize);
    hbUCPMemFlush(&(input_tensors[i * 2].sysMem), HB_SYS_MEM_CACHE_CLEAN);
    memcpy(reinterpret_cast<char *>(input_tensors[i * 2 + 1].sysMem.virAddr),
           uv_data, input_tensors[i * 2 + 1].sysMem.memSize);
    hbUCPMemFlush(&(input_tensors[i * 2 + 1].sysMem), HB_SYS_MEM_CACHE_CLEAN);
  }
  delete[] y_data;
  delete[] uv_data;

  // Submit the task and wait until it completes
  hbUCPTaskHandle_t task_handle{nullptr};
  hbDNNTensor *output = output_tensors.data();
  // Generate the task handle
  hbDNNInferV2(&task_handle, output, input_tensors.data(), dnn_handle);
  // Submit the task
  hbUCPSchedParam ctrl_param;
  HB_UCP_INITIALIZE_SCHED_PARAM(&ctrl_param);
  ctrl_param.backend = HB_UCP_BPU_CORE_ANY;
  hbUCPSubmitTask(task_handle, &ctrl_param);
  // Wait until the task completes
  hbUCPWaitTaskDone(task_handle, 0);

  // Parse the inference result and compute TOP1
  hbUCPMemFlush(&output_tensors[0].sysMem, HB_SYS_MEM_CACHE_INVALIDATE);
  auto result = reinterpret_cast<float *>(output_tensors[0].sysMem.virAddr);
  for (auto batch = 0; batch < input_batch; batch++) {
    float max_score = 0.0;
    int label = -1;
    // Find the max score and the corresponding label
    for (auto i = 0; i < 1000; i++) {
      float score = result[batch * 1000 + i];
      if (score > max_score) {
        label = i;
        max_score = score;
      }
    }
    // Output the result
    std::cout << "batch[" << batch << "] " << "label: " << label << std::endl;
  }
  hbUCPReleaseTask(task_handle);

  // Free the input memory
  for (int i = 0; i < input_count; i++) {
    hbUCPFree(&(input_tensors[i].sysMem));
  }
  // Free the output memory
  for (int i = 0; i < output_count; i++) {
    hbUCPFree(&(output_tensors[i].sysMem));
  }
  // Release the model
  hbDNNRelease(packed_dnn_handle);
}

// Prepare input and output tensors
int prepare_tensor(hbDNNTensor *input_tensor, hbDNNTensor *output_tensor,
                   hbDNNHandle_t dnn_handle) {
  // Get the input and output tensor counts
  int input_count = 0;
  int output_count = 0;
  hbDNNGetInputCount(&input_count, dnn_handle);
  hbDNNGetOutputCount(&output_count, dnn_handle);

  hbDNNTensor *input = input_tensor;
  // Get the properties of each input tensor
  for (int i = 0; i < input_count; i++) {
    hbDNNGetInputTensorProperties(&input[i].properties, dnn_handle, i);
    // Calculate the stride of the input tensor
    auto dim_len = input[i].properties.validShape.numDimensions;
    for (int32_t dim_i = dim_len - 1; dim_i >= 0; --dim_i) {
      if (input[i].properties.stride[dim_i] == -1) {
        auto cur_stride =
            input[i].properties.stride[dim_i + 1] *
            input[i].properties.validShape.dimensionSize[dim_i + 1];
        input[i].properties.stride[dim_i] = ALIGN_32(cur_stride);
      }
    }
    // Calculate the memory size of the input tensor and allocate cached memory
    int input_memSize = input[i].properties.stride[0] *
                        input[i].properties.validShape.dimensionSize[0];
    hbUCPMallocCached(&input[i].sysMem, input_memSize, 0);
  }

  hbDNNTensor *output = output_tensor;
  // Get the properties of each output tensor
  for (int i = 0; i < output_count; i++) {
    hbDNNGetOutputTensorProperties(&output[i].properties, dnn_handle, i);
    // Calculate the memory size of the output tensor and allocate cached memory
    int output_memSize = output[i].properties.alignedByteSize;
    hbUCPMallocCached(&output[i].sysMem, output_memSize, 0);
    // Show how to get the output name
    const char *output_name;
    hbDNNGetOutputName(&output_name, dnn_handle, i);
  }
  return 0;
}
  3. Cross-compile to generate the on-board executable

Before cross-compiling, prepare CMakeLists.txt and the example source file. The content of CMakeLists.txt is shown below. Since the example contains no data preprocessing, it has few dependencies; the file mainly configures the GCC compilation flags and the required headers and shared libraries. Here dnn is the on-board inference library, and hbucp is used for tensor operations.

CMakeLists.txt
# CMakeLists.txt
cmake_minimum_required(VERSION 3.0)
project(sample)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wl,-unresolved-symbols=ignore-in-shared-libs")

message(STATUS "Build type: ${CMAKE_BUILD_TYPE}")
set(CMAKE_CXX_FLAGS_DEBUG "-g -O0")
set(CMAKE_C_FLAGS_DEBUG "-g -O0")
set(CMAKE_CXX_FLAGS_RELEASE " -O3 ")
set(CMAKE_C_FLAGS_RELEASE " -O3 ")
set(CMAKE_BUILD_TYPE ${build_type})

set(DEPS_ROOT ${CMAKE_CURRENT_SOURCE_DIR}/deps_aarch64)
include_directories(${DEPS_ROOT}/ucp/include)
link_directories(${DEPS_ROOT}/ucp/lib)

add_executable(run_sample src/main.cc)
target_link_libraries(run_sample dnn hbucp)

The directory structure of the build environment is as follows:

.
├── CMakeLists.txt
├── deps_aarch64
│   └── ucp
│       ├── include
│       └── lib
└── src
    └── main.cc

Once the example file and CMakeLists.txt are ready, you can run the build. An example build command is shown below:

Attention

Note that CC and CXX in the build script must be set to the actual paths of your cross-compilation GCC and G++.

#!/usr/bin/env bash
# Note: please configure according to the actual path
export CC=/arm-gnu-toolchain-12.2.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
export CXX=/arm-gnu-toolchain-12.2.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-g++
rm -rf arm_build; mkdir arm_build; cd arm_build
cmake ..; make -j8
cd ..

After the build completes, the run_sample binary that can run on the board is generated. This concludes the on-board example build.

Preparing for On-Board Execution

With the executable built, the model input must be prepared. To keep the operational and dependency-setup cost of this practice low, we process the data here with Python; you may also implement the same logic in C++ inside the on-board example (make sure the data processing logic is identical). Example:

input_data.py
import numpy as np
import PIL
from PIL import Image

image_path = "./ILSVRC2012_val_00000001.JPEG"


def resize_transformer(image_data: np.array, short_size: int):
    image = Image.fromarray(image_data.astype('uint8'), 'RGB')
    # PIL size is (width, height)
    w, h = image.size
    if (w <= h and w == short_size) or (h <= w and h == short_size):
        return np.array(image)
    if w < h:
        # I.e., the width of the image is the short side
        resize_size = (short_size, int(short_size * h / w))
    else:
        # I.e., the height of the image is the short side
        resize_size = (int(short_size * w / h), short_size)
    # Resize the image
    data = np.array(image.resize(resize_size, Image.BILINEAR))
    return data


def center_crop_transformer(image_data: np.array, crop_size: int):
    image = Image.fromarray(image_data.astype('uint8'), 'RGB')
    image_width, image_height = image.size
    crop_height, crop_width = (crop_size, crop_size)
    crop_top = int(round((image_height - crop_height) / 2.))
    crop_left = int(round((image_width - crop_width) / 2.))
    image_data = image.crop(
        (crop_left, crop_top, crop_left + crop_width, crop_top + crop_height))
    return np.array(image_data).astype(np.float32)


def rgb_to_nv12(image_data: np.array):
    r = image_data[:, :, 0]
    g = image_data[:, :, 1]
    b = image_data[:, :, 2]
    # BT.601 full-range RGB -> YUV, with U and V subsampled 2x2
    y = (0.299 * r + 0.587 * g + 0.114 * b)
    u = (-0.169 * r - 0.331 * g + 0.5 * b + 128)[::2, ::2]
    v = (0.5 * r - 0.419 * g - 0.081 * b + 128)[::2, ::2]
    # Interleave U and V into the semi-planar UV plane
    uv = np.zeros(shape=(u.shape[0], u.shape[1] * 2))
    for i in range(0, u.shape[0]):
        for j in range(0, u.shape[1]):
            uv[i, 2 * j] = u[i, j]
            uv[i, 2 * j + 1] = v[i, j]
    y = y.astype(np.uint8)
    uv = uv.astype(np.uint8)
    return y, uv


if __name__ == '__main__':
    # Load the image with PIL
    pil_image_data = PIL.Image.open(image_path).convert('RGB')
    image_data = np.array(pil_image_data).astype(np.uint8)
    # Resize the image
    image_data = resize_transformer(image_data, 256)
    # Crop the image
    image_data = center_crop_transformer(image_data, 224)
    # Convert format from RGB to nv12
    y, uv = rgb_to_nv12(image_data)
    y.tofile("ILSVRC2012_val_00000001_y.bin")
    uv.tofile("ILSVRC2012_val_00000001_uv.bin")
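The element-wise loop in rgb_to_nv12 is easy to read but slow for larger images. As a design note, the UV interleaving can be vectorized with NumPy; a minimal equivalent sketch:

import numpy as np

def interleave_uv(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    # Stack U and V on a new last axis, then flatten each row so that
    # [u0, v0, u1, v1, ...] matches the NV12 semi-planar UV layout
    return np.stack([u, v], axis=-1).reshape(u.shape[0], u.shape[1] * 2)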

After the model input data is prepared and the binary input files for on-board inference are correctly generated, make sure you now have the following ready:

  • An S100 development board, for actually running the on-board program.

  • A model that can run inference on the board (*.hbm), i.e., the artifact of Generating the On-Board Model.

  • The on-board program (the main.cc file and the cross-compiled on-board executable), i.e., the artifact of Building the On-Board Example.

  • The dependency libraries of the on-board program. To reduce deployment cost, you can directly use the contents of the samples/ucp_tutorial/deps_aarch64/ucp/lib folder of the OE package.

Once everything is ready, gather the model file (*.hbm), the input data (*.bin files), the on-board program, and the dependency libraries into one folder. A reference directory structure:

horizon
├── ILSVRC2012_val_00000001_uv.bin
├── ILSVRC2012_val_00000001_y.bin
├── lib
├── resnet18_224x224_nv12.hbm
└── run_sample

Copy this folder to the board environment. Reference command:

scp -r horizon/ root@{board_ip}:/map/

Running on the Board

Finally, configure LD_LIBRARY_PATH and run the program, as shown below:

horizon@hobot:/map/horizon# export LD_LIBRARY_PATH=./lib:$LD_LIBRARY_PATH
horizon@hobot:/map/horizon# ./run_sample
......
label: 65

As shown, the printed label: 65 matches the label of the ILSVRC2012_val_00000001 image in the ImageNet dataset, i.e., the classification result is correct.
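If you want to map the class index to a human-readable name, a minimal sketch follows, assuming you have downloaded a standard 1000-class label file such as imagenet_classes.txt from the pytorch/hub repository:

# Load the 1000 ImageNet class names, one per line, and look up index 65
with open("imagenet_classes.txt") as f:
    classes = [line.strip() for line in f]
print(classes[65])  # The class name corresponding to label 65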

This concludes the end-to-end deployment practice for the multi-batch ResNet18 model with Pyramid input.