Release Note

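Paddle-mobile has been officially upgraded to Paddle Lite. This release focuses on stronger multi-platform and multi-hardware support, broader model coverage, and targeted performance optimization on ARM CPU and other hardware.
The core capabilities have been unified under the new Lite architecture, and subsequent optimizations and upgrades will also take place there.

The implementations under the original Paddle-mobile architecture already perform very well in many hardware scenarios such as Mali GPU and Adreno GPU, and are widely used. As a transition, the main code of the original architecture is kept for now in the mobile/ directory. It will continue to be maintained for some time, until the migration is complete.

The Metal implementation for Apple device GPUs and the front-end web module are currently relatively independent, and will continue to be developed and maintained in the metal/ and web/ directories.

The key feature upgrades in this release are as follows:

Major architecture upgrade: by adding Machine IR, a Type system, lightweight Operators and Kernels, and more, this release adds general multi-platform and multi-hardware support, mixed scheduling and execution across multiple precisions and data layouts, dynamic optimization, lightweight deployment, and other important features.
Improved the Java API, which now corresponds one-to-one with the C++ API.
Added the NaiveBuffer model storage format, which decouples mobile deployment from protobuf and makes the inference library smaller.
Supports inference with Caffe and TensorFlow models through X2Paddle. Conversion of 6 models is officially verified so far; for details, see "Converting models with X2Paddle".
Added in-depth support for the Huawei HiSilicon NPU, making this the first framework to support online compilation for the Huawei NPU; verified with the MobileNet V1 model.
Supports FPGA (ZU3, ZU5, ZU9); verified with the ResNet50 model.
Mali GPU and Adreno GPU: supports mixed scheduling of OpenCL and ARM CPU kernels. The following models have been verified: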
- MobileNetV1
- MobileNetV2
- ResNet50
ARM CPU: broader coverage of official models. The following models are newly supported and verified:
- OCR-attention model
- vgg16
- EfficientNet
- mNasNet
- UNet
- ShuffleNet V1.1
- ResNet18
- MobileNet_SSD
- mobileNet_YoloV3
- mtcnn
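Int8 quantized inference on ARM CPU: supports PaddleSlim quantization-aware training, which gives better accuracy than offline quantization. The following models are newly supported and verified: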
- MobileNetV1
- MobileNetV2
- ResNet50
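For more models (including GoogleNet, SqueezeNet, and others) and the corresponding performance data, see the benchmark.
70 new kernels were added across hardware targets. The detailed list is as follows: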
- nearest_interp: (arm/float)
- fill_constant: (arm/float)
- depthwise_conv2d: (opencl/float)
- sigmoid: (arm/float)
- argmax: (arm/float)
- calib_once: (arm/int8,fpga/fp16)
- layout: (fpga/any)
- while: (arm/float)
- crop: (arm/float)
- shape: (arm/float)
- fusion_elementwise_max_activation: (arm/float)
- slice: (arm/float)
- increment: (arm/float)
- io_copy: (fpga/any,opencl/any)
- io_copy_once: (fpga/any,opencl/any)
- sequence_expand: (arm/float)
- less_than: (arm/float)
- gru_unit: (arm/float)
- axpy: (arm/float)
- write_to_array: (arm/float)
- conv2d_transpose: (arm/float)
- scale: (fpga/fp16)
- fusion_elementwise_mul_activation: (arm/float)
- log: (arm/float)
- bilinear_interp: (arm/float)
- decode_bboxes: (arm/float)
- yolo_box: (arm/float)
- beam_search_decode: (arm/float)
- negative: (arm/float)
- relu: (opencl/float,fpga/fp16,npu/float)
- elementwise_add: (opencl/float,fpga/fp16)
- box_coder: (arm/float)
- multiclass_nms: (arm/float)
- fusion_elementwise_add_activation: (arm/float,opencl/float,fpga/fp16)
- elementwise_max: (arm/float)
- graph_op: (npu/float)
- norm: (arm/float)
- logical_xor: (arm/float)
- top_k: (arm/float)
- elementwise_mul: (arm/float)
- power: (arm/float)
- sequence_pool: (arm/float)
- lrn: (arm/float)
- leaky_relu: (arm/float)
- is_empty: (arm/float)
- fc: (opencl/float,fpga/fp16,npu/float)
- im2sequence: (arm/float)
- pool2d: (opencl/float,fpga/fp16,npu/float)
- mul: (opencl/float, npu/float)
- pad2d: (arm/float)
- read_from_array: (arm/float)
- beam_search: (arm/float)
- lod_reset: (arm/float)
- gru: (arm/float)
- prelu: (arm/float)
- layout_once: (fpga/any)
- swish: (arm/float)
- lookup_table: (arm/float)
- relu_clipped: (arm/float)
- tanh: (arm/float)
- sequence_softmax: (arm/float)
- calib: (fpga/fp16)
- cast: (arm/float)
- density_prior_box: (arm/float)
- softmax: (fpga/fp16, npu/float)
- shuffle_channel: (arm/float)
- logical_and: (arm/float)
- prior_box: (arm/float)
- reduce_max: (arm/float)
- conv2d: (opencl/float,fpga/fp16, npu/float)
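The `op: (target/precision, ...)` entries above can be read as a lookup table for mixed scheduling. The following is a minimal sketch, not Paddle Lite's actual scheduler: `parse_kernels` and `pick_kernel` are hypothetical helper names, and the fallback rule (first preferred target that has a kernel) is an assumption made for illustration. Because the list covers only newly added kernels, a missing entry here does not mean the op has no kernel on that target.

```python
import re

# A few entries copied from the kernel list above.
KERNEL_LISTING = """
- relu: (opencl/float,fpga/fp16,npu/float)
- conv2d: (opencl/float,fpga/fp16, npu/float)
- softmax: (fpga/fp16, npu/float)
- sigmoid: (arm/float)
"""


def parse_kernels(listing):
    """Map each op name to its list of (target, precision) pairs."""
    table = {}
    for line in listing.strip().splitlines():
        match = re.match(r"-\s*(\w+):\s*\((.*)\)", line.strip())
        if not match:
            continue
        op, variants = match.groups()
        table[op] = [tuple(v.strip().split("/")) for v in variants.split(",")]
    return table


def pick_kernel(table, op, preferred_targets):
    """Return the kernel on the earliest preferred target, or None (hypothetical rule)."""
    for target in preferred_targets:
        for kernel_target, precision in table.get(op, []):
            if kernel_target == target:
                return kernel_target, precision
    return None


table = parse_kernels(KERNEL_LISTING)
# Prefer OpenCL, fall back to ARM CPU when no OpenCL kernel is listed.
print(pick_kernel(table, "relu", ["opencl", "arm"]))
print(pick_kernel(table, "sigmoid", ["opencl", "arm"]))
```

With a preference order of OpenCL then ARM, `relu` resolves to the OpenCL kernel while `sigmoid` falls back to the ARM kernel, which is the kind of per-op mixing the OpenCL and ARM CPU item describes.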


02 Jul 10:28

Superjomn

1.5.0

d409d63

PaddleMobile 1.5.0

ARM CPU:
- Added a 64-bit gemm_int8 quantized implementation
- Added ops such as conv_relu, nearestinterp, leakyrelu, and density_prior_box
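To make the idea behind an int8 GEMM concrete, here is a minimal pure-Python sketch. It is not the gemm_int8 kernel from this release: the symmetric per-tensor quantization scheme and the helper names `quantize` and `gemm_int8` are assumptions chosen for illustration, and a real kernel would use NEON instructions with int32 accumulators.

```python
def quantize(matrix):
    """Symmetric per-tensor quantization to the int8 range; returns (ints, scale)."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return quantized, scale


def gemm_int8(a, b):
    """Multiply float matrices a (MxK) and b (KxN) through int8 values."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    rows, inner, cols = len(a), len(b), len(b[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            acc = 0  # integer accumulator (int32 in a real kernel)
            for k in range(inner):
                acc += qa[i][k] * qb[k][j]
            # Dequantize once per output element.
            out_row.append(acc * scale_a * scale_b)
        out.append(out_row)
    return out


a = [[1.0, -2.0], [0.5, 4.0]]
b = [[2.0, 0.0], [1.0, -1.0]]
print(gemm_int8(a, b))  # close to the float result [[0, 2], [5, -4]]
```

The inner loop works only on small integers, and the two scales are applied once per output element, which is why quantized GEMM can be both smaller and faster than the float version at a small cost in accuracy.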

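Android GPU:
- Implemented Winograd convolution; for 3 x 3 convolutions with a large amount of computation, performance improves by 2x to 30x
- Added GPU ops such as conv_relu, scale, reshape2, transpose2, split, and relu6
- When cl_kernel files are compiled, control macros are added dynamically for each op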
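The Winograd speedup comes from trading multiplications for additions. As a minimal sketch, the code below shows the 1D building block F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. The 3 x 3 case nests this transform in two dimensions. This illustrates the algorithm only, not the GPU kernel shipped in this release, and the function names are hypothetical.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over four inputs d, using 4 multiplications."""
    # Filter transform: can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by the 4 element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference: the same two outputs with 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 1.0, -2.0]
print(winograd_f23(d, g), direct_f23(d, g))
```

Since the filter transform is computed once and reused for every input tile, the saving grows with the amount of computation, which is consistent with the larger gains reported for heavier 3 x 3 convolutions.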

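Metal:
- Supports the Winograd algorithm.
- Added a memory reuse algorithm; the memory occupied by textures can be reduced by half.
- Supports mixed use of MPS and non-MPS.
- Improved framework safety (added general exception handling logic and multi-thread protection).
- Added new ops to support ShuffleNet.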
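The release note does not describe how the memory reuse algorithm works. One common approach is liveness-based buffer sharing, sketched below under that assumption: once a tensor's last consumer has run, its buffer returns to a free pool and can back a later tensor. The `plan_memory` function and its input format are hypothetical.

```python
def plan_memory(ops, sizes):
    """Assign tensors to shared buffers.

    ops:   list of (inputs, outputs) tensor-name lists, in execution order.
    sizes: size in bytes of every tensor produced by an op.
    Returns (tensor -> buffer index, list of buffer sizes).
    """
    # Record the last step at which each tensor is read or written.
    last_use = {}
    for step, (inputs, outputs) in enumerate(ops):
        for name in list(inputs) + list(outputs):
            last_use[name] = step

    assignment, buffer_sizes, free = {}, [], []
    for step, (inputs, outputs) in enumerate(ops):
        # Allocate outputs first so an op never writes into its own inputs.
        for name in outputs:
            if free:
                buf = free.pop()
                buffer_sizes[buf] = max(buffer_sizes[buf], sizes[name])
            else:
                buf = len(buffer_sizes)
                buffer_sizes.append(sizes[name])
            assignment[name] = buf
        # Release buffers of tensors that are dead after this step.
        for name in sorted(set(inputs) | set(outputs)):
            if name in assignment and last_use[name] == step:
                free.append(assignment[name])
    return assignment, buffer_sizes


# A 4-op chain: x -> a -> b -> c -> d, each intermediate tensor 100 bytes.
ops = [(["x"], ["a"]), (["a"], ["b"]), (["b"], ["c"]), (["c"], ["d"])]
sizes = {"a": 100, "b": 100, "c": 100, "d": 100}
assignment, buffers = plan_memory(ops, sizes)
print(assignment, sum(buffers))
```

For this simple chain the four tensors fit in two buffers, so the planned footprint is half of the naive one. Real networks with branches and differently sized tensors will see different savings.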

Web version:
- Built a web version of the Paddle framework, supporting networks such as MobileNet and Yolo.
- Provides full GPGPU functionality based on WebGL.

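This release introduces the new Lite architecture, which strengthens mixed scheduling across multiple hardware backends in the Paddle inference library and makes it easier to extend to new hardware. The design highlights are as follows:

Introduces a Type system to strengthen mixed scheduling across hardware targets, quantization methods, and data layouts
Isolates hardware details: with different compile switches, any supported hardware can be freely plugged in or removed
Introduces the concept of MIR (Machine IR) to strengthen optimization that is aware of the execution environment
Strictly separates the optimization phase from the execution phase, keeping inference lightweight and efficient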
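One way to picture a type system that covers target, precision, and data layout is shown below. This is a minimal sketch, not Paddle Lite's implementation: the `TensorType` tuple and the `insert_casts` function are hypothetical. The op names `io_copy`, `calib`, and `layout` do appear in the 2.0.0-beta1 kernel list earlier in these notes, but mapping each one to exactly one kind of mismatch is an assumption.

```python
from collections import namedtuple

# Hypothetical description of where and how a tensor is stored.
TensorType = namedtuple("TensorType", ["target", "precision", "layout"])


def insert_casts(produced, expected):
    """List the conversion ops needed between a producer and a consumer kernel."""
    ops = []
    if produced.target != expected.target:
        ops.append("io_copy")  # assumed: moves data between devices
    if produced.precision != expected.precision:
        ops.append("calib")    # assumed: converts precision, e.g. float to int8
    if produced.layout != expected.layout:
        ops.append("layout")   # assumed: converts data layout
    return ops


arm_float = TensorType("arm", "float", "NCHW")
fpga_fp16 = TensorType("fpga", "fp16", "NHWC")
print(insert_casts(arm_float, fpga_fp16))
print(insert_casts(arm_float, arm_float))
```

Resolving these mismatches in the optimization phase means the execution phase only has to run a fixed list of kernels, which matches the stated goal of keeping inference lightweight.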

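This release supports the following hardware:

android ARM v7, v8; internally tested models include mobilenet-v1, mobilenet-v2, resnet50, and inception-v4
armlinux supports ARM v8, with the same 4 models as above
opencl; tested with mobilenetv1
ARM v7, v8; Int8 quantized inference; tested with mobilenet-v1

For user documentation, see the lite documentation

For developer documentation, see: developer documentation

For performance figures, see: Benchmark

The code tag is: lite-v0.1

Precompiled inference libraries: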


16 Apr 13:33

hjchen2

release/1.4

8505700

PaddleMobile 1.4.0

Release Notes

ARM CPU

1. Added a direct sliding-window convolution for 3x3 kernels, which is more efficient than winograd and gemm when the number of channels is small.
2. Added an armv8 implementation of winograd, which gives higher inference performance on iOS and other v8 hardware. Winograd is also supported in operator fusion, so fused operators run more efficiently.
3. Added 19 operators, including while, sequence_expand, sequence_pool, sequence_softmax, gru_unit, beam_search, and beam_search_decode, together with extensive optimization, to support inference for attention-based end-to-end models such as NLP and OCR.
4. Refactored and optimized the matrix operation library (sgemm and sgemv), giving a speedup of 10% to more than 100% on most models.
5. Refactored and optimized depthwise convolution for 3x3 kernels: compared with previous versions, it supports arbitrary padding, performs better, and produces more reliable results.
6. Implemented an armv8 version of depthwise convolution for 5x5 kernels, improving the inference efficiency of NAS models by more than 30%.
7. Optimized col2im with NEON, improving the efficiency of the deconvolution conv2d_transpose.
8. Added a streamlined memory reuse strategy based on graph optimization, which can reduce memory usage by nearly 50% on most models. It is enabled automatically on ARM CPU; FPGA and GPU are not yet supported.

ARM GPU

1. Optimized convolution for 1x1 kernels; MobileNet v1 shows an average inference performance improvement of 35% on Qualcomm Adreno GPUs.
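For readers unfamiliar with the 3x3 depthwise convolution mentioned under ARM CPU, the following is a minimal reference sketch of the operation with arbitrary zero padding. It shows only the arithmetic (one 3x3 filter per channel, stride 1), not the optimized NEON kernel; the function name and the nested-list tensor format are assumptions for illustration.

```python
def depthwise_conv3x3(x, w, pad):
    """Depthwise 3x3 convolution, stride 1, zero padding `pad` on every side.

    x: input as [C][H][W] nested lists; w: filters as [C][3][3] nested lists.
    """
    out = []
    for channel, kernel in zip(x, w):  # each channel uses its own filter
        height, width = len(channel), len(channel[0])
        out_h, out_w = height + 2 * pad - 2, width + 2 * pad - 2
        plane = []
        for oy in range(out_h):
            row = []
            for ox in range(out_w):
                acc = 0.0
                for ky in range(3):
                    for kx in range(3):
                        iy, ix = oy + ky - pad, ox + kx - pad
                        # Positions outside the input contribute zero.
                        if 0 <= iy < height and 0 <= ix < width:
                            acc += channel[iy][ix] * kernel[ky][kx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


ones = [[[1.0] * 3 for _ in range(3)]]  # one channel, 3x3 input of ones
print(depthwise_conv3x3(ones, ones, pad=1)[0])
```

Handling the padding with a bounds check, as here, works for any padding value; an optimized kernel would instead split the image into border and interior regions so that the interior loop has no branches.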
