This project tackles small-object traffic-light recognition from vehicle-mounted cameras in urban autonomous driving. A PP-YOLOE-M detector is trained with the PaddleDetection toolkit on an A100, using a dataset covering four classes: red, yellow, green, and unlit. With TensorRT acceleration and NMS folded into the model for end-to-end inference, the FP16 engine reaches 330 FPS on an A100, a large improvement over the official figures. The same technique applies to the whole PP-YOLOE family.

Traffic-light recognition is essential for fully autonomous driving in urban areas. Seen from a vehicle-mounted camera, traffic lights are small objects, which makes the problem quite challenging. This project trains an object detector for traffic signals in common scenes using PaddleDetection, PaddlePaddle's object-detection toolkit. With real-world deployment in mind, it also walks through a TensorRT-based model acceleration pipeline that delivers a large speedup over the official benchmark figures. The technique applies to every model in the PP-YOLOE family, and we hope this write-up encourages wider adoption of this high-performance model. [This project targets the A100 environment only.]
The dataset layout is as follows:
```
trafficlight/
├── code
│   ├── json_merge.py   # merge two COCO json files
│   ├── json_split.py   # split one json into two parts by a given ratio
├── img_without_anno
│   ├── 45 traffic-light images without annotations
├── JPEGImages
│   ├── 2226 traffic-light images (annotated via train.json / eval.json)
├── train.json
├── eval.json
```
Training set: {'4': 1304, '5': 108, '6': 1744, '7': 3174}; validation set: {'4': 312, '5': 31, '6': 469, '7': 748}, where 4 = red light, 5 = yellow light, 6 = green light, 7 = unlit.
```python
# Unzip the dataset
!unzip -oq /home/aistudio/data/data153372/trafficlight.zip -d /home/aistudio/data/data153372
```
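With the archive unpacked, the per-class counts quoted above can be reproduced as a quick sanity check; a minimal sketch, assuming the annotation paths above:

```python
import json
from collections import Counter

def class_distribution(anno_file):
    # Count annotations per category_id in a COCO-format json
    with open(anno_file) as f:
        coco = json.load(f)
    return Counter(str(a['category_id']) for a in coco['annotations'])

print(class_distribution('data/data153372/trafficlight/train.json'))
print(class_distribution('data/data153372/trafficlight/eval.json'))
```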
```python
import json
import os
import random
import matplotlib.pyplot as plt
import cv2
from PIL import Image

num_image = 20
json_path = 'data/data153372/trafficlight/train.json'
img_path = 'data/data153372/trafficlight/JPEGImages'
with open(json_path) as annos:
    annotation_json = json.load(annos)

# Number of top-level keys in the json file
print('the annotation_json num_key is:', len(annotation_json))
# The top-level keys themselves
print('the annotation_json key is:', annotation_json.keys())
# Number of images listed in the json file
print('the annotation_json num_images is:', len(annotation_json['images']))

# Collect all categories
categories = annotation_json['categories']
categories_dict = {c['id']: c['name'] for c in categories}
class_nums = len(categories_dict.keys())
print('class name', categories_dict.keys())

# One random color per class
color = [(random.randint(0, 255), random.randint(0, 255), random.randint(0, 255))
         for _ in range(class_nums)]

# Read one image
image_name = annotation_json['images'][num_image - 1]['file_name']
img_id = annotation_json['images'][num_image - 1]['id']
image_path = os.path.join(img_path, str(image_name).zfill(5))
image = cv2.imread(image_path, 1)

annotations = annotation_json['annotations']
num_bbox = 0
for anno in annotations:
    if anno['image_id'] == img_id:
        num_bbox = num_bbox + 1
        class_id = anno['category_id']
        class_name = categories_dict[class_id]
        class_color = random.choice(color)
        # Draw the bounding box
        x, y, w, h = list(map(int, anno['bbox']))
        cv2.rectangle(image, (int(x), int(y)),
                      (int(x + w), int(y + h)),
                      class_color, 2)
        # Draw the label text
        font_size = 0.7
        txt_size = cv2.getTextSize(class_name, cv2.FONT_HERSHEY_SIMPLEX,
                                   font_size, 1)[0]
        cv2.rectangle(image, (x, y + 1),
                      (x + txt_size[0] + 10, y - int(2 * txt_size[1])),
                      class_color, -1)
        cv2.putText(image, class_name, (x + 5, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    font_size, (255, 255, 255), 1)

print('The num_bbox of the display image is:', num_bbox)
```

```
the annotation_json num_key is: 3
the annotation_json key is: dict_keys(['images', 'annotations', 'categories'])
the annotation_json num_images is: 1781
class name dict_keys([4, 5, 6, 7])
The num_bbox of the display image is: 3
```

```python
Image.fromarray(image[:, :, ::-1])  # BGR -> RGB for display
```

```
<PIL.Image.Image image mode=RGB size=1920x1080 at 0x7F4B4457AC50>
```
### Introduction to PaddleDetection

PaddleDetection is PaddlePaddle's end-to-end object-detection toolkit. It ships 30+ model algorithms and 250+ pretrained models covering object detection, instance segmentation, tracking, and keypoint detection, including high-accuracy and lightweight industrial SOTA models for server and mobile, competition-winning solutions, and cutting-edge academic algorithms. It provides configurable network components, a dozen-plus data-augmentation strategies, loss functions, and other advanced optimization support, plus multiple deployment options. By covering the full pipeline of data processing, model development, training, compression, and deployment, together with rich examples and tutorials, it speeds up bringing detection algorithms into production.
```python
# Clone the PaddleDetection repository
!git clone https://gitee.com/paddlepaddle/PaddleDetection.git
```

```
fatal: destination path 'PaddleDetection' already exists and is not an empty directory.
```
### Configuring PaddleDetection

```python
# Replace the stock configs with this project's configs, then install
!rm -rf PaddleDetection/configs/
!cp -r configs/ PaddleDetection
%cd PaddleDetection
!pip install -r requirements.txt
!python setup.py install

# Symlink the dataset into the location the configs expect
!ln -s ~/data/data153372/trafficlight dataset/trafficlight
```
For the model, PP-YOLOE-M is used as the detector.

PP-YOLOE-l reaches 51.4 mAP on the COCO dataset: +1.9 AP and 13.35% faster than PP-YOLOv2, and +1.3 AP and 24.96% faster than YOLOX.

Its main improvements (per the PP-YOLOE report) are an anchor-free design, a stronger backbone and neck built from CSPRepResStage blocks, Task Alignment Learning (TAL) for dynamic label assignment, and an Efficient Task-aligned head (ET-head).
Hyperparameter configuration file `ppyoloe_trafficlight.yml`:
```yaml
_BASE_: [
  '../datasets/traffic.yml',
  '../runtime.yml',
  './_base_/optimizer_300e.yml',
  './_base_/ppyoloe_crn.yml',
  './_base_/ppyoloe_reader.yml',
]
log_iter: 100
snapshot_epoch: 10
epoch: 300
weights: output/ppyoloe_trafficlight/model_final
find_unused_parameters: True
pretrain_weights: https://paddledet.bj.bcebos.com/models/ppyoloe_crn_m_300e_coco.pdparams
depth_mult: 0.67
width_mult: 0.75

TrainReader:
  batch_size: 16

LearningRate:
  base_lr: 0.0001
  schedulers:
    - !CosineDecay
      max_epochs: 360
    - !LinearWarmup
      start_factor: 0.00001
      steps: 2300
```

Dataset configuration file `traffic.yml`:
```yaml
metric: COCO
num_classes: 4

TrainDataset:
  !COCODataSet
    image_dir: JPEGImages
    anno_path: train.json
    dataset_dir: dataset/trafficlight
    data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd']

EvalDataset:
  !COCODataSet
    image_dir: JPEGImages
    anno_path: eval.json
    dataset_dir: dataset/trafficlight

TestDataset:
  !ImageFolder
    anno_path: eval.json # also support txt (like VOC's label_list.txt)
    dataset_dir: dataset/trafficlight # if set, anno_path will be 'dataset_dir/anno_path'
```
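One subtlety worth noting: the raw category ids in the json are 4-7, not 0-3. PaddleDetection's COCODataSet remaps category ids to contiguous class indices during loading, so `num_classes: 4` is correct here. A quick check with pycocotools (a sketch; run from the PaddleDetection directory so the `dataset/trafficlight` symlink resolves):

```python
from pycocotools.coco import COCO

coco = COCO('dataset/trafficlight/train.json')
print(coco.getCatIds())                                      # expect [4, 5, 6, 7]
print([c['name'] for c in coco.loadCats(coco.getCatIds())])  # the four class names
```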
```python
# Train the model with VisualDL logging and mixed-precision (AMP) training
!python -m paddle.distributed.launch --gpus 0 tools/train.py \
    -c configs/ppyoloe/ppyoloe_trafficlight.yml --eval --use_vdl True --amp
```
```
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.356
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.594
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.380
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.261
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.571
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.438
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.263
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.503
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.509
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.432
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.695
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.526
```
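To re-run evaluation on a saved checkpoint outside of training, PaddleDetection's standard eval entry point can be used (a sketch; the weights path assumes the training run above):

```python
!python tools/eval.py -c configs/ppyoloe/ppyoloe_trafficlight.yml \
    -o weights=output/ppyoloe_trafficlight/best_model.pdparams
```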
Detection models need NMS in post-processing. In an earlier project, 使用TensorRT加速YOLOv6-Python (Accelerating YOLOv6 with TensorRT in Python), I showed how to accelerate a detector with TensorRT. There, the exported ONNX model did not include NMS (an ONNX model exported from PaddleDetection with NMS included cannot be converted to a TensorRT serialized engine), so NMS had to run as a separate post-processing pass over the raw outputs, making inference slower than the end-to-end scheme described next. Below we show how to fold NMS into the model itself to achieve end-to-end acceleration.
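For contrast, this is roughly what the host-side post-processing looks like in the non-end-to-end pipeline: a classic greedy NMS pass over raw boxes and scores, run on the CPU after every inference. A minimal single-class sketch for illustration, not the project's actual code:

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.45):
    """Greedy NMS. boxes: (N, 4) as x1, y1, x2, y2. Returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the current best box and the remaining ones
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]   # drop heavily overlapping boxes
    return keep
```

Running this per class on every frame adds CPU latency and host-device transfers that the EfficientNMS_TRT plugin below eliminates by keeping the whole pipeline on the GPU.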
```python
# Export the inference model and convert it to ONNX
!pip install onnxruntime onnx paddle2onnx

!python tools/export_model.py -c configs/ppyoloe/ppyoloe_trafficlight.yml \
    --output_dir=inference_model \
    -o weights=output/ppyoloe_trafficlight/best_model.pdparams \
    trt=True exclude_nms=True \
    TestReader.inputs_def.image_shape=[3,640,640]

!paddle2onnx --model_dir inference_model/ppyoloe_trafficlight/ \
    --model_filename model.pdmodel \
    --params_filename model.pdiparams \
    --opset_version 13 \
    --save_file ppyoloe.onnx \
    --input_shape_dict "{'image': [1, 3, 640, 640], 'scale_factor': [1, 2]}"

# Dependencies for the TensorRT workflow
!python3 -m pip install --upgrade setuptools pip --user
!pip install -U numpy
!pip install -U onnx
!python3 -m pip install nvidia-pyindex
!python3 -m pip install --upgrade nvidia-tensorrt
!pip install pycuda
!pip install onnx_graphsurgeon
!pip install onnx-simplifier
```
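Before any graph surgery, it is worth sanity-checking that the exported ONNX model loads and exposes the expected tensors. A sketch using onnxruntime, run from the PaddleDetection directory where `ppyoloe.onnx` was saved:

```python
import onnxruntime as ort

sess = ort.InferenceSession('ppyoloe.onnx', providers=['CPUExecutionProvider'])
for i in sess.get_inputs():
    print('input :', i.name, i.shape, i.type)
for o in sess.get_outputs():
    print('output:', o.name, o.shape, o.type)
```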
Inspecting the resulting ONNX model with Netron gives the node information. The nodes Mul_100 and Concat_14 correspond to the model's bounding boxes and confidences, as shown in the figures below; onnx_graphsurgeon is then used to register TensorRT's NMS plugin into the graph.

Original ONNX model

ONNX model with NMS registered
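Note that node names such as Mul_100 and Concat_14 are specific to this particular export and can change with paddle2onnx or opset versions. Besides Netron, candidate nodes can also be located programmatically; a sketch:

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load('./PaddleDetection/ppyoloe.onnx'))
# Print Mul/Concat nodes with their output shapes to find the boxes/scores tensors
for node in graph.nodes:
    if node.op in ('Mul', 'Concat'):
        print(node.name, node.op, [out.shape for out in node.outputs])
```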
```python
%cd /home/aistudio/
```

```
/home/aistudio
```
```python
import onnx
import onnx_graphsurgeon as gs
import numpy as np
from onnx import shape_inference
from collections import OrderedDict
import onnxsim

## Register NMS
w = './PaddleDetection/ppyoloe.onnx'
gs_graph = gs.import_onnx(onnx.load(w))
# fold constants
gs_graph.fold_constants()
gs_graph.cleanup().toposort()

# Locate the boxes (Mul_100) and confidences (Concat_14) nodes
Mul_100 = [node for node in gs_graph.nodes if node.name == 'Mul_100'][0]
Concat_14 = [node for node in gs_graph.nodes if node.name == 'Concat_14'][0]

# Transpose the scores to the [1, num_boxes, num_classes] layout the plugin expects
scores = gs.Variable(name='scores', shape=[1, 8400, 4], dtype=np.float32)
Transpose = gs.Node(name='lastTranspose', op='Transpose',
                    inputs=[Concat_14.outputs[0]],
                    outputs=[scores],
                    attrs=OrderedDict(perm=[0, 2, 1]))
gs_graph.nodes.append(Transpose)

# Keep only the image input; expose boxes and scores as graph outputs
Mul_100.outputs[0].name = 'boxes'
gs_graph.inputs = [gs_graph.inputs[0]]
gs_graph.outputs = [Mul_100.outputs[0], scores]
gs_graph.outputs[0].dtype = np.float32
gs_graph.outputs[1].dtype = np.float32
gs_graph.cleanup().toposort()

# Run shape inference and simplify the graph
onnx_graph = shape_inference.infer_shapes(gs.export_onnx(gs_graph))
onnx_graph, check = onnxsim.simplify(onnx_graph)
gs_graph = gs.import_onnx(onnx_graph)

# Append TensorRT's EfficientNMS_TRT plugin, fed by boxes and scores
op_inputs = gs_graph.outputs
op = "EfficientNMS_TRT"
attrs = {
    "plugin_version": "1",
    "background_class": -1,
    "max_output_boxes": 100,
    "score_threshold": 0.25,
    "iou_threshold": 0.45,
    "score_activation": False,
    "box_coding": 0,
}
output_num_detections = gs.Variable(
    name="num_dets",
    dtype=np.int32,
    shape=[1, 1],
)
output_boxes = gs.Variable(
    name="det_boxes",
    dtype=np.float32,
    shape=[1, 100, 4],
)
output_scores = gs.Variable(
    name="det_scores",
    dtype=np.float32,
    shape=[1, 100],
)
output_labels = gs.Variable(
    name="det_classes",
    dtype=np.int32,
    shape=[1, 100],
)
op_outputs = [
    output_num_detections, output_boxes, output_scores, output_labels
]

TRT = gs.Node(op=op, name="batched_nms", inputs=op_inputs, outputs=op_outputs, attrs=attrs)
gs_graph.nodes.append(TRT)
gs_graph.outputs = op_outputs
gs_graph.cleanup().toposort()
onnx.save(gs.export_onnx(gs_graph), './ppyoloe_nms.onnx')
print('Add NMS')
```

```
Add NMS
```
```python
import tensorrt
print(tensorrt.__version__)
assert tensorrt.Builder(tensorrt.Logger())
```

```
8.4.1.5
```
```python
# Build a FP16 TensorRT engine from the ONNX model with the registered NMS
!python export.py -o ppyoloe_nms.onnx -e ppyoloe_nms_fp16.trt -p fp16
```
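export.py here is an external helper script; under the hood it wraps the standard TensorRT 8 ONNX build flow. A minimal equivalent sketch with the TensorRT Python API (file names as above; the workspace size is an assumption):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model that already contains the EfficientNMS_TRT node
with open('ppyoloe_nms.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('ONNX parse failed')

config = builder.create_builder_config()
config.max_workspace_size = 4 << 30    # 4 GB scratch space (assumed)
config.set_flag(trt.BuilderFlag.FP16)  # equivalent of -p fp16

engine_bytes = builder.build_serialized_network(network, config)
with open('ppyoloe_nms_fp16.trt', 'wb') as f:
    f.write(engine_bytes)
```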
```python
import numpy as np
import cv2
import time
import os
import tensorrt as trt
import pycuda.autoinit
import pycuda.driver as cuda

trt.init_libnvinfer_plugins(None, '')  # load TensorRT plugins (needed for the NMS plugin)
```

```
True
```
```python
class BaseEngine(object):
    def __init__(self, engine_path):
        # Deserialize the engine and allocate host/device buffers per binding
        logger = trt.Logger(trt.Logger.WARNING)
        runtime = trt.Runtime(logger)
        with open(engine_path, "rb") as f:
            serialized_engine = f.read()
        engine = runtime.deserialize_cuda_engine(serialized_engine)
        self.context = engine.create_execution_context()
        self.inputs, self.outputs, self.bindings = [], [], []
        self.stream = cuda.Stream()
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def infer(self, img):
        self.inputs[0]['host'] = np.ravel(img)
        # transfer data to the gpu
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp['device'], inp['host'], self.stream)
        # run inference
        self.context.execute_async_v2(
            bindings=self.bindings,
            stream_handle=self.stream.handle)
        # fetch outputs from gpu
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)
        # synchronize stream
        self.stream.synchronize()
        data = [out['host'] for out in self.outputs]
        return data


def preproc(image, input_size, mean, std, swap=(2, 0, 1)):
    # Letterbox resize onto a gray (114) canvas, then normalize and CHW-transpose
    if len(image.shape) == 3:
        padded_img = np.ones((input_size[0], input_size[1], 3)) * 114.0
    else:
        padded_img = np.ones(input_size) * 114.0
    img = np.array(image)
    r = min(input_size[0] / img.shape[0], input_size[1] / img.shape[1])
    resized_img = cv2.resize(
        img,
        (int(img.shape[1] * r), int(img.shape[0] * r)),
        interpolation=cv2.INTER_LINEAR,
    ).astype(np.float32)
    padded_img[: int(img.shape[0] * r), : int(img.shape[1] * r)] = resized_img
    padded_img = padded_img[:, :, ::-1]  # BGR -> RGB
    padded_img /= 255.0
    if mean is not None:
        padded_img -= mean
    if std is not None:
        padded_img /= std
    padded_img = padded_img.transpose(swap)
    padded_img = np.ascontiguousarray(padded_img, dtype=np.float32)
    return padded_img, r


_COLORS = np.array(
    [0.000, 0.447, 0.741,
     0.850, 0.325, 0.098,
     0.929, 0.694, 0.125,
     0.494, 0.184, 0.556,
     0.466, 0.674, 0.188]
).astype(np.float32).reshape(-1, 3)


def vis(img, boxes, scores, cls_ids, conf=0.5, class_names=None):
    # Draw boxes and class labels above a confidence threshold
    for i in range(len(boxes)):
        box = boxes[i]
        cls_id = int(cls_ids[i])
        score = scores[i]
        if score < conf:
            continue
        x0 = int(box[0])
        y0 = int(box[1])
        x1 = int(box[2])
        y1 = int(box[3])
        color = (_COLORS[cls_id] * 255).astype(np.uint8).tolist()
        text = '{}:{:.1f}%'.format(class_names[cls_id], score * 100)
        txt_color = (0, 0, 0) if np.mean(_COLORS[cls_id]) > 0.5 else (255, 255, 255)
        font = cv2.FONT_HERSHEY_SIMPLEX
        txt_size = cv2.getTextSize(text, font, 0.4, 1)[0]
        cv2.rectangle(img, (x0, y0), (x1, y1), color, 2)
        txt_bk_color = (_COLORS[cls_id] * 255 * 0.7).astype(np.uint8).tolist()
        cv2.rectangle(
            img,
            (x0, y0 + 1),
            (x0 + txt_size[0] + 1, y0 + int(1.5 * txt_size[1])),
            txt_bk_color,
            -1
        )
        cv2.putText(img, text, (x0, y0 + txt_size[1]), font, 0.4, txt_color, thickness=1)
    return img


class Predictor(BaseEngine):
    def __init__(self, engine_path, imgsz=(640, 640)):
        super(Predictor, self).__init__(engine_path)
        self.imgsz = imgsz
        self.class_names = ['Traffic Light-Red Light', 'Traffic Light-Yellow Light',
                            'Traffic Light-Green Light', 'Traffic Light-Off']

    def inference(self, img_path):
        origin_img = cv2.imread(img_path)
        mean = (0.485, 0.456, 0.406)
        std = (0.229, 0.224, 0.225)
        img, ratio = preproc(origin_img, self.imgsz, mean, std)
        # The engine returns the four EfficientNMS_TRT outputs directly
        num, final_boxes, final_scores, final_cls_inds = self.infer(img)
        final_boxes = np.reshape(final_boxes, (-1, 4))
        num = num[0]
        if num > 0:
            final_boxes, final_scores, final_cls_inds = \
                final_boxes[:num] / ratio, final_scores[:num], final_cls_inds[:num]
            origin_img = vis(origin_img, final_boxes, final_scores, final_cls_inds,
                             conf=0.3, class_names=self.class_names)
        cv2.imwrite("%s_ppyoloe.jpg" % os.path.splitext(
            os.path.split(img_path)[-1])[0], origin_img)

    def get_fps(self):
        # warmup
        img = np.ones((1, 3, 640, 640))
        img = np.ascontiguousarray(img, dtype=np.float32)
        for _ in range(20):
            _ = self.infer(img)
        t1 = time.perf_counter()
        _ = self.infer(img)
        print(1 / (time.perf_counter() - t1), 'FPS')


pred = Predictor(engine_path='ppyoloe_nms.trt')
pred.get_fps()
```
```
[07/07/2022-11:27:45] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.2.0
[07/07/2022-11:27:45] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.2.0
199.98711579598273 FPS
```
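The figure above is timed from a single inference after warmup, so it fluctuates run to run; averaging over many iterations gives a steadier estimate. A sketch, reusing the `pred` object from above:

```python
import time
import numpy as np

img_dummy = np.ascontiguousarray(np.ones((1, 3, 640, 640), dtype=np.float32))
N = 100
t0 = time.perf_counter()
for _ in range(N):
    _ = pred.infer(img_dummy)
print(N / (time.perf_counter() - t0), 'FPS (mean over %d runs)' % N)
```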
```python
pred.inference('data/data153372/trafficlight/img_without_anno/08015.jpg')
```
### Results

PP-YOLOE-M end-to-end:
| Device | FP32 | FP16 | INT8 |
|---|---|---|---|
| A100 | 224FPS | 330FPS | - |
| 1080Ti | 101FPS | 103FPS | 171FPS |
Official PP-YOLOE baseline (V100, from the PaddleDetection model zoo):

| Model | GPUs | Images per GPU | Backbone | Input size | Box AP (val) 0.5:0.95 | Box AP (test) 0.5:0.95 | Params (M) | FLOPs (G) | V100 FP32 (FPS) | V100 TensorRT FP16 (FPS) | Download | Config |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PP-YOLOE-s | 8 | 32 | cspresnet-s | 640 | 43.0 | 43.2 | 7.93 | 17.36 | 208.3 | 333.3 | model | config |
| PP-YOLOE-m | 8 | 28 | cspresnet-m | 640 | 49.0 | 49.1 | 23.43 | 49.91 | 123.4 | 208.3 | model | config |
| PP-YOLOE-l | 8 | 20 | cspresnet-l | 640 | 51.4 | 51.6 | 52.20 | 110.07 | 78.1 | 149.2 | model | config |
| PP-YOLOE-x | 8 | 16 | cspresnet-x | 640 | 52.3 | 52.4 | 98.42 | 206.59 | 45.0 | 95.2 | model | config |
The end-to-end acceleration strategy clearly yields a large speedup in inference: even at FP32, the A100 engine (224 FPS) comfortably exceeds the official V100 baseline figures.