CN112926461B - Neural network training and driving control method and apparatus

Info

Publication number: CN112926461B (application number CN202110224337.0A)
Authority: CN (China)
Legal status: Active (granted)
Other versions: CN112926461A
Original language: Chinese (zh)
Inventors: 王泰, 祝新革, 林达华
Assignee: Sensetime Group Ltd
Events: application CN202110224337.0A filed by Sensetime Group Ltd; publication of CN112926461A; application granted; publication of CN112926461B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods


Abstract

The present disclosure provides a neural network training method, a driving control method, an apparatus, an electronic device, and a storage medium. The method includes: obtaining a training sample; determining, based on the scale of a two-dimensional annotation box of an object to be detected in the training sample, the sub-network structure in a target neural network that matches each pixel point within the two-dimensional annotation box, where different sub-network structures are used to extract features from pixel points within two-dimensional annotation boxes of different scales in the training sample; determining three-dimensional annotation data of the object to be detected that corresponds to at least one pixel point within the two-dimensional annotation box in the training sample; and training each sub-network structure in the target neural network based on the at least one pixel point with three-dimensional annotation data that corresponds to that sub-network structure, to obtain a trained target neural network.

Description

Neural network training and driving control method and apparatus

Technical Field

The present disclosure relates to the field of deep learning technology, and in particular to a neural network training method, a driving control method, an apparatus, an electronic device, and a storage medium.

Background Art

Object detection is one of the core tasks of computer vision and is widely used in fields such as autonomous driving and mobile robotics. Its main task is to locate the objects of interest in an image and to determine the category and bounding box of each object. Because the depth of an object in a three-dimensional scene is difficult to determine, object detection performs poorly in three-dimensional scenes. For example, in autonomous driving, for an autonomous vehicle to drive smoothly and safely on the road, it must determine accurate three-dimensional information for every surrounding object.

Therefore, a method that can determine the three-dimensional detection result of a target more accurately is increasingly important.

Summary of the Invention

In view of this, the present disclosure provides at least a neural network training method, a driving control method, an apparatus, an electronic device, and a storage medium.

In a first aspect, the present disclosure provides a neural network training method, comprising:

obtaining a training sample;

determining, based on the scale of a two-dimensional annotation box of an object to be detected in the training sample, the sub-network structure in a target neural network that matches each pixel point within the two-dimensional annotation box, wherein different sub-network structures are used to extract features from pixel points within two-dimensional annotation boxes of different scales in the training sample;

determining three-dimensional annotation data of the object to be detected that corresponds to at least one pixel point within the two-dimensional annotation box in the training sample; and

training each sub-network structure in the target neural network based on the at least one pixel point with three-dimensional annotation data that corresponds to that sub-network structure, to obtain a trained target neural network.

In the above method, by determining the sub-network structure in the target neural network that matches each pixel point within the two-dimensional annotation box, each sub-network structure is trained with pixel points from two-dimensional annotation boxes of the scale assigned to it, so that after training the different sub-network structures can perform feature extraction and three-dimensional object detection for target objects of different scales in an image to be detected. Because multiple different sub-network structures can perform relatively dense, multi-level feature extraction and prediction, training every sub-network structure of the target neural network on the training samples yields a target neural network with good performance for three-dimensional object detection.

In a possible implementation, determining, based on the scale of the two-dimensional annotation box of the object to be detected in the training sample, the sub-network structure that matches each pixel point within the two-dimensional annotation box includes:

for each pixel point within the two-dimensional annotation box, determining, based on the scale of the two-dimensional annotation box, the longest first distance between the pixel point and the edges of the two-dimensional annotation box; and

determining the sub-network structure that matches the pixel point based on the longest first distance and a preset distance range corresponding to each sub-network structure.

With this method, for each pixel point within the two-dimensional annotation box, the longest first distance between the pixel point and the edges of the box is determined from the scale of the box, and the matching sub-network structure is determined for the pixel point from that longest first distance. Pixel points with different longest first distances are matched to different sub-network structures, achieving pixel-level matching, so that after each sub-network structure is trained on its matched pixel points, the different trained sub-network structures can perform feature extraction and three-dimensional object detection for target objects of different scales in an image to be detected.

In a possible implementation, determining the three-dimensional annotation data of the object to be detected that corresponds to at least one pixel point within the two-dimensional annotation box in the training sample includes:

determining, based on the sub-network structures respectively matched to the pixel points within the two-dimensional annotation box, the target pixel points within the two-dimensional annotation box in the training sample that are foreground points; and

determining the three-dimensional annotation data of the object to be detected that corresponds to the target pixel points.

In a possible implementation, determining, based on the sub-network structures respectively matched to the pixel points within the two-dimensional annotation box, the target pixel points within the two-dimensional annotation box in the training sample that are foreground points includes:

for each pixel point within the two-dimensional annotation box, determining a second distance between the pixel point and the center point of the object to be detected in the two-dimensional annotation box;

determining a distance threshold for the pixel point based on the stride of the sub-network structure matched to the pixel point and a preset radius parameter; and

determining the pixel point to be a target pixel point when the second distance is smaller than the distance threshold.

When a pixel point is close to the center point of the object to be detected, the probability that the pixel point is a foreground point is high and its feature information is relatively reliable. Therefore, the distance threshold for the pixel point can be determined from the stride of the sub-network structure matched to the pixel point and a preset radius parameter; whether the pixel point is a foreground point is then decided from the second distance between the pixel point and the center point of the object to be detected in the two-dimensional annotation box and the determined distance threshold, so that the foreground points within the two-dimensional annotation box are determined relatively accurately.

In a possible implementation, the three-dimensional annotation data includes at least one of the following:

an offset characterizing the deviation between a pixel point and the center point of the corresponding object to be detected, the depth of the center point of the object to be detected corresponding to the pixel point, the size of the three-dimensional detection box, the orientation of the three-dimensional detection box, the orientation category, the velocity of the object to be detected, a centerness characterizing how close the pixel point is to the center point of the corresponding object to be detected, the target category of the object to be detected, and an attribute category characterizing the state of the object to be detected.

The data types included in the three-dimensional annotation data are thus rich and diverse.

In a possible implementation, before determining, based on the scale of the two-dimensional annotation box of the object to be detected in the training sample, the sub-network structure in the target neural network that matches each pixel point within the two-dimensional annotation box, the method further includes:

when the same pixel point in the training sample lies within multiple two-dimensional annotation boxes, determining a third distance between that pixel point and the center point of the object to be detected in each of the two-dimensional annotation boxes; and

taking the two-dimensional annotation box with the smallest third distance as the two-dimensional annotation box corresponding to that pixel point.

With this method, when the sub-network structures are trained on the training samples, one pixel point can correspond to only one piece of three-dimensional annotation data. Therefore, when the same pixel point lies within multiple two-dimensional annotation boxes, the two-dimensional annotation box corresponding to that pixel point can be determined relatively accurately from the third distances between the pixel point and the center points of the objects to be detected in the respective boxes, and the three-dimensional annotation data of that pixel point can in turn be determined relatively accurately.

In a possible implementation, each sub-network structure in the target neural network corresponds to a regression indicator obtained by training on the sample data, and the regression indicator is used to multiplicatively scale the predicted regression data in the three-dimensional prediction data output by the detection head network connected to that sub-network structure;

wherein the detection head network includes a classification network and a regression network, and is used to determine, based on the feature map output by each sub-network structure, the three-dimensional prediction data corresponding to that sub-network structure, the three-dimensional prediction data including predicted category data output by the classification network and predicted regression data output by the regression network.

Because the detection head network is shared by multiple sub-network structures, i.e., multiple sub-network structures are connected to the same detection head network, and because different sub-network structures are used to perform feature extraction and three-dimensional prediction for objects to be detected of different scales, a regression indicator can be trained for each sub-network structure. The regression indicator multiplicatively scales the predicted regression data output by the detection head network connected to the sub-network structure, so that the adjusted predicted regression data matches that sub-network structure.
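The paragraph above amounts to a shared detection head plus one learnable multiplicative factor per sub-network structure. Below is a minimal PyTorch-style sketch; the head layout, class count, regression dimension, and all names (ScaledSharedHead, scales, etc.) are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class ScaledSharedHead(nn.Module):
    """A shared detection head whose regression output is rescaled by a
    per-level learnable factor (the 'regression indicator' above)."""

    def __init__(self, channels: int, num_levels: int,
                 num_classes: int = 10, reg_dims: int = 7):
        super().__init__()
        # One classification branch and one regression branch, shared by all levels.
        self.cls_branch = nn.Conv2d(channels, num_classes, kernel_size=3, padding=1)
        self.reg_branch = nn.Conv2d(channels, reg_dims, kernel_size=3, padding=1)
        # One learnable scalar per pyramid level, initialized to 1.
        self.scales = nn.Parameter(torch.ones(num_levels))

    def forward(self, feat: torch.Tensor, level: int):
        cls_out = self.cls_branch(feat)                        # predicted category data
        reg_out = self.reg_branch(feat) * self.scales[level]   # level-specific scaling
        return cls_out, reg_out
```

With such a factor, the single shared head never has to absorb the magnitude differences between pyramid levels itself; the per-level scale does it.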

In a possible implementation, after the trained target neural network is obtained, the method further includes:

obtaining an image to be detected; and

detecting the image to be detected with the trained target neural network, and determining a three-dimensional detection result of at least one target object included in the image to be detected.

With this method, the trained, well-performing target neural network can be used to detect the image to be detected and obtain a relatively accurate three-dimensional detection result of at least one target object.

In a possible implementation, detecting the image to be detected with the target neural network and determining the three-dimensional detection result of at least one target object included in the image to be detected includes:

detecting the image to be detected with the target neural network to generate three-dimensional detection data corresponding to multiple pixel points in the image to be detected;

determining, based on the three-dimensional detection data corresponding to the multiple pixel points, information on multiple candidate three-dimensional detection boxes included in the image to be detected;

determining, based on the information on the multiple candidate three-dimensional detection boxes, projection-box information of the multiple candidate three-dimensional detection boxes in the bird's-eye view corresponding to the image to be detected; and

determining, based on the multiple pieces of projection-box information, the three-dimensional detection result of at least one target object included in the image to be detected.

With this method, by determining the projection-box information of the multiple candidate three-dimensional detection boxes in the bird's-eye view corresponding to the image to be detected, the three-dimensional detection result of at least one target object included in the image to be detected can be determined relatively accurately from the multiple pieces of projection-box information.

In a possible implementation, determining, based on the multiple pieces of projection-box information, the three-dimensional detection result of at least one target object included in the image to be detected includes:

determining, based on the confidence of the target category indicated by each piece of candidate three-dimensional detection box information and the corresponding centerness, a target confidence for the projection box corresponding to that candidate three-dimensional detection box, wherein the centerness characterizes how close the pixel point corresponding to the candidate three-dimensional detection box is to the center point of the corresponding object to be detected; and

determining, based on the target confidences corresponding to the respective projection boxes, the three-dimensional detection result of at least one target object included in the image to be detected.

The centerness characterizes how close the pixel point corresponding to a candidate three-dimensional detection box is to the center point of the corresponding object to be detected: the larger the centerness, the closer the pixel point is to the center of the object, the more reliable the feature information of that pixel point, and the more reliable the generated three-dimensional detection box information for that pixel point. Therefore, the target confidence of the projection box corresponding to each candidate three-dimensional detection box can be determined from the confidence of the target category indicated by the candidate box information and the centerness, and the target confidence can then be used to determine the three-dimensional detection result of at least one target object in the image to be detected relatively accurately.
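One common way to turn per-box target confidences into a final detection result is non-maximum suppression over the bird's-eye-view projection boxes. The sketch below assumes axis-aligned BEV boxes for simplicity and uses NMS as the selection rule; the patent itself only states that the result is determined from the target confidences, so the suppression step and all names are assumptions:

```python
import numpy as np

def bev_nms(boxes_bev: np.ndarray, cls_conf: np.ndarray,
            centerness: np.ndarray, iou_thr: float = 0.5):
    """Select final detections from candidate boxes projected to bird's-eye view.

    boxes_bev:  (N, 4) axis-aligned BEV boxes (x1, z1, x2, z2), a simplification;
    cls_conf:   (N,) confidence of the predicted target category;
    centerness: (N,) predicted centerness of the source pixel point.
    """
    # Target confidence of each projection box: category confidence
    # weighted by centerness, as described above.
    scores = cls_conf * centerness
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the highest-scoring box with the remaining boxes in BEV.
        x1 = np.maximum(boxes_bev[i, 0], boxes_bev[order[1:], 0])
        z1 = np.maximum(boxes_bev[i, 1], boxes_bev[order[1:], 1])
        x2 = np.minimum(boxes_bev[i, 2], boxes_bev[order[1:], 2])
        z2 = np.minimum(boxes_bev[i, 3], boxes_bev[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(z2 - z1, 0, None)
        area_i = (boxes_bev[i, 2] - boxes_bev[i, 0]) * (boxes_bev[i, 3] - boxes_bev[i, 1])
        area_r = (boxes_bev[order[1:], 2] - boxes_bev[order[1:], 0]) * \
                 (boxes_bev[order[1:], 3] - boxes_bev[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = order[1:][iou < iou_thr]
    return keep
```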

In a second aspect, the present disclosure provides a driving control method, comprising:

obtaining a road image collected by a driving device during driving;

detecting the road image with a target neural network trained by the neural network training method of any item of the first aspect, and obtaining target three-dimensional pose data of a target object included in the road image; and

controlling the driving device based on the target three-dimensional pose data of the target object included in the road image.

For the effects of the following apparatuses, electronic devices, and the like, refer to the description of the above methods; they are not repeated here.

In a third aspect, the present disclosure provides a neural network training apparatus, comprising:

a first obtaining module, configured to obtain a training sample;

a first determining module, configured to determine, based on the scale of a two-dimensional annotation box of an object to be detected in the training sample, the sub-network structure in a target neural network that matches each pixel point within the two-dimensional annotation box, wherein different sub-network structures are used to extract features from pixel points within two-dimensional annotation boxes of different scales in the training sample;

a second determining module, configured to determine three-dimensional annotation data of the object to be detected that corresponds to at least one pixel point within the two-dimensional annotation box in the training sample; and

a training module, configured to train each sub-network structure in the target neural network based on the at least one pixel point with three-dimensional annotation data that corresponds to that sub-network structure, to obtain a trained target neural network.

In a fourth aspect, the present disclosure provides a driving control apparatus, comprising:

a second obtaining module, configured to obtain a road image collected by a driving device during driving;

a detection module, configured to detect the road image with a target neural network trained by the neural network training method of any item of the first aspect, and obtain target three-dimensional pose data of a target object included in the road image; and

a control module, configured to control the driving device based on the target three-dimensional pose data of the target object included in the road image.

In a fifth aspect, the present disclosure provides an electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate via the bus, and when the machine-readable instructions are executed by the processor, the steps of the neural network training method of the first aspect or any implementation thereof, or the steps of the driving control method of the second aspect, are performed.

In a sixth aspect, the present disclosure provides a computer-readable storage medium storing a computer program that, when run by a processor, performs the steps of the neural network training method of the first aspect or any implementation thereof, or the steps of the driving control method of the second aspect.

To make the above objectives, features, and advantages of the present disclosure clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain its technical solutions. It should be understood that the following drawings show only certain embodiments of the present disclosure and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can derive other related drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a neural network training method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a two-dimensional annotation box of a training sample in a neural network training method provided by an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a training sample including multiple two-dimensional annotation boxes in a neural network training method provided by an embodiment of the present disclosure;

FIG. 4a is a schematic diagram of orientation categories of an object to be detected in a neural network training method provided by an embodiment of the present disclosure;

FIG. 4b is a schematic diagram of orientation categories of an object to be detected in a neural network training method provided by an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of the architecture of a target neural network in a neural network training method provided by an embodiment of the present disclosure;

FIG. 6 is a schematic flowchart of a driving control method provided by an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of the architecture of a neural network training apparatus provided by an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of the architecture of a driving control apparatus provided by an embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure;

FIG. 10 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.

Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments, as generally described and shown in the drawings, can be arranged and designed in many different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the claimed scope of the present disclosure, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.

Object detection is one of the core tasks of computer vision and is widely used in fields such as autonomous driving and mobile robotics. Its main task is to locate the objects of interest in an image and to determine the category and bounding box of each object. Because the depth of an object in a three-dimensional scene is difficult to determine, object detection performs poorly in three-dimensional scenes. For example, in autonomous driving, for an autonomous vehicle to drive smoothly and safely on the road, it must determine accurate three-dimensional information for every surrounding object.

To solve the above problems, embodiments of the present disclosure provide a neural network training method, a driving control method, an apparatus, an electronic device, and a storage medium.

The defects of the above solutions are results obtained by the inventors after practice and careful study; therefore, the discovery of the above problems and the solutions proposed below should all be regarded as the inventors' contributions to the present disclosure.

It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined and explained in subsequent drawings.

To facilitate understanding of the embodiments of the present disclosure, the neural network training method and driving control method disclosed herein are first introduced in detail. The execution subject of these methods is generally a computer device with a certain computing capability, for example a terminal device, a server, or another processing device; the terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the neural network training method and the driving control method may be implemented by a processor invoking computer-readable instructions stored in a memory.

Referring to FIG. 1, a schematic flowchart of the neural network training method provided by an embodiment of the present disclosure includes S101-S104:

S101: obtain a training sample.

S102: based on the scale of a two-dimensional annotation box of an object to be detected in the training sample, determine the sub-network structure in a target neural network that matches each pixel point within the two-dimensional annotation box, wherein different sub-network structures are used to extract features from pixel points within two-dimensional annotation boxes of different scales in the training sample.

S103: determine three-dimensional annotation data of the object to be detected that corresponds to at least one pixel point within the two-dimensional annotation box in the training sample.

S104: based on the at least one pixel point with three-dimensional annotation data that corresponds to each sub-network structure, train each sub-network structure in the target neural network to obtain a trained target neural network.

In the above method, by determining the sub-network structure in the target neural network that matches each pixel point within the two-dimensional annotation box, each sub-network structure is trained with pixel points from two-dimensional annotation boxes of the scale assigned to it, so that after training the different sub-network structures can perform feature extraction and three-dimensional object detection for target objects of different scales in an image to be detected. Because multiple different sub-network structures can perform relatively dense, multi-level feature extraction and prediction, training every sub-network structure of the target neural network on the training samples yields a target neural network with good performance for three-dimensional object detection.

S101-S104 are described in detail below.

For S101:

A training sample is obtained; the training sample includes sample images corresponding to various objects to be detected. For example, when the method is applied in the field of autonomous driving, the objects to be detected may include motor vehicles, non-motor vehicles, pedestrians, animals, and so on.

Each object to be detected in the training sample corresponds to one two-dimensional annotation box, which may be annotated manually or annotated automatically with a trained two-dimensional detection neural network. Each two-dimensional annotation box has a size; for example, the size of a two-dimensional annotation box may be 56×512, 512×218, 1024×218, 1024×1024, and so on.

For S102:

The target neural network includes a pyramid network structure that contains multiple different sub-network structures, and the different sub-network structures are used to extract features from pixel points within two-dimensional annotation boxes of different scales in the training sample.

As an example, the sub-network structure matching each pixel point within a two-dimensional annotation box may be determined from the maximum side length indicated by the scale of the box. For instance, a scale range may be set for each sub-network structure: sub-network structure one corresponds to the scale range (218, 512], and sub-network structure two corresponds to the scale range (512, 1024]. If two-dimensional annotation box one has a scale of 512×218, the maximum length indicated by its scale is 512, so every pixel point in two-dimensional annotation box one matches sub-network structure one.

In another optional implementation, determining, based on the scale of the two-dimensional annotation box of the object to be detected in the training sample, the sub-network structure matching each pixel point within the two-dimensional annotation box includes:

S1021: for each pixel point within the two-dimensional annotation box, determining, based on the scale of the two-dimensional annotation box, the longest first distance between the pixel point and the edges of the two-dimensional annotation box; and

S1022: determining the sub-network structure matching the pixel point based on the longest first distance and the preset distance range corresponding to each sub-network structure.

With this method, for each pixel point within the two-dimensional annotation box, the longest first distance between the pixel point and the edges of the box is determined from the scale of the box, and the matching sub-network structure is determined for the pixel point from that longest first distance. Pixel points with different longest first distances are matched to different sub-network structures, achieving pixel-level matching, so that after each sub-network structure is trained on its matched pixel points, the different trained sub-network structures can perform feature extraction and three-dimensional object detection for target objects of different scales in an image to be detected.

In S1021, for each pixel point within the two-dimensional annotation box, the first distance between the pixel point and each edge of the box can be determined from the scale of the box and the position of the pixel point, i.e., the first distances from the pixel point to the four edges are obtained. Referring to the schematic diagram of a two-dimensional annotation box in FIG. 2, the first distances from pixel point 21 to the four edges are t, b, l, and r, respectively; l has the largest value, so the first distance l is the longest first distance.

In S1022, a distance range can be set for each sub-network structure. For example, sub-network structure one corresponds to the distance range (0, 256], sub-network structure two to (256, 512], and sub-network structure three to (512, 1024]. If the longest first distance of pixel point A is 500, pixel point A matches sub-network structure two; if the longest first distance of pixel point B is 218, pixel point B matches sub-network structure one.

Through S1021 and S1022, the sub-network structure matching each pixel point within the two-dimensional annotation box is obtained; that is, the multiple pixel points corresponding to each sub-network structure are obtained, as illustrated by the sketch below.
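The per-pixel matching of S1021-S1022 fits in a few lines. A minimal sketch using the example distance ranges from the text; the function and variable names are illustrative, not from the patent:

```python
# Example distance ranges per sub-network structure, taken from the text above:
# structure one: (0, 256], structure two: (256, 512], structure three: (512, 1024].
LEVEL_RANGES = [(0, 256), (256, 512), (512, 1024)]

def match_sub_network(px: float, py: float, box: tuple) -> int:
    """Return the index of the sub-network structure matching pixel (px, py).

    box = (x1, y1, x2, y2) is the two-dimensional annotation box; the pixel
    is assumed to lie inside it.
    """
    x1, y1, x2, y2 = box
    # First distances from the pixel to the four edges (l, r, t, b in FIG. 2).
    l, r, t, b = px - x1, x2 - px, py - y1, y2 - py
    longest = max(l, r, t, b)  # the "longest first distance"
    for level, (low, high) in enumerate(LEVEL_RANGES):
        if low < longest <= high:
            return level
    return len(LEVEL_RANGES) - 1  # fall back to the coarsest level
```

With these ranges, a pixel whose longest first distance is 500 lands on sub-network structure two (index 1), matching the example of pixel point A in the text.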

In an optional implementation, before determining, based on the scale of the two-dimensional annotation box of the object to be detected in the training sample, the sub-network structure in the target neural network that matches each pixel point within the two-dimensional annotation box, the method further includes:

step one: when the same pixel point in the training sample lies within multiple two-dimensional annotation boxes, determining the third distance between that pixel point and the center point of the object to be detected in each two-dimensional annotation box; and

step two: taking the two-dimensional annotation box with the smallest third distance as the two-dimensional annotation box corresponding to that pixel point.

With this method, when the sub-network structures are trained on the training samples, one pixel point can correspond to only one piece of three-dimensional annotation data. Therefore, when the same pixel point lies within multiple two-dimensional annotation boxes, the two-dimensional annotation box corresponding to that pixel point can be determined relatively accurately from the third distances between the pixel point and the center points of the objects to be detected in the respective boxes, and the three-dimensional annotation data of that pixel point can in turn be determined relatively accurately.

The training sample contains multiple objects to be detected, each corresponding to one two-dimensional annotation box, and the boxes may overlap; a pixel point in an overlapping region lies within multiple two-dimensional annotation boxes, so one corresponding two-dimensional annotation box must be determined for that pixel point from among them.

In specific implementation, the third distance between the pixel point and the center point of the object to be detected in each two-dimensional annotation box can be determined, for example as the Euclidean distance between the pixel point and the center point of the detected object.

The center point of an object to be detected can be the projection point obtained by projecting the center of the object in the real scene onto the training sample. As an example, the coordinates of the center point of the object to be detected can be determined using the intrinsic parameter matrix of the capture device corresponding to the training sample.

Referring to FIG. 3, the figure includes a two-dimensional annotation box 31 corresponding to object one, with the center point 311 of object one inside box 31, and a two-dimensional annotation box 32 corresponding to object two, with the center point 321 of object two inside box 32. For a pixel point 33 located in the overlapping region of boxes 31 and 32, the third distance between pixel point 33 and center point 311 of object one and the third distance between pixel point 33 and center point 321 of object two can both be computed; the third distance to center point 311 is smaller, so pixel point 33 belongs to two-dimensional annotation box 31.
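A minimal sketch of the two steps above: projecting each object's 3D center into the image with the capture device's intrinsic matrix, then assigning an ambiguous pixel to the box whose projected center is nearest. The patent only states that the intrinsic matrix is used, so the pinhole model and all names here are illustrative assumptions:

```python
import numpy as np

def project_center(K: np.ndarray, center_3d: np.ndarray) -> np.ndarray:
    """Project a 3D object center (camera coordinates) to image coordinates
    using the intrinsic matrix K, assuming a standard pinhole model."""
    x, y, z = center_3d
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return np.array([u, v])

def assign_box(pixel: np.ndarray, projected_centers: list) -> int:
    """Assign a pixel lying in several overlapping boxes to the box whose
    projected object center is nearest (the smallest 'third distance')."""
    dists = [np.linalg.norm(pixel - c) for c in projected_centers]
    return int(np.argmin(dists))
```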

For S103:

The three-dimensional annotation data of the object to be detected corresponding to at least one pixel point within the two-dimensional annotation box in the training sample can be determined. The three-dimensional annotation data includes at least one of the following: the offset, the depth of the center point of the object to be detected corresponding to the pixel point, the size of the three-dimensional detection box, the orientation of the three-dimensional detection box, the orientation category, the velocity of the object to be detected, the centerness, the target category of the object to be detected, and the attribute category. The data types included in the three-dimensional annotation data are thus rich and diverse.

Here, the object to be detected corresponding to a pixel point is the object to which that pixel point belongs; for example, if a first pixel point is located on object one, the object to be detected corresponding to the first pixel point is object one.

The offset characterizes the deviation between a pixel point and the center point of the corresponding object to be detected, and includes a horizontal offset Δx and a vertical offset Δy: the horizontal offset characterizes the deviation between the horizontal coordinate of the pixel point and the horizontal coordinate indicated by the center point of the object, and the vertical offset characterizes the deviation between the vertical coordinate of the pixel point and the vertical coordinate indicated by the center point. The depth in the three-dimensional annotation data is the depth of the center point of the object to be detected corresponding to the pixel point; for example, the depth indicated in the three-dimensional annotation data of pixel point 33 in FIG. 3 is the depth, in the real scene, of center point 311 of object one.

The size of the three-dimensional detection box is the size information of the three-dimensional detection box of the object to be detected corresponding to the pixel point. For example, the size of the three-dimensional detection box indicated by the three-dimensional annotation data of pixel point 33 in FIG. 3 is the size of the three-dimensional detection box of object one.

The orientation of the three-dimensional detection box is an angle between 0 and π (180°), and the orientation category may include a forward category and a reverse category, or a first category and a second category. In specific implementation, the orientation of the three-dimensional detection box together with the orientation category can characterize the direction of the object to be detected relatively accurately.

Referring to the schematic diagram of orientation categories in FIG. 4a, two orientation categories of the object to be detected are shown: the first category (or forward category) on the left of FIG. 4a and the second category (or reverse category) on the right. After the training sample is obtained, the orientation category of each object to be detected in the training sample and its orientation (i.e., angle) under that category can be determined.

For example, for the object on the left of FIG. 4b, its orientation can be determined to be θ1 and its orientation category the first category (or forward category); for the object on the right of FIG. 4b, its orientation can be determined to be θ2 and its orientation category the second category (or reverse category).
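The split of a full heading angle into an orientation in [0, π) plus a binary orientation category can be encoded as below. The patent describes the representation but not this exact arithmetic, so treat this as one plausible implementation:

```python
import math

def encode_yaw(yaw: float) -> tuple:
    """Split a yaw angle in [0, 2π) into an orientation in [0, π)
    and a binary orientation category (0 = first/forward, 1 = second/reverse)."""
    category = 0 if yaw < math.pi else 1
    orientation = yaw - category * math.pi  # remainder in [0, π)
    return orientation, category

def decode_yaw(orientation: float, category: int) -> float:
    """Recover the full yaw angle from the orientation and its category."""
    return orientation + category * math.pi
```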

The velocity of the object to be detected may include a horizontal velocity vx and a vertical velocity vy; that is, the horizontal velocity characterizes the speed of the object along the horizontal axis, and the vertical velocity characterizes its speed along the vertical axis.

The centerness characterizes how close a pixel point is to the center point of the corresponding object to be detected, and can be determined according to the following formula (1):
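c = exp(−α · ((Δx)² + (Δy)²))    (1)

(The original formula image is not reproduced in this text; the expression above is a reconstruction consistent with the surrounding definitions of Δx, Δy, and α.)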

where Δx is the horizontal offset, Δy is the vertical offset, and α is a parameter set to adjust the intensity decay from the center point of the object to be detected outward to its periphery.

The target category of the object to be detected includes categories such as motor vehicle, non-motor vehicle, pedestrian, and animal. The attribute category of the object to be detected characterizes its state; for example, the attribute categories may include: moving, paused (the object is stationary for a short time), stopped (the object is stationary for a long time), cycling, walking, pedestrian standing, pedestrian lying down, pedestrian sitting, and so on.
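Gathering the fields enumerated above, the per-pixel three-dimensional annotation data can be represented as a simple record. The patent fixes the set of fields but not any concrete layout, so the field names and types below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Target3D:
    """Per-pixel three-dimensional annotation data, as enumerated above."""
    offset: tuple          # (Δx, Δy) to the object's projected center point
    depth: float           # depth of the object's center point
    size: tuple            # dimensions of the three-dimensional detection box
    orientation: float     # angle in [0, π)
    orientation_cls: int   # 0 = first/forward category, 1 = second/reverse
    velocity: tuple        # (vx, vy)
    centerness: float      # exp(-α((Δx)² + (Δy)²)), see formula (1)
    target_cls: int        # e.g. motor vehicle, pedestrian, ...
    attribute_cls: int     # e.g. moving, stopped, ...
```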

In S103, determining the three-dimensional annotation data of the object to be detected corresponding to at least one pixel point within the two-dimensional annotation box in the training sample includes:

S1031: determining, based on the sub-network structures respectively matched to the pixel points within the two-dimensional annotation box, the target pixel points within the two-dimensional annotation box in the training sample that are foreground points; and

S1032: determining the three-dimensional annotation data of the object to be detected corresponding to the target pixel points.

Here, the target pixel points within the two-dimensional annotation box in the training sample that are foreground points can first be determined based on the sub-network structures respectively matched to the pixel points within the box, and the pixel points within the box other than the target pixel points are determined to be background points. The three-dimensional annotation data of the target pixel points (the foreground points) is then determined, while the other pixel points (the background points) can be regarded as having no corresponding three-dimensional annotation data.

一种可能的实施方式中,S1031中,基于与二维标注框内像素点分别匹配的子网络结构,确定训练样本中二维标注框内属于前景点的目标像素点,包括:In a possible implementation, in S1031, based on the sub-network structures respectively matching the pixels in the two-dimensional annotation box, determining the target pixel points in the two-dimensional annotation box in the training sample that belong to the foreground point includes:

步骤一,针对二维标注框内的每个像素点,确定像素点与二维标注框内待检测对象的中心点之间的第二距离;Step 1: for each pixel point in the two-dimensional annotation frame, determine a second distance between the pixel point and the center point of the object to be detected in the two-dimensional annotation frame;

步骤二,基于与像素点匹配的子网络结构的步幅和预设的半径参数,确定像素点对应的距离阈值;Step 2: Determine the distance threshold corresponding to the pixel point based on the stride of the subnetwork structure matching the pixel point and the preset radius parameter;

步骤三,在第二距离小于距离阈值的情况下,确定像素点为目标像素点。Step three, when the second distance is less than the distance threshold, determine the pixel point as the target pixel point.

When a pixel point is close to the center point of the object to be detected, it is more likely to be a foreground point and its feature information is more reliable. Therefore, the distance threshold corresponding to each pixel point can be determined from the stride of the matched sub-network structure and the preset radius parameter; whether the pixel point is a foreground point is then judged by comparing the second distance between the pixel point and the center point of the object to be detected in the two-dimensional annotation box with the determined distance threshold, so that foreground points within the two-dimensional annotation box are determined more accurately.

The stride of each sub-network structure is preset, and different sub-network structures correspond to different strides. The radius parameter is a preset parameter used to determine foreground points within the two-dimensional annotation box; for example, it may be the length corresponding to 1.5 pixels, or 0.5 cm, etc. The radius parameter applies to every sub-network structure, that is, each sub-network structure corresponds to one radius parameter. Since different sub-network structures correspond to different strides, the distance threshold corresponding to each pixel point in the two-dimensional annotation box of the training sample can be determined from the stride and the radius parameter.

Foreground points are pixel points close to the center point of the object to be detected, while background points are pixel points far from it. Therefore, after the second distance of each pixel point within the two-dimensional annotation box of the training sample is determined, whether that second distance is less than the determined distance threshold is judged: if so, the pixel point is a foreground point; otherwise, it is a background point. In this way, the target pixel points belonging to foreground points and the other pixel points belonging to background points within the two-dimensional annotation box are determined, and the three-dimensional annotation data corresponding to the target pixel points belonging to foreground points can then be determined.
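A minimal sketch of steps one to three follows; the function name, the NumPy dependency, and the example stride and radius values are illustrative assumptions rather than values fixed by the disclosure:

```python
import numpy as np

def select_foreground(pixels, center, stride, radius=1.5):
    """Return a boolean mask over `pixels` (an (N, 2) array): True where
    the second distance to `center` is below stride * radius.

    `radius` is the preset radius parameter (1.5 pixels is an assumed
    value); `stride` is the stride of the matched sub-network structure,
    so coarser levels get proportionally larger distance thresholds."""
    pixels = np.asarray(pixels, dtype=np.float64)
    center = np.asarray(center, dtype=np.float64)
    second_distance = np.linalg.norm(pixels - center, axis=1)
    return second_distance < stride * radius

# Pixels within 1.5 strides of the annotated center become target
# (foreground) pixel points; the rest are treated as background points.
pixels = [(10, 10), (12, 11), (30, 30)]
print(select_foreground(pixels, center=(11, 10), stride=8))
# [ True  True False]
```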

针对S104:For S104:

针对每个子网络结构,基于该子网络结构对应的包含有三维标注数据的至少一个像素点,训练该子网络结构。或者,还可以基于该子网络结构对应的包含有三维标注数据的至少一个像素点、和该子网络结构对应的属于背景点的至少一个像素点,训练该子网络结构。通过对各个子网络结构进行多轮训练,得到训练后的目标神经网络。For each sub-network structure, the sub-network structure is trained based on at least one pixel point containing three-dimensional annotation data corresponding to the sub-network structure. Alternatively, the sub-network structure can also be trained based on at least one pixel point containing three-dimensional annotation data corresponding to the sub-network structure and at least one pixel point belonging to a background point corresponding to the sub-network structure. By performing multiple rounds of training on each sub-network structure, a trained target neural network is obtained.

When training the target neural network, the loss value corresponding to each kind of data in the three-dimensional annotation data can be determined based on the three-dimensional annotation data and the obtained three-dimensional prediction data, and the parameters of the target neural network can be adjusted using the weighted sum of these loss values until the trained target neural network meets a preset requirement, for example, until its accuracy reaches a set accuracy threshold, or until its total loss value is less than a set loss threshold.

示例性的,针对三维预测数据(或三维标注数据)中待检测对象的目标类别,可以使用焦点损失函数,确定目标类别对应的第一损失。针对待检测对象的属性类别,可以使用softmax分类损失函数,确定属性类别对应的第二损失;以及确定朝向类别对应的第三损失。针对待检测对象的偏移量、像素点对应的待检测对象的中心点的深度、三维检测框的尺寸、三维检测框的朝向、待检测对象的速度,可以分别使用平滑的L1损失函数,确定第四损失。针对中心度,可以使用二元交叉熵(BCE)损失函数,确定第五损失。Exemplarily, for the target category of the object to be detected in the three-dimensional prediction data (or three-dimensional annotation data), the focal loss function can be used to determine the first loss corresponding to the target category. For the attribute category of the object to be detected, the softmax classification loss function can be used to determine the second loss corresponding to the attribute category; and determine the third loss corresponding to the orientation category. For the offset of the object to be detected, the depth of the center point of the object to be detected corresponding to the pixel point, the size of the three-dimensional detection box, the orientation of the three-dimensional detection box, and the speed of the object to be detected, the smooth L1 loss function can be used to determine the fourth loss. For the centrality, the binary cross entropy (BCE) loss function can be used to determine the fifth loss.
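The following sketch combines the five losses described above into one weighted objective; the tensor shapes, dictionary keys, focal-loss exponent, and weights are all illustrative assumptions (PyTorch is assumed as the framework):

```python
import torch
import torch.nn.functional as F

def total_loss(pred: dict, target: dict, weights: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the per-task losses; `weights` holds one weight
    per loss term, mirroring the weighted-sum training described above."""
    # First loss: focal loss on the target-category logits (gamma = 2.0
    # is an assumed value).
    ce = F.cross_entropy(pred["cls"], target["cls"], reduction="none")
    pt = torch.exp(-ce)
    l_cls = ((1.0 - pt) ** 2.0 * ce).mean()
    # Second and third losses: softmax classification losses on the
    # attribute category and the orientation category.
    l_attr = F.cross_entropy(pred["attr"], target["attr"])
    l_dir = F.cross_entropy(pred["dir"], target["dir"])
    # Fourth loss: smooth L1 on the regression targets (offset, depth,
    # 3D box size, orientation, velocity), stacked into one tensor here.
    l_reg = F.smooth_l1_loss(pred["reg"], target["reg"])
    # Fifth loss: binary cross-entropy on the centrality.
    l_ctr = F.binary_cross_entropy_with_logits(pred["ctr"], target["ctr"])
    losses = torch.stack([l_cls, l_attr, l_dir, l_reg, l_ctr])
    return (losses * weights).sum()
```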

一种可选实施方式中,目标神经网络中每个子网络结构对应一个利用样本数据训练得到的回归指标,回归指标用于对与子网络结构相连的检测头网络输出的三维预测数据中的预测回归数据进行倍数调整;In an optional implementation, each sub-network structure in the target neural network corresponds to a regression index obtained by training with sample data, and the regression index is used to adjust the predicted regression data in the three-dimensional predicted data output by the detection head network connected to the sub-network structure by multiples;

其中,检测头网络包括分类网络和回归网络,检测头网络用于基于每个子网络结构输出的特征图,确定子网络结构对应的三维预测数据,三维预测数据包括分类网络输出的预测类别数据、和回归网络输出的预测回归数据。Among them, the detection head network includes a classification network and a regression network. The detection head network is used to determine the three-dimensional prediction data corresponding to the sub-network structure based on the feature map output by each sub-network structure. The three-dimensional prediction data includes the prediction category data output by the classification network and the prediction regression data output by the regression network.

The detection head network is shared by multiple sub-network structures, that is, multiple sub-network structures are connected to the same detection head network. Since different sub-network structures perform feature extraction and three-dimensional target prediction for objects to be detected at different scales, a regression index can be trained for each sub-network structure. The regression index scales (adjusts by a multiple) the predicted regression data in the three-dimensional prediction data output by the detection head network connected to that sub-network structure, so that the adjusted predicted regression data matches the sub-network structure.

参见图5所示的一种目标神经网络的架构示意图,该目标神经网络的结构可以包括主干网络、金字塔网络、检测头网络,其中,金字塔网络中包括多个不同子网络结构,不同子网络结构对应的输入特征的尺寸和/或输出特征的尺寸不同。其中,每个子网络结构对应一个检测头网络,即各个子网络结构连接的检测头网络共享网络参数。Referring to the schematic diagram of the architecture of a target neural network shown in FIG5 , the structure of the target neural network may include a backbone network, a pyramid network, and a detection head network, wherein the pyramid network includes a plurality of different sub-network structures, and the sizes of input features and/or output features corresponding to different sub-network structures are different. Each sub-network structure corresponds to a detection head network, that is, the detection head networks connected to the various sub-network structures share network parameters.

示例性的,样本图像为训练样本中的任一图像,或者,也可以为待检测图像。在将样本图像输入至目标神经网络之后,可以先使用至少一层卷积层,对样本图像进行特征提取,得到样本图像对应的特征图,再将得到的特征图输入至主干网络中,进行预测,得到三维预测数据。Exemplarily, the sample image is any image in the training sample, or it can also be an image to be detected. After the sample image is input into the target neural network, at least one convolution layer can be used to extract features of the sample image to obtain a feature map corresponding to the sample image, and then the obtained feature map is input into the backbone network for prediction to obtain three-dimensional prediction data.

In a specific implementation, to reduce the video memory consumption of the device on which the target neural network is deployed, the convolution layers that extract features from the sample image can be configured with smaller feature sizes. To balance the efficiency and accuracy of the target neural network, ResNet101 with deformable convolution can be used in the backbone network; for example, one or more convolution layers in the ResNet101 backbone can be set as deformable convolutions.

The detection head network includes a classification network, which outputs predicted category data comprising one or more of the target category, the attribute category, and the orientation category. The detection head network also includes a regression network, which outputs predicted regression data comprising one or more of the offset, the depth, the size of the three-dimensional detection box, the orientation of the three-dimensional detection box, the speed of the object to be detected, and the centrality.

由于不同子网络结构用于处理不同尺度的待检测对象,即不同子网络结构输出的预测回归数据的尺寸不同,故每个子网络结构可以对应一个可训练的回归指标,该训练后的回归指标可以用于对与子网络结构相连的检测头网络输出的三维预测数据中的预测回归数据进行倍数调整。比如,若子网络结构一对应的训练后的回归指标为X1、子网络结构二对应的训练后的回归指标为X2,则可以将与子网络结构一相连的检测头网络输出的预测回归数据与回归指标X1相乘,得到子网络结构一对应的预测回归数据;将与子网络结构二相连的检测头网络输出的预测回归数据与回归指标X2相乘,得到子网络结构二对应的预测回归数据。Since different sub-network structures are used to process objects to be detected of different scales, that is, the sizes of the predicted regression data output by different sub-network structures are different, each sub-network structure can correspond to a trainable regression index, and the trained regression index can be used to adjust the predicted regression data in the three-dimensional predicted data output by the detection head network connected to the sub-network structure. For example, if the trained regression index corresponding to sub-network structure one is X1 and the trained regression index corresponding to sub-network structure two is X2 , then the predicted regression data output by the detection head network connected to sub-network structure one can be multiplied by the regression index X1 to obtain the predicted regression data corresponding to sub-network structure one; the predicted regression data output by the detection head network connected to sub-network structure two can be multiplied by the regression index X2 to obtain the predicted regression data corresponding to sub-network structure two.
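A sketch of such a trainable regression index follows; the class name, the number of levels, and the initial value are assumptions (PyTorch assumed):

```python
import torch
import torch.nn as nn

class RegressionScale(nn.Module):
    """One trainable regression index per sub-network structure. The
    shared detection head stays identical across levels; only this
    learned multiplier (X1, X2, ...) differs per level."""
    def __init__(self, init_value: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_value))

    def forward(self, regression: torch.Tensor) -> torch.Tensor:
        # Multiply the shared head's raw regression output so that it
        # matches the scale range this sub-network structure handles.
        return regression * self.scale

scales = nn.ModuleList([RegressionScale() for _ in range(5)])  # 5 levels assumed
raw = torch.randn(2, 6, 32, 32)        # shared-head regression output
adjusted_level0 = scales[0](raw)       # predicted regression data, level 0
```

Because the scale is an nn.Parameter, it is updated together with the network weights during training, which matches the description that the regression index is obtained by training.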

具体实施时,在利用训练样本对目标神经网络进行训练时,可以对该回归指标进行训练,在得到训练后的目标神经网络时,得到了每个子网络结构对应的训练后的回归指标。In specific implementation, when the target neural network is trained using the training samples, the regression index can be trained, and when the trained target neural network is obtained, the trained regression index corresponding to each sub-network structure is obtained.

一种可选实施方式中,在得到训练后的目标神经网络之后,还包括:In an optional implementation, after obtaining the trained target neural network, the method further includes:

S105,获取待检测图像;S105, obtaining an image to be detected;

S106,利用训练后的目标神经网络对待检测图像进行检测,确定待检测图像中包括的至少一个目标对象的三维检测结果。S106, using the trained target neural network to detect the image to be detected, and determining a three-dimensional detection result of at least one target object included in the image to be detected.

采用上述方法,可以利用性能较好的、训练后的目标神经网络对待检测图像进行检测,得到较为准确的至少一个目标对象的三维检测结果。By adopting the above method, a trained target neural network with good performance can be used to detect the image to be detected, and a relatively accurate three-dimensional detection result of at least one target object can be obtained.

The image to be detected can be any frame of image. The acquired image to be detected is input into the trained target neural network, which detects it and determines the three-dimensional detection result of at least one target object included in the image. The three-dimensional detection result may include at least one of the following: the size (length, width, and height) of the three-dimensional detection box corresponding to the target object, the coordinate information of the center point of the three-dimensional detection box (horizontal-axis, longitudinal-axis, and vertical-axis coordinates), and the target category, attribute category, orientation, orientation category, speed, centrality, and confidence of the target object. The target object can be any object in the image to be detected.

在S106中,利用目标神经网络对待检测图像进行检测,确定待检测图像中包括的至少一个目标对象的三维检测结果,包括:In S106, the target neural network is used to detect the image to be detected, and a three-dimensional detection result of at least one target object included in the image to be detected is determined, including:

S1061,利用目标神经网络对待检测图像进行检测,生成待检测图像中多个像素点对应的三维检测数据;S1061, using the target neural network to detect the image to be detected, and generating three-dimensional detection data corresponding to multiple pixel points in the image to be detected;

S1062,基于多个像素点分别对应的三维检测数据,确定待检测图像中包括的多个候选三维检测框信息;S1062, determining multiple candidate three-dimensional detection box information included in the image to be detected based on the three-dimensional detection data corresponding to the multiple pixel points respectively;

S1063,基于多个候选三维检测框信息,确定多个候选三维检测框在待检测图像对应的鸟瞰图中的投影框信息;S1063, based on the information of the multiple candidate 3D detection frames, determining the projection frame information of the multiple candidate 3D detection frames in the bird's-eye view corresponding to the image to be detected;

S1064,基于多个投影框信息,确定待检测图像中包括的至少一个目标对象的三维检测结果。S1064: Determine a three-dimensional detection result of at least one target object included in the image to be detected based on the multiple projection frame information.

采用上述方法,通过确定多个候选三维检测框在待检测图像对应的鸟瞰图中的投影框信息,基于多个投影框信息,较准确的确定待检测图像中包括的至少一个目标对象的三维检测结果。By adopting the above method, by determining the projection frame information of multiple candidate three-dimensional detection frames in the bird's-eye view corresponding to the image to be detected, based on the multiple projection frame information, the three-dimensional detection result of at least one target object included in the image to be detected is more accurately determined.

In a specific implementation of S1061, the target neural network can be used to detect the image to be detected and generate three-dimensional detection data corresponding to each pixel point in the image. Three-dimensional detection data corresponding to multiple pixel points can then be selected according to the confidence indicated by the three-dimensional detection data of each pixel point. For example, a confidence threshold can be set, and the three-dimensional detection data of the pixel points whose confidence is greater than the threshold are selected from the data of all pixel points in the image to be detected. Alternatively, a selection quantity threshold can be set; for example, with the threshold set to 100, the three-dimensional detection data of the 100 pixel points with the highest confidence are selected in descending order of confidence.
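A sketch of this selection step (the function name and tensor layout are assumptions, PyTorch assumed):

```python
import torch

def select_candidates(confidence: torch.Tensor, k: int = 100,
                      conf_threshold: float = None) -> torch.Tensor:
    """Select candidate pixel points either by a confidence threshold or
    by taking the top-k confidences (k = 100 mirrors the example above).
    Returns flat indices into the per-pixel three-dimensional detection
    data."""
    flat = confidence.flatten()
    if conf_threshold is not None:
        return (flat > conf_threshold).nonzero(as_tuple=True)[0]
    return flat.topk(min(k, flat.numel())).indices
```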

在S1062中,可以基于多个像素点分别对应的三维检测数据,确定待检测图像中包括的多个候选三维检测框信息。其中,每个像素点的三维检测数据对应一个候选三维检测框信息,候选三维检测框信息包括候选三维检测框的位置信息和尺寸信息。In S1062, based on the three-dimensional detection data corresponding to the multiple pixels, multiple candidate three-dimensional detection frame information included in the image to be detected can be determined, wherein the three-dimensional detection data of each pixel corresponds to a candidate three-dimensional detection frame information, and the candidate three-dimensional detection frame information includes the position information and size information of the candidate three-dimensional detection frame.

示例性的,在基于多个像素点分别对应的三维检测数据,确定待检测图像中包括的多个候选三维检测框信息时,可以将确定的目标对象的中心点的位置信息,通过待检测图像对应的采集设备的内参矩阵,还原至真实场景中,生成目标对象对应的候选三维检测框信息的中心点的在真实场景中的位置信息。Exemplarily, when determining multiple candidate three-dimensional detection frame information included in the image to be detected based on the three-dimensional detection data corresponding to multiple pixel points, the position information of the center point of the determined target object can be restored to the real scene through the intrinsic parameter matrix of the acquisition device corresponding to the image to be detected, so as to generate the position information of the center point of the candidate three-dimensional detection frame information corresponding to the target object in the real scene.
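A sketch of this back-projection under an assumed pinhole camera model (the function name and the example intrinsics are illustrative):

```python
import numpy as np

def unproject(u: float, v: float, depth: float, K: np.ndarray) -> np.ndarray:
    """Restore a predicted image-plane center (u, v) with predicted depth
    to a 3D point in the camera frame using the intrinsic matrix K:
    X = depth * K^-1 @ [u, v, 1]^T (pinhole model assumed)."""
    pixel = np.array([u, v, 1.0])
    return depth * (np.linalg.inv(K) @ pixel)

# Illustrative intrinsics; fx, fy, cx, cy would come from the actual
# acquisition device that captured the image to be detected.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
print(unproject(700.0, 400.0, 10.0, K))  # [ 0.6  0.4 10. ]
```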

在S1063中,可以针对每个候选三维检测框信息,生成对应在鸟瞰图中的投影框信息。该鸟瞰图中包括多个投影框信息,投影框信息包括投影框的位置和尺寸。In S1063, for each candidate 3D detection frame information, corresponding projection frame information in the bird's-eye view may be generated. The bird's-eye view includes a plurality of projection frame information, and the projection frame information includes the position and size of the projection frame.

在S1064中,基于多个投影框信息,确定待检测图像中包括的至少一个目标对象的三维检测结果,包括:In S1064, based on the multiple projection frame information, a three-dimensional detection result of at least one target object included in the image to be detected is determined, including:

Step 1: based on the confidence corresponding to the target category indicated by each candidate three-dimensional detection frame information and the centrality, determine the target confidence of the projection frame corresponding to the candidate three-dimensional detection frame; wherein the centrality characterizes the closeness between the pixel point corresponding to the candidate three-dimensional detection frame and the center point of the corresponding object to be detected;

步骤二、基于各个投影框分别对应的目标置信度,确定待检测图像中包括的至少一个目标对象的三维检测结果。Step 2: Based on the target confidences corresponding to the projection frames, determine the three-dimensional detection result of at least one target object included in the image to be detected.

In step 1, the centrality characterizes the closeness between the pixel point corresponding to a candidate three-dimensional detection frame and the center point of the corresponding object to be detected. The larger the centrality, the closer the pixel point is to the center of the object to be detected, the more reliable its feature information, and the more reliable the three-dimensional detection frame information generated for it; conversely, the smaller the centrality, the farther the pixel point is from the center, the less reliable its feature information, and the less reliable the generated three-dimensional detection frame information. Therefore, the centrality of the pixel points indicated by the three-dimensional detection results can be used to filter out background points, avoiding low-quality predictions from pixel points far from the center of the target object and improving detection efficiency.

Therefore, the target confidence of the projection frame corresponding to each candidate three-dimensional detection frame can be determined based on the confidence corresponding to the target category indicated by the candidate three-dimensional detection frame information and the centrality; the target confidence is then used to determine, more accurately, the three-dimensional detection result of at least one target object included in the image to be detected.

Exemplarily, the confidence corresponding to the target category indicated by each candidate three-dimensional detection frame information may be multiplied by the centrality to obtain the target confidence of the projection frame corresponding to the candidate three-dimensional detection frame.

在步骤二,可以使用非最大值抑制(Non-Maximum Suppression,NMS)的方式,基于各个投影框分别对应的目标置信度,确定待检测图像中包括的至少一个目标对象的三维检测结果。In step 2, a non-maximum suppression (NMS) method may be used to determine a three-dimensional detection result of at least one target object included in the image to be detected based on target confidences corresponding to each projection frame.
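Steps one and two can be sketched as follows; BEV projection boxes are simplified here to axis-aligned rectangles (a real implementation would typically use rotated-box IoU), and all names and thresholds are assumptions:

```python
import numpy as np

def bev_nms(boxes, cls_conf, centrality, iou_thr=0.5):
    """Step one: target confidence = class confidence * centrality.
    Step two: greedy non-maximum suppression over the bird's-eye-view
    projection boxes (x1, y1, x2, y2). Returns indices of kept boxes."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(cls_conf) * np.asarray(centrality)
    order = scores.argsort()[::-1]          # highest target confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection-over-union between the kept box and the rest.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]        # suppress heavy overlaps
    return keep
```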

在一种可选实施方式中,还可以设置三维检测结果的第一数量阈值,在利用NMS方式,确定待检测图像中包括的至少一个目标对象的三维检测结果之后,在得到的三维检测结果的数量大于设置的第一数量阈值时,可以根据目标置信度,对得到的三维检测结果进行筛选。In an optional embodiment, a first quantity threshold of the three-dimensional detection results can also be set. After using the NMS method to determine the three-dimensional detection results of at least one target object included in the image to be detected, when the number of obtained three-dimensional detection results is greater than the set first quantity threshold, the obtained three-dimensional detection results can be screened according to the target confidence.

Referring to FIG. 6, which is a schematic flowchart of the driving control method provided by an embodiment of the present disclosure, the method includes:

S601,获取行驶装置在行驶过程中采集的道路图像;S601, acquiring a road image collected by the driving device during driving;

S602,利用上述实施例所述的神经网络训练方法训练得到的目标神经网络,对道路图像进行检测,得到道路图像中包括的目标对象的目标三维位姿数据;S602, using the target neural network trained by the neural network training method described in the above embodiment to detect the road image, and obtain target three-dimensional position data of the target object included in the road image;

S603,基于道路图像中包括的目标对象的目标三维位姿数据,控制行驶装置。S603: Control the traveling device based on the target three-dimensional position data of the target object included in the road image.

Exemplarily, the driving device may be an autonomous vehicle, a vehicle equipped with an Advanced Driving Assistance System (ADAS), a robot, or the like. The road image may be an image collected in real time by the driving device while driving. The target object may be any object that can appear on the road, for example, an animal or a pedestrian on the road, or another vehicle (including motor vehicles and non-motor vehicles) on the road.

其中,在控制行驶装置时,可以控制行驶装置加速、减速、转向、制动等,或者可以播放语音提示信息,以提示驾驶员控制行驶装置加速、减速、转向、制动等。When controlling the traveling device, the traveling device may be controlled to accelerate, decelerate, turn, brake, etc., or a voice prompt message may be played to prompt the driver to control the traveling device to accelerate, decelerate, turn, brake, etc.

本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的具体执行顺序应当以其功能和可能的内在逻辑确定。Those skilled in the art will appreciate that, in the above method of specific implementation, the order in which the steps are written does not imply a strict execution order and does not constitute any limitation on the implementation process. The specific execution order of the steps should be determined by their functions and possible internal logic.

基于相同的构思,本公开实施例还提供了一种神经网络训练装置,参见图7所示,为本公开实施例提供的神经网络训练装置的架构示意图,包括第一获取模块701、第一确定模块702、第二确定模块703、训练模块704,具体的:Based on the same concept, the embodiment of the present disclosure further provides a neural network training device. Referring to FIG. 7 , it is a schematic diagram of the architecture of the neural network training device provided by the embodiment of the present disclosure, including a first acquisition module 701, a first determination module 702, a second determination module 703, and a training module 704. Specifically:

第一获取模块701,用于获取训练样本;A first acquisition module 701 is used to acquire training samples;

第一确定模块702,用于基于所述训练样本中待检测对象的二维标注框的尺度,确定所述二维标注框内每个像素点匹配的目标神经网络中的子网络结构;其中,不同子网络结构用于对所述训练样本中不同尺度的所述二维标注框内的像素点进行特征提取;A first determination module 702 is used to determine, based on the scale of the two-dimensional annotation box of the object to be detected in the training sample, a sub-network structure in the target neural network that matches each pixel point in the two-dimensional annotation box; wherein different sub-network structures are used to extract features of pixel points in the two-dimensional annotation box of different scales in the training sample;

第二确定模块703,用于确定所述训练样本中所述二维标注框内的至少一个像素点对应的、所述待检测对象的三维标注数据;A second determination module 703 is used to determine the three-dimensional annotation data of the object to be detected corresponding to at least one pixel point in the two-dimensional annotation box in the training sample;

训练模块704,用于基于各个所述子网络结构分别对应的、具有所述三维标注数据的至少一个像素点,训练所述目标神经网络中的每个所述子网络结构,得到训练后的目标神经网络。The training module 704 is used to train each of the sub-network structures in the target neural network based on at least one pixel point having the three-dimensional annotation data corresponding to each of the sub-network structures, so as to obtain a trained target neural network.

一种可能的实施方式中,所述第一确定模块702,在基于所述训练样本中待检测对象的二维标注框的尺度,确定所述二维标注框内每个像素点匹配的子网络结构时,用于:In a possible implementation manner, the first determination module 702, when determining the subnetwork structure matching each pixel point in the two-dimensional annotation box based on the scale of the two-dimensional annotation box of the object to be detected in the training sample, is used to:

针对所述二维标注框内的每个像素点,基于所述二维标注框的尺度,确定所述像素点与所述二维标注框的每条边之间的最长第一距离;For each pixel point in the two-dimensional annotation box, based on the scale of the two-dimensional annotation box, determine a first longest distance between the pixel point and each edge of the two-dimensional annotation box;

基于所述最长第一距离、和每个子网络结构对应的预设的距离范围,确定与所述像素点匹配的子网络结构。Based on the longest first distance and a preset distance range corresponding to each sub-network structure, a sub-network structure matching the pixel point is determined.
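A sketch of this matching rule (all names and the per-level distance ranges are illustrative assumptions; the disclosure does not fix the ranges):

```python
def match_subnetwork(px, py, box,
                     ranges=((0, 64), (64, 128), (128, 256),
                             (256, 512), (512, float("inf")))):
    """Match a pixel (px, py) inside the 2D annotation box
    (x1, y1, x2, y2) to a sub-network structure index by comparing the
    longest first distance against each level's preset distance range."""
    x1, y1, x2, y2 = box
    # Longest first distance: the maximum of the distances from the
    # pixel point to the four sides of the two-dimensional annotation box.
    longest = max(px - x1, x2 - px, py - y1, y2 - py)
    for level, (lo, hi) in enumerate(ranges):
        if lo <= longest < hi:
            return level
    return len(ranges) - 1

print(match_subnetwork(100, 80, (0, 0, 200, 160)))  # 1 (longest distance 100)
```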

一种可能的实施方式中,所述第二确定模块703,在确定所述训练样本中所述二维标注框内的至少一个像素点对应的、所述待检测对象的三维标注数据时,用于:In a possible implementation manner, the second determination module 703, when determining the three-dimensional annotation data of the object to be detected corresponding to at least one pixel point in the two-dimensional annotation box in the training sample, is used to:

基于与所述二维标注框内像素点分别匹配的所述子网络结构,确定所述训练样本中所述二维标注框内属于前景点的目标像素点;Determining target pixel points in the two-dimensional annotation box in the training sample that belong to foreground points based on the sub-network structures that are respectively matched with the pixel points in the two-dimensional annotation box;

确定所述目标像素点对应的、所述待检测对象的所述三维标注数据。Determine the three-dimensional annotation data of the object to be detected corresponding to the target pixel point.

一种可能的实施方式中,所述第二确定模块703,在基于与所述二维标注框内像素点分别匹配的所述子网络结构,确定所述训练样本中所述二维标注框内属于前景点的目标像素点时,用于:In a possible implementation manner, the second determination module 703, when determining the target pixel points belonging to the foreground point in the two-dimensional annotation box in the training sample based on the sub-network structure respectively matched with the pixel points in the two-dimensional annotation box, is used to:

针对所述二维标注框内的每个像素点,确定所述像素点与所述二维标注框内所述待检测对象的中心点之间的第二距离;For each pixel point in the two-dimensional annotation frame, determining a second distance between the pixel point and a center point of the object to be detected in the two-dimensional annotation frame;

基于与所述像素点匹配的所述子网络结构的步幅和预设的半径参数,确定所述像素点对应的距离阈值;Determining a distance threshold corresponding to the pixel point based on a stride of the subnetwork structure matched with the pixel point and a preset radius parameter;

在所述第二距离小于所述距离阈值的情况下,确定所述像素点为所述目标像素点。When the second distance is smaller than the distance threshold, the pixel point is determined to be the target pixel point.

一种可能的实施方式中,所述三维标注数据包括以下数据中的至少一种:In a possible implementation manner, the three-dimensional annotation data includes at least one of the following data:

用于表征像素点与对应的待检测对象的中心点之间偏差的偏移量、像素点对应的待检测对象的中心点的深度、三维检测框的尺寸、三维检测框的朝向、朝向类别、待检测对象的速度、用于表征像素点与对应的待检测对象的中心点之间相近程度的中心度、待检测对象的目标类别、表征待检测对象状态的属性类别。The offset used to characterize the deviation between the pixel point and the corresponding center point of the object to be detected, the depth of the center point of the object to be detected corresponding to the pixel point, the size of the three-dimensional detection frame, the orientation of the three-dimensional detection frame, the orientation category, the speed of the object to be detected, the centrality used to characterize the degree of proximity between the pixel point and the corresponding center point of the object to be detected, the target category of the object to be detected, and the attribute category characterizing the state of the object to be detected.

一种可能的实施方式中,在基于所述训练样本中待检测对象的二维标注框的尺度,确定所述二维标注框内每个像素点匹配的目标神经网络中的子网络结构之前,还包括:第三确定模块705,用于:In a possible implementation manner, before determining the sub-network structure in the target neural network matching each pixel point in the two-dimensional annotation box based on the scale of the two-dimensional annotation box of the object to be detected in the training sample, the method further includes: a third determination module 705, which is used to:

在所述训练样本中存在同一像素点位于多个二维标注框内的情况下,确定该像素点分别与每个所述二维标注框中待检测对象的中心点之间的第三距离;When there is a same pixel point in the training sample that is located in multiple two-dimensional annotation boxes, determining a third distance between the pixel point and a center point of the object to be detected in each of the two-dimensional annotation boxes;

将对应所述第三距离最小的二维标注框,作为该像素点对应的二维标注框。The two-dimensional annotation box corresponding to the smallest third distance is used as the two-dimensional annotation box corresponding to the pixel point.

一种可能的实施方式中,所述目标神经网络中每个子网络结构对应一个利用所述样本数据训练得到的回归指标,所述回归指标用于对与所述子网络结构相连的检测头网络输出的三维预测数据中的预测回归数据进行倍数调整;In a possible implementation, each sub-network structure in the target neural network corresponds to a regression index trained using the sample data, and the regression index is used to perform a multiple adjustment on the predicted regression data in the three-dimensional predicted data output by the detection head network connected to the sub-network structure;

其中,所述检测头网络包括分类网络和回归网络,所述检测头网络用于基于每个子网络结构输出的特征图,确定所述子网络结构对应的三维预测数据,所述三维预测数据包括所述分类网络输出的预测类别数据、和所述回归网络输出的预测回归数据。Among them, the detection head network includes a classification network and a regression network. The detection head network is used to determine the three-dimensional prediction data corresponding to the sub-network structure based on the feature map output by each sub-network structure. The three-dimensional prediction data includes the predicted category data output by the classification network and the predicted regression data output by the regression network.

一种可能的实施方式中,在得到训练后的目标神经网络之后,还包括:预测模块706,用于:In a possible implementation, after obtaining the trained target neural network, the method further includes: a prediction module 706 for:

获取待检测图像;Acquire the image to be detected;

利用训练后的所述目标神经网络对所述待检测图像进行检测,确定所述待检测图像中包括的至少一个目标对象的三维检测结果。The trained target neural network is used to detect the image to be detected, and a three-dimensional detection result of at least one target object included in the image to be detected is determined.

一种可能的实施方式中,所述预测模块706,在利用所述目标神经网络对所述待检测图像进行检测,确定所述待检测图像中包括的至少一个目标对象的三维检测结果时,用于:In a possible implementation manner, the prediction module 706, when detecting the image to be detected using the target neural network to determine a three-dimensional detection result of at least one target object included in the image to be detected, is configured to:

利用所述目标神经网络对所述待检测图像进行检测,生成所述待检测图像中多个像素点对应的三维检测数据;Detecting the image to be detected using the target neural network to generate three-dimensional detection data corresponding to a plurality of pixel points in the image to be detected;

基于多个像素点分别对应的三维检测数据,确定所述待检测图像中包括的多个候选三维检测框信息;Determine, based on the three-dimensional detection data respectively corresponding to the plurality of pixel points, a plurality of candidate three-dimensional detection frame information included in the image to be detected;

基于所述多个候选三维检测框信息,确定多个候选三维检测框在所述待检测图像对应的鸟瞰图中的投影框信息;Based on the information of the multiple candidate three-dimensional detection frames, determining projection frame information of the multiple candidate three-dimensional detection frames in the bird's-eye view corresponding to the image to be detected;

基于多个所述投影框信息,确定所述待检测图像中包括的至少一个目标对象的三维检测结果。Based on the plurality of projection frame information, a three-dimensional detection result of at least one target object included in the image to be detected is determined.

一种可能的实施方式中,所述预测模块706,在基于多个所述投影框信息,确定所述待检测图像中包括的至少一个目标对象的三维检测结果时,用于:In a possible implementation manner, the prediction module 706, when determining the three-dimensional detection result of at least one target object included in the image to be detected based on the plurality of projection frame information, is configured to:

基于每个候选三维检测框信息指示的目标类别对应的置信度和中心度,确定与所述候选三维检测框对应的投影框的目标置信度;其中,所述中心度用于表征所述候选三维检测框对应的像素点、与对应的待检测对象的中心点之间的相近程度;Based on the confidence and center degree corresponding to the target category indicated by each candidate 3D detection frame information, determine the target confidence degree of the projection frame corresponding to the candidate 3D detection frame; wherein the center degree is used to characterize the degree of proximity between the pixel point corresponding to the candidate 3D detection frame and the center point of the corresponding object to be detected;

基于各个投影框分别对应的目标置信度,确定所述待检测图像中包括的至少一个目标对象的三维检测结果。Based on the target confidences respectively corresponding to the projection frames, a three-dimensional detection result of at least one target object included in the image to be detected is determined.

基于相同的构思,本公开实施例还提供了一种行驶控制装置,参见图8所示,为本公开实施例提供的行驶控制装置的架构示意图,包括第二获取模块801、检测模块802、控制模块803,具体的:Based on the same concept, the embodiment of the present disclosure further provides a driving control device. Referring to FIG. 8 , it is a schematic diagram of the architecture of the driving control device provided by the embodiment of the present disclosure, including a second acquisition module 801, a detection module 802, and a control module 803. Specifically:

第二获取模块801,用于获取行驶装置在行驶过程中采集的道路图像;The second acquisition module 801 is used to acquire the road image collected by the driving device during the driving process;

检测模块802,用于利用本公开提出的神经网络训练方法训练得到的目标神经网络,对所述道路图像进行检测,得到所述道路图像中包括的目标对象的目标三维位姿数据;A detection module 802 is used to detect the road image using a target neural network trained by the neural network training method proposed in the present disclosure to obtain target three-dimensional position data of a target object included in the road image;

控制模块803,用于基于所述道路图像中包括的目标对象的目标三维位姿数据,控制所述行驶装置。The control module 803 is used to control the driving device based on the target three-dimensional position data of the target object included in the road image.

In some embodiments, the functions possessed by, or the modules contained in, the device provided by the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments. For specific implementation, refer to the description of the above method embodiments; for brevity, details are not repeated here.

基于同一技术构思,本公开实施例还提供了一种电子设备。参照图9所示,为本公开实施例提供的电子设备的结构示意图,包括处理器901、存储器902、和总线903。其中,存储器902用于存储执行指令,包括内存9021和外部存储器9022;这里的内存9021也称内存储器,用于暂时存放处理器901中的运算数据,以及与硬盘等外部存储器9022交换的数据,处理器901通过内存9021与外部存储器9022进行数据交换,当电子设备900运行时,处理器901与存储器902之间通过总线903通信,使得处理器901在执行以下指令:Based on the same technical concept, an embodiment of the present disclosure also provides an electronic device. Referring to FIG9 , a schematic diagram of the structure of an electronic device provided in an embodiment of the present disclosure includes a processor 901, a memory 902, and a bus 903. Among them, the memory 902 is used to store execution instructions, including a memory 9021 and an external memory 9022; the memory 9021 here is also called an internal memory, which is used to temporarily store the calculation data in the processor 901, and the data exchanged with the external memory 9022 such as a hard disk. The processor 901 exchanges data with the external memory 9022 through the memory 9021. When the electronic device 900 is running, the processor 901 communicates with the memory 902 through the bus 903, so that the processor 901 executes the following instructions:

获取训练样本;Get training samples;

基于所述训练样本中待检测对象的二维标注框的尺度,确定所述二维标注框内每个像素点匹配的目标神经网络中的子网络结构;其中,不同子网络结构用于对所述训练样本中不同尺度的所述二维标注框内的像素点进行特征提取;Based on the scale of the two-dimensional annotation box of the object to be detected in the training sample, determine the sub-network structure in the target neural network that matches each pixel point in the two-dimensional annotation box; wherein different sub-network structures are used to extract features of pixel points in the two-dimensional annotation box of different scales in the training sample;

确定所述训练样本中所述二维标注框内的至少一个像素点对应的、所述待检测对象的三维标注数据;Determine three-dimensional annotation data of the object to be detected corresponding to at least one pixel point within the two-dimensional annotation box in the training sample;

基于各个所述子网络结构分别对应的、具有所述三维标注数据的至少一个像素点,训练所述目标神经网络中的每个所述子网络结构,得到训练后的目标神经网络。Based on at least one pixel point having the three-dimensional annotation data corresponding to each of the sub-network structures, each of the sub-network structures in the target neural network is trained to obtain a trained target neural network.

基于同一技术构思,本公开实施例还提供了一种电子设备。参照图10所示,为本公开实施例提供的电子设备的结构示意图,包括处理器1001、存储器1002、和总线1003。其中,存储器1002用于存储执行指令,包括内存10021和外部存储器10022;这里的内存10021也称内存储器,用于暂时存放处理器1001中的运算数据,以及与硬盘等外部存储器10022交换的数据,处理器1001通过内存10021与外部存储器10022进行数据交换,当电子设备1000运行时,处理器1001与存储器1002之间通过总线1003通信,使得处理器1001在执行以下指令:Based on the same technical concept, an embodiment of the present disclosure also provides an electronic device. Referring to FIG10 , a schematic diagram of the structure of the electronic device provided in the embodiment of the present disclosure includes a processor 1001, a memory 1002, and a bus 1003. Among them, the memory 1002 is used to store execution instructions, including a memory 10021 and an external memory 10022; the memory 10021 here is also called an internal memory, which is used to temporarily store the calculation data in the processor 1001, and the data exchanged with the external memory 10022 such as a hard disk. The processor 1001 exchanges data with the external memory 10022 through the memory 10021. When the electronic device 1000 is running, the processor 1001 communicates with the memory 1002 through the bus 1003, so that the processor 1001 executes the following instructions:

获取行驶装置在行驶过程中采集的道路图像;Acquire road images collected by the driving device during driving;

利用本公开提出的所述的神经网络训练方法训练得到的目标神经网络,对所述道路图像进行检测,得到所述道路图像中包括的目标对象的目标三维位姿数据;Using the target neural network trained by the neural network training method proposed in the present disclosure, the road image is detected to obtain target three-dimensional position data of the target object included in the road image;

基于所述道路图像中包括的目标对象的目标三维位姿数据,控制所述行驶装置。The travel device is controlled based on target three-dimensional position data of the target object included in the road image.

此外,本公开实施例还提供一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行上述方法实施例中所述的神经网络训练方法、行驶控制方法的步骤。其中,该存储介质可以是易失性或非易失的计算机可读取存储介质。In addition, the embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the neural network training method and the driving control method described in the above method embodiment are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.

本公开实施例还提供一种计算机程序产品,该计算机程序产品承载有程序代码,所述程序代码包括的指令可用于执行上述方法实施例中所述的神经网络训练方法、行驶控制方法的步骤,具体可参见上述方法实施例,在此不再赘述。The disclosed embodiments also provide a computer program product, which carries a program code. The program code includes instructions that can be used to execute the steps of the neural network training method and the driving control method described in the above method embodiments. For details, please refer to the above method embodiments, which will not be repeated here.

其中,上述计算机程序产品可以具体通过硬件、软件或其结合的方式实现。在一个可选实施例中,所述计算机程序产品具体体现为计算机存储介质,在另一个可选实施例中,计算机程序产品具体体现为软件产品,例如软件开发包(Software Development Kit,SDK)等等。The computer program product may be implemented in hardware, software or a combination thereof. In one optional embodiment, the computer program product is implemented as a computer storage medium. In another optional embodiment, the computer program product is implemented as a software product, such as a software development kit (SDK).

所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统和装置的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。在本公开所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,又例如,多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些通信接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。Those skilled in the art can clearly understand that, for the convenience and simplicity of description, the specific working process of the system and device described above can refer to the corresponding process in the aforementioned method embodiment, and will not be repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, device and method can be implemented in other ways. The device embodiments described above are merely schematic. For example, the division of the units is only a logical function division. There may be other division methods in actual implementation. For example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some communication interfaces, and the indirect coupling or communication connection of the device or unit can be electrical, mechanical or other forms.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个处理器可执行的非易失的计算机可读取存储介质中。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本公开各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-OnlyMemory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium that is executable by a processor. Based on this understanding, the technical solution of the present disclosure, or the part that contributes to the prior art or the part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for a computer device (which can be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present disclosure. The aforementioned storage medium includes: various media that can store program codes, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

以上仅为本公开的具体实施方式,但本公开的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本公开揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本公开的保护范围之内。因此,本公开的保护范围应以权利要求的保护范围为准。The above are only specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any technician familiar with the technical field can easily think of changes or substitutions within the technical scope disclosed in the present disclosure, which should be included in the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure should be based on the protection scope of the claims.

Claims (14)

1. A neural network training method, comprising:
Obtaining a training sample;
Determining a sub-network structure, in a target neural network, matched with each pixel point in a two-dimensional labeling frame based on the scale of the two-dimensional labeling frame of an object to be detected in the training sample; wherein different sub-network structures are used for extracting features of pixel points in the two-dimensional labeling frames of different scales in the training sample;
Determining three-dimensional labeling data of the object to be detected, which corresponds to at least one pixel point in the two-dimensional labeling frame, in the training sample;
Training each sub-network structure in the target neural network based on at least one pixel point which corresponds to each sub-network structure and has the three-dimensional labeling data, so as to obtain a trained target neural network;
The determining the three-dimensional labeling data of the object to be detected, which corresponds to at least one pixel point in the two-dimensional labeling frame in the training sample, comprises the following steps:
Determining target pixel points belonging to foreground points in the two-dimensional labeling frame in the training sample based on the sub-network structures respectively matched with the pixel points in the two-dimensional labeling frame;
and determining the three-dimensional annotation data of the object to be detected, which corresponds to the target pixel point.
2. The method according to claim 1, wherein the determining the sub-network structure for matching each pixel point in the two-dimensional labeling frame based on the scale of the two-dimensional labeling frame of the object to be detected in the training sample comprises:
Determining the longest first distance between each pixel point and each side of the two-dimensional labeling frame based on the scale of the two-dimensional labeling frame aiming at each pixel point in the two-dimensional labeling frame;
And determining the sub-network structure matched with the pixel point based on the longest first distance and a preset distance range corresponding to each sub-network structure.
3. The method according to claim 1, wherein the determining, based on the sub-network structures respectively matched with the pixels in the two-dimensional labeling frame, the target pixels belonging to the foreground point in the two-dimensional labeling frame in the training sample includes:
determining a second distance between each pixel point in the two-dimensional labeling frame and a center point of the object to be detected in the two-dimensional labeling frame;
determining a distance threshold corresponding to the pixel point based on the stride of the sub-network structure matched with the pixel point and a preset radius parameter;
and under the condition that the second distance is smaller than the distance threshold value, determining the pixel point as the target pixel point.
4. A method according to any one of claims 1 to 3, wherein the three-dimensional annotation data comprises at least one of:
an offset representing the deviation between a pixel point and the center point of the corresponding object to be detected, the depth of the center point of the object to be detected corresponding to the pixel point, the size of a three-dimensional detection frame, the orientation of the three-dimensional detection frame, an orientation category, the speed of the object to be detected, a centrality representing the closeness between the pixel point and the center point of the corresponding object to be detected, a target category of the object to be detected, and an attribute category representing the state of the object to be detected.
5. A method according to any one of claims 1-3, further comprising, prior to determining a sub-network structure in the target neural network for which each pixel point within the two-dimensional labeling frame matches based on the dimensions of the two-dimensional labeling frame of the object to be detected in the training sample:
Under the condition that the same pixel point exists in a plurality of two-dimensional labeling frames in the training sample, determining a third distance between the pixel point and the center point of an object to be detected in each two-dimensional labeling frame;
and taking the two-dimensional labeling frame with the minimum corresponding third distance as the two-dimensional labeling frame corresponding to the pixel point.
6. A method according to any one of claims 1 to 3, wherein each sub-network structure in the target neural network corresponds to a regression index trained by the training sample, and the regression index is used for performing multiple adjustment on prediction regression data in three-dimensional prediction data output by a detection head network connected with the sub-network structure;
The detection head network comprises a classification network and a regression network, wherein the detection head network is used for determining three-dimensional prediction data corresponding to the sub-network structure based on a feature map output by each sub-network structure, and the three-dimensional prediction data comprises prediction category data output by the classification network and prediction regression data output by the regression network.
7. A method according to any one of claims 1-3, further comprising, after obtaining the trained target neural network:
Acquiring an image to be detected;
And detecting the image to be detected by using the trained target neural network, and determining a three-dimensional detection result of at least one target object included in the image to be detected.
8. The method according to claim 7, wherein detecting the image to be detected using the target neural network, determining a three-dimensional detection result of at least one target object included in the image to be detected, includes:
Detecting the image to be detected by using the target neural network, and generating three-dimensional detection data corresponding to a plurality of pixel points in the image to be detected;
Determining a plurality of candidate three-dimensional detection frame information included in the image to be detected based on three-dimensional detection data corresponding to the pixel points respectively;
determining projection frame information of a plurality of candidate three-dimensional detection frames in a bird's eye view corresponding to the image to be detected based on the plurality of candidate three-dimensional detection frame information;
And determining a three-dimensional detection result of at least one target object included in the image to be detected based on the plurality of projection frame information.
9. The method according to claim 8, wherein the determining a three-dimensional detection result of at least one target object included in the image to be detected based on a plurality of the projection frame information includes:
Determining the target confidence of a projection frame corresponding to each candidate three-dimensional detection frame based on the confidence and the centrality corresponding to the target category indicated by the information of each candidate three-dimensional detection frame; the centrality is used for representing the similarity degree between the pixel points corresponding to the candidate three-dimensional detection frames and the central points of the corresponding objects to be detected;
and determining a three-dimensional detection result of at least one target object included in the image to be detected based on the target confidence degrees respectively corresponding to the projection frames.
10. A running control method, characterized by comprising:
Acquiring a road image acquired by a running device in the running process;
Detecting the road image by using the target neural network trained by the neural network training method according to any one of claims 1 to 9 to obtain target three-dimensional pose data of a target object included in the road image;
And controlling the driving device based on the target three-dimensional pose data of the target object included in the road image.
11. A neural network training device, comprising:
The first acquisition module is used for acquiring training samples;
The first determining module is used for determining a sub-network structure, in a target neural network, matched with each pixel point in a two-dimensional labeling frame based on the scale of the two-dimensional labeling frame of an object to be detected in the training sample; wherein different sub-network structures are used for extracting features of pixel points in the two-dimensional labeling frames of different scales in the training sample;
the second determining module is used for determining three-dimensional annotation data of the object to be detected, which corresponds to at least one pixel point in the two-dimensional annotation frame in the training sample;
the training module is used for training each sub-network structure in the target neural network based on at least one pixel point which corresponds to each sub-network structure and has the three-dimensional annotation data, so as to obtain a trained target neural network;
The second determining module is configured to, when determining three-dimensional labeling data of the object to be detected, where the three-dimensional labeling data corresponds to at least one pixel point in the two-dimensional labeling frame, where the three-dimensional labeling data corresponds to at least one pixel point in the training sample: determining target pixel points belonging to foreground points in the two-dimensional labeling frame in the training sample based on the sub-network structures respectively matched with the pixel points in the two-dimensional labeling frame; and determining the three-dimensional annotation data of the object to be detected, which corresponds to the target pixel point.
12. A driving control device, characterized by comprising:
a second acquisition module, used for acquiring a road image captured by a driving device while driving;
a detection module, used for detecting the road image using the target neural network trained by the neural network training method according to any one of claims 1 to 9 to obtain target three-dimensional pose data of a target object included in the road image; and
a control module, used for controlling the driving device based on the target three-dimensional pose data of the target object included in the road image.
13. An electronic device, comprising: a processor, a memory, and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device runs, and the machine-readable instructions, when executed by the processor, performing the steps of the neural network training method according to any one of claims 1 to 9 or the steps of the driving control method according to claim 10.
14. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium which, when executed by a processor, performs the steps of the neural network training method according to any one of claims 1 to 9 or the steps of the driving control method according to claim 10.
CN202110224337.0A 2021-02-26 2021-02-26 Neural network training, driving control method and device Active CN112926461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110224337.0A CN112926461B (en) 2021-02-26 2021-02-26 Neural network training, driving control method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110224337.0A CN112926461B (en) 2021-02-26 2021-02-26 Neural network training, driving control method and device

Publications (2)

Publication Number Publication Date
CN112926461A (en) 2021-06-08
CN112926461B (en) 2024-04-19

Family

ID=76172708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110224337.0A Active CN112926461B (en) 2021-02-26 2021-02-26 Neural network training, driving control method and device

Country Status (1)

Country Link
CN (1) CN112926461B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469042A (en) * 2021-06-30 2021-10-01 上海商汤临港智能科技有限公司 Truth value data determination, neural network training and driving control method and device
CN113610967B (en) * 2021-08-13 2024-03-26 北京市商汤科技开发有限公司 Three-dimensional point detection method, three-dimensional point detection device, electronic equipment and storage medium
CN114882456A (en) * 2022-03-29 2022-08-09 上海商汤临港智能科技有限公司 Sample generation method, neural network training method, data processing method and device
CN115661577B (en) * 2022-11-01 2024-04-16 吉咖智能机器人有限公司 Method, apparatus and computer readable storage medium for object detection


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523552A (en) * 2018-10-24 2019-03-26 青岛智能产业技术研究院 Three-dimension object detection method based on cone point cloud
WO2020243962A1 (en) * 2019-06-06 2020-12-10 深圳市大疆创新科技有限公司 Object detection method, electronic device and mobile platform
DE102019117559A1 (en) * 2019-06-28 2020-12-31 Connaught Electronics Ltd. Method and system for merging two-dimensional semantic information from images with a three-dimensional point cloud
CN110517278A (en) * 2019-08-07 2019-11-29 北京旷视科技有限公司 Image segmentation and image segmentation network training method, device and computer equipment
CN111783986A (en) * 2020-07-02 2020-10-16 清华大学 Network training method and device, attitude prediction method and device
CN112241731A (en) * 2020-12-03 2021-01-19 北京沃东天骏信息技术有限公司 Attitude determination method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112926461A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN112926461B (en) Neural network training, driving control method and device
US12165327B2 (en) Image processing method, apparatus, and device, and storage medium
CN110869936B (en) Method and system for distributed learning and adaptation in an autonomous vehicle
KR20170140214A (en) Filter specificity as training criterion for neural networks
CN111598065A (en) Depth image acquisition method, living body identification method, apparatus, circuit, and medium
WO2024077935A1 (en) Visual-slam-based vehicle positioning method and apparatus
US11308324B2 (en) Object detecting system for detecting object by using hierarchical pyramid and object detecting method thereof
CN112215840B (en) Image detection and driving control method and device, electronic equipment and storage medium
CN118506317A (en) Traffic sign recognition method, device and equipment
CN112699800A (en) Vehicle searching method and device, storage medium and terminal
CN117132648B (en) Visual positioning method, electronic equipment and computer readable storage medium
CN117831004A (en) Method, device, equipment and medium for detecting obstacle of formula car
CN117789160A (en) Multi-mode fusion target detection method and system based on cluster optimization
CN117115415A (en) Image labeling processing method and system based on big data analysis
CN117115238A (en) A method, electronic device and storage medium for determining posture
CN117409059A (en) Image depth recognition method, electronic device and storage medium
CN115496977B (en) Target detection method and device based on multi-mode sequence data fusion
TWI832302B (en) Method for obtaining depth image , electronic device and computer-readable storage medium
CN116681884B (en) Object detection method and related device
CN117095244B (en) An infrared target recognition method, device, equipment and medium
KR102485099B1 (en) Method for data purification using meta data, and computer program recorded on record-medium for executing method therefor
CN117333827A (en) Depth image acquisition method, electronic device and storage medium
TW202403662A (en) Method for training depth identification model, method for identifying depth of images and related devices
CN118015397A (en) Method and device for determining difficult samples for autonomous driving
CN114463786A (en) Pedestrian detection method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant