Disclosure of Invention
An object of embodiments of the present invention is to provide a semantic recognition method, a terminal and a computer-readable storage medium, so that the terminal can more accurately recognize an object in an environment.
In order to solve the above technical problem, an embodiment of the present invention provides a semantic recognition method, including the following steps: acquiring first image data of a scene; obtaining a first semantic recognition result according to the color image data of the first image data and the first network model; the first network model is obtained by training according to the color image data of the first training image and the semantic recognition result of the first training image; obtaining a second semantic recognition result according to the depth data of the first image data and the second network model; the second network model is obtained by training according to the depth data of the second training image and the semantic recognition result of the second training image; and fusing the first semantic recognition result and the second semantic recognition result to obtain a first fusion recognition result of the first image data.
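For illustration only, the following Python sketch outlines how these steps might be orchestrated; every name is hypothetical and the fusion function is only a placeholder, not the claimed implementation.

```python
# Illustrative orchestration of the claimed steps; every name here is hypothetical.
def semantic_recognition(rgb, depth, first_model, second_model, fuse):
    first_result = first_model(rgb)       # first semantic recognition result (color branch)
    second_result = second_model(depth)   # second semantic recognition result (depth branch)
    return fuse(first_result, second_result)  # first fusion recognition result
```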
An embodiment of the present invention further provides a terminal, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the semantic recognition method as mentioned in the above embodiments.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the semantic recognition method mentioned in the above embodiments.
Compared with the prior art, the terminal performs segmentation recognition on the first image data in two modes and fuses the recognition results of the two modes to determine the final recognition result. Therefore, when one recognition method fails to recognize, or mistakenly recognizes, an object (or part of an object) in the scene where the terminal is located, the other recognition method can compensate for it, making the final recognition result more accurate.
In addition, fusing the first semantic recognition result and the second semantic recognition result to obtain a first fusion recognition result of the first image data specifically includes: mapping the first semantic recognition result to the second semantic recognition result to obtain a third semantic recognition result; determining the semantic recognition result of each first three-dimensional point according to the candidate result set of each first three-dimensional point in the third semantic recognition result; and determining the first fusion recognition result of the first image data according to the semantic recognition result of each first three-dimensional point. In this implementation, the semantic recognition results are fused point by point, which improves the accuracy of the fusion recognition result.
In addition, determining the semantic recognition result of each first three-dimensional point according to the candidate result set of each first three-dimensional point in the third semantic recognition result specifically includes: for each first three-dimensional point in the third semantic recognition result, performing the following operation: selecting one candidate result from the candidate result set of the first three-dimensional point as the semantic recognition result of the first three-dimensional point according to the confidence of the candidate results in that set. In this implementation, the final recognition result is selected based on the confidence of the recognition results, so the accuracy of the recognition result is higher.
In addition, the candidate result set of the first three-dimensional point includes a first candidate result and a second candidate result; the first candidate result is determined according to the first semantic recognition result, and the second candidate result is determined according to the second semantic recognition result. Selecting one candidate result from the candidate result set as the semantic recognition result of the first three-dimensional point according to the confidence of the candidate results in the candidate result set of the first three-dimensional point specifically includes: determining a first confidence of the first candidate result, a second confidence of the second candidate result, a scale factor of the first confidence, and a scale factor of the second confidence; judging whether the product of the first confidence and its scale factor is greater than the product of the second confidence and its scale factor; if yes, taking the first candidate result as the semantic recognition result of the first three-dimensional point; and if not, taking the second candidate result as the semantic recognition result of the first three-dimensional point. In this implementation, different scale factors are set for the first network model and the second network model according to different conditions, so that different network models can be favored in different scenes, which improves the flexibility and universality of the semantic recognition method.
In addition, after the first semantic recognition result and the second semantic recognition result are fused to obtain the first fusion recognition result of the first image data, the semantic recognition method further includes: acquiring a second fusion recognition result of second image data, the second image data being the image data of the frame previous to the first image data; and determining a second fusion recognition result of the first image data according to the first fusion recognition result of the first image data and the second fusion recognition result of the second image data, the second fusion recognition result of the first image data being the final recognition result of the first image data. In this implementation, inter-frame fusion is performed on the recognition results, which makes the final recognition result more accurate.
In addition, the first image data is the image data of an Nth frame image, where N is a positive integer. Determining the second fusion recognition result of the first image data according to the first fusion recognition result of the first image data and the second fusion recognition result of the second image data specifically includes: determining a candidate fusion recognition result set of each first three-dimensional point in the first fusion recognition result of the first image data according to the first fusion recognition result of the first image data and the second fusion recognition result of the second image data; for each first three-dimensional point of the first fusion recognition result of the first image data, selecting one candidate fusion recognition result from the candidate fusion recognition result set of the first three-dimensional point as the fusion recognition result of the first three-dimensional point; and determining the second fusion recognition result of the first image data according to the fusion recognition result of each first three-dimensional point.
In addition, the first image data is the image data of an Nth frame image, where N is a positive integer greater than a preset value. Determining the second fusion recognition result of the first image data according to the first fusion recognition result of the first image data and the second fusion recognition result of the second image data specifically includes: acquiring optical flow information of the first image data and the second image data; for each first three-dimensional point in the first fusion recognition result of the first image data, performing the following operations: judging, according to the optical flow information, whether a second three-dimensional point of the second fusion recognition result and the first three-dimensional point belong to the same object, the second three-dimensional point having the same coordinates as the first three-dimensional point; if yes, taking the fusion recognition result of the corresponding second three-dimensional point as the fusion recognition result of the first three-dimensional point; if not, determining the fusion recognition result of the first three-dimensional point according to the confidence of the fusion recognition result of the corresponding second three-dimensional point and the confidence of the fusion recognition result of the first three-dimensional point; and determining the second fusion recognition result of the first image data according to the fusion recognition result of each first three-dimensional point.
In addition, before operating on each first three-dimensional point in the first fusion recognition result of the first image data, the semantic recognition method further includes: acquiring the confidence of the optical flow information; and determining that the confidence of the optical flow information is greater than a threshold.
In addition, the determining optical flow information of the first image data and the second image data specifically includes: obtaining optical flow information and confidence of the optical flow information according to the first image data, the second image data and the third network model; wherein the parameters in the third network model are determined based on the third training image data, the fourth training image data, and the optical flow information of the third training image data and the fourth training image data.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. However, it will be appreciated by those of ordinary skill in the art that numerous technical details are set forth in the various embodiments of the present invention in order to provide the reader with a better understanding of the present application. Nevertheless, the technical solutions claimed in the present application can be implemented without these technical details and with various changes and modifications based on the following embodiments.
The first embodiment of the invention relates to a semantic recognition method, which is applied to a terminal, such as a robot or other machine equipment. As shown in fig. 1, the semantic recognition method according to the present embodiment includes the following steps:
step 101: first image data of a scene is acquired.
Specifically, the first image data includes color (Red-Green-Blue, RGB) image data and depth (Depth) data. The RGB image data may be collected by a color camera; the Depth data may be collected by a depth camera, or may be calculated from the RGB image data of two or more color cameras. The terminal may align the RGB image data and the Depth data when acquiring them.
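As a hedged illustration only (the file names and the simple resize-based alignment below are assumptions, not part of the method), one frame of RGB and Depth data could be loaded and brought to a common resolution as follows:

```python
import cv2

# Hypothetical file names; a real terminal would read from its camera drivers instead.
rgb = cv2.imread("frame_000_color.png", cv2.IMREAD_COLOR)        # H x W x 3, uint8
depth = cv2.imread("frame_000_depth.png", cv2.IMREAD_UNCHANGED)  # e.g. uint16 depth in mm

# Crude alignment stand-in: resample the depth map to the color resolution.
# A calibrated system would instead use the extrinsics between the two sensors.
if rgb is not None and depth is not None and depth.shape[:2] != rgb.shape[:2]:
    depth = cv2.resize(depth, (rgb.shape[1], rgb.shape[0]),
                       interpolation=cv2.INTER_NEAREST)
```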
Step 102: and obtaining a first semantic recognition result according to the color image data of the first image data and the first network model.
Specifically, the first network model is a network model for performing segmentation recognition on an image, and is obtained by training according to the color image data of the first training image and the semantic recognition result of the first training image. The RGB image data are input into the first network model, which performs semantic segmentation on them to obtain the first semantic recognition result, i.e., an image obtained by semantically segmenting the color image data.
In one example, the first network model is a convolutional neural network model, and the terminal trains the convolutional neural network model with labeled training images to obtain the relevant parameters of the convolutional neural network, so that the trained convolutional neural network model can perform semantic segmentation on a color image.
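A minimal sketch of this idea, assuming PyTorch/torchvision and its DeepLabV3 model as a stand-in for the first network model (the embodiment does not prescribe a specific architecture, and training on labeled images is omitted):

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Stand-in for the first network model; the actual architecture is not prescribed here.
model = deeplabv3_resnet50(weights=None, num_classes=21)
model.eval()

rgb_batch = torch.rand(1, 3, 480, 640)          # dummy normalized RGB frame, N x 3 x H x W

with torch.no_grad():
    logits = model(rgb_batch)["out"]            # N x num_classes x H x W
probs = logits.softmax(dim=1)
labels = probs.argmax(dim=1)                    # per-pixel class index (first semantic result)
confidence = probs.max(dim=1).values            # per-pixel confidence, later used for fusion
```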
Step 103: and obtaining a second semantic recognition result according to the depth data of the first image data and the second network model.
Specifically, the second network model is obtained by training according to the depth data of the second training image and the semantic recognition result of the second training image.
In one example, the second network model includes a conversion sub-model and a recognition sub-model. The terminal inputs the depth data of the first image data into the conversion sub-model, the conversion sub-model obtains point cloud data of the scene according to the depth data, and the point cloud data are input into the recognition sub-model. The recognition sub-model performs segmentation processing on the point cloud data to obtain the corresponding second semantic recognition result, which is the point cloud data after segmentation recognition.
In one example, the point-cloud-based recognition mainly uses a conditional random field to model the three-dimensional features of the point cloud: the point cloud is first down-sampled to generate multi-scale neighborhoods for fast search and extraction of three-dimensional features, and a random forest classifier is then used for classification, thereby achieving semantic segmentation of the point cloud.
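A toy sketch of this pipeline, assuming voxel-grid down-sampling, two hand-crafted per-point features and scikit-learn's RandomForestClassifier; the conditional random field stage and the actual feature set are not specified in the text and are therefore omitted:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def voxel_downsample(points, voxel=0.05):
    """Keep one point per voxel as a crude down-sampling step."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]

def point_features(points):
    """Very crude per-point features: height and distance to the cloud centroid."""
    dist = np.linalg.norm(points - points.mean(axis=0), axis=1, keepdims=True)
    return np.hstack([points[:, 2:3], dist])

# Hypothetical training data: N x 3 points with per-point semantic labels.
train_points = np.random.rand(1000, 3)
train_labels = np.random.randint(0, 3, size=1000)

clf = RandomForestClassifier(n_estimators=50)
clf.fit(point_features(train_points), train_labels)

test_points = voxel_downsample(np.random.rand(500, 3))
pred = clf.predict(point_features(test_points))                     # per-point class
conf = clf.predict_proba(point_features(test_points)).max(axis=1)   # per-point confidence
```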
In one example, the conversion sub-model converts the depth data into three-dimensional point cloud data according to formula a.
In formula a, x represents the abscissa value of a three-dimensional point in the point cloud data, y represents the ordinate value of the three-dimensional point in the point cloud data, Z represents the depth value of the three-dimensional point in the point cloud data, u represents the abscissa value of the corresponding two-dimensional point in the coordinate system of the color image, v represents the ordinate value of the two-dimensional point in the coordinate system of the color image, u0 represents the abscissa value of the center point of the color image, v0 represents the ordinate value of the center point of the color image, and fx and fy are intrinsic parameters of the image sensor, corresponding to the focal length of the image sensor.
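Formula a itself is not reproduced in this text. A plausible reconstruction, assuming the standard pinhole back-projection consistent with the variable definitions above, is:

```latex
% Assumed reconstruction of formula a (pinhole back-projection); the original
% formula is not shown in the text.
\[
x = \frac{(u - u_0)\,Z}{f_x}, \qquad
y = \frac{(v - v_0)\,Z}{f_y}, \qquad
z = Z
\]
```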
In the present embodiment, step 103 is set as a step subsequent to step 102 for clarity of description. However, those skilled in the art will understand that, in practical applications, step 103 may be executed before step 102, or may be executed simultaneously with step 102, and this embodiment is not limited in this respect.
Step 104: and fusing the first semantic recognition result and the second semantic recognition result to obtain a first fusion recognition result of the first image data.
Specifically, the first semantic recognition result is obtained by performing segmentation recognition on the basis of image data, the second semantic recognition result is obtained by performing segmentation recognition on the basis of 3D point cloud data, and the terminal fuses the recognition results obtained by the two methods to obtain a final first fusion recognition result.
According to the above, the terminal performs segmentation recognition on the first image data in two ways, and determines the final recognition result based on the recognition results obtained in the two recognition ways, so that the final recognition result is more accurate compared with a method for determining the recognition result by using only one recognition method.
In one example, the method for the terminal to determine the final first fused recognition result is as follows: the terminal maps the first semantic recognition result to the second semantic recognition result to obtain a third semantic recognition result; determining the semantic recognition result of each first three-dimensional point according to the candidate result set of each first three-dimensional point in the third semantic recognition result; and determining a first fusion recognition result of the first image data according to the semantic recognition result of each first three-dimensional point.
The mapping process of the first semantic recognition result is exemplified below. The first semantic recognition result includes two-dimensional points in the color image data and the two-dimensional recognition results of those points, and the second semantic recognition result includes three-dimensional points in the point cloud data and the three-dimensional recognition results of those points. For a two-dimensional point in the color image data, the terminal maps the two-dimensional point into the second semantic recognition result (namely the point cloud data after semantic segmentation) through the mapping relationship between the two-dimensional image and the point cloud data to obtain a mapped three-dimensional point. If the mapped three-dimensional point coincides with an original three-dimensional point in the point cloud data (i.e., a three-dimensional point constructed from the depth data), the candidate result set of that coinciding three-dimensional point is determined to include its three-dimensional recognition result from the second semantic recognition result (the second candidate result) and the two-dimensional recognition result of the two-dimensional point (the first candidate result). If the mapped three-dimensional point does not coincide with any original three-dimensional point in the point cloud data, the point cloud data are updated to include the mapped three-dimensional point, and the candidate result set of the mapped three-dimensional point is determined to include the two-dimensional recognition result of the two-dimensional point (the first candidate result).
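As an illustrative sketch only (the data structures and the tolerance-based coincidence test are assumptions), the candidate result sets described above could be built as follows:

```python
import numpy as np

def build_candidate_sets(points_3d, labels_3d, conf_3d,
                         mapped_points, labels_2d, conf_2d, tol=1e-3):
    """
    points_3d     : M x 3 original point cloud; labels_3d / conf_3d are its per-point results.
    mapped_points : K x 3 positions of the labelled color pixels mapped into the cloud.
    Each candidate is a (label, confidence, source) tuple.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    candidates = [[(labels_3d[i], conf_3d[i], "second")] for i in range(len(points_3d))]
    extra_points, extra_candidates = [], []
    for k, p in enumerate(np.asarray(mapped_points, dtype=float)):
        # Coincidence test against the original points (a tolerance stands in for
        # exact coincidence, which is an assumption of this sketch).
        dists = np.linalg.norm(points_3d - p, axis=1)
        if dists.size and dists.min() < tol:
            candidates[int(dists.argmin())].append((labels_2d[k], conf_2d[k], "first"))
        else:
            # No coincident point: extend the cloud with the mapped point, whose
            # candidate set holds the two-dimensional result.
            extra_points.append(p)
            extra_candidates.append([(labels_2d[k], conf_2d[k], "first")])
    all_points = np.vstack([points_3d, np.array(extra_points)]) if extra_points else points_3d
    return all_points, candidates + extra_candidates
```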
As described above, the terminal maps the semantic recognition result based on the color image data onto the semantic recognition result based on the depth data. This enriches the depth-based semantic recognition result and allows the more accurate of the color-image-based and depth-based semantic recognition results to be selected for each point, improving the accuracy of the semantic recognition result.
In one example, the mapping relationship between the point cloud data and the color image data is shown in the following formula b, and the point cloud data and the color image data can be associated based on the formula b.
In formula b, Zc represents the Z-axis value in the coordinate system of the image sensor (e.g., a camera) on the terminal, i.e., the distance of the object from the image sensor; u represents the abscissa value of a two-dimensional point in the coordinate system of the color image, and v represents the ordinate value of the two-dimensional point in that coordinate system; u0 represents the abscissa value of the center point of the color image, and v0 represents the ordinate value of the center point of the color image; R represents the 3 x 3 rotation matrix of the extrinsic matrix of the image sensor, and T represents the 3 x 1 translation matrix of the extrinsic matrix of the image sensor; fx and fy are intrinsic parameters of the image sensor, corresponding to the focal length of the image sensor; XW represents the abscissa value of a three-dimensional point in the point cloud data, YW represents the ordinate value of the three-dimensional point in the point cloud data, and ZW represents the depth value of the three-dimensional point in the point cloud data.
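Formula b is likewise not reproduced. A plausible reconstruction, assuming the standard pinhole projection with extrinsic parameters consistent with the variable definitions above, is:

```latex
% Assumed reconstruction of formula b (pinhole projection with extrinsics); the
% original formula is not shown in the text.
\[
Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} R & T \end{bmatrix}
\begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}
\]
```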
In one example, the terminal determines the semantic recognition result of each first three-dimensional point according to the candidate result set of each first three-dimensional point in the third semantic recognition result as follows: for each first three-dimensional point in the third semantic recognition result, one candidate result is selected from the candidate result set of the first three-dimensional point as its semantic recognition result according to the confidence of the candidate results in that set.
It should be noted that, as can be understood by those skilled in the art, in practical applications, the final semantic recognition result may be selected in other manners, and the method for selecting the final semantic recognition result is not limited in the present embodiment.
It is worth mentioning that the terminal selects the final semantic recognition result according to the first confidence degree of the first semantic recognition result and the second confidence degree of the second semantic recognition result, so that the confidence degree of the final semantic recognition result is more guaranteed and better conforms to the actual scene.
In one example, the candidate result set of the first three-dimensional point includes a first candidate result and a second candidate result; the first candidate result is determined according to the first semantic recognition result, and the second candidate result is determined according to the second semantic recognition result. The terminal determines a first confidence of the first candidate result, a second confidence of the second candidate result, a scale factor of the first confidence, and a scale factor of the second confidence; judges whether the product of the first confidence and its scale factor is greater than the product of the second confidence and its scale factor; if yes, takes the first candidate result as the semantic recognition result of the first three-dimensional point; and if not, takes the second candidate result as the semantic recognition result of the first three-dimensional point. The scale factor of the first confidence may be the scale factor of the first network model, and the scale factor of the second confidence may be the scale factor of the second network model.
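A minimal sketch of this selection rule; the candidate representation is illustrative, and the 0.7/0.3 scale-factor pairing is one of the example settings discussed below:

```python
def select_result(first_candidate, second_candidate, scale_first=0.7, scale_second=0.3):
    """Each candidate is (label, confidence); the scale factors belong to the
    first and second network models respectively (values are illustrative)."""
    label_1, conf_1 = first_candidate
    label_2, conf_2 = second_candidate
    if conf_1 * scale_first > conf_2 * scale_second:
        return label_1   # keep the color-image-based result
    return label_2       # otherwise keep the point-cloud-based result

# Example: 0.6 * 0.7 = 0.42 > 0.9 * 0.3 = 0.27, so the color-based label wins.
print(select_result(("chair", 0.6), ("table", 0.9)))  # -> "chair"
```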
It should be noted that, as can be understood by those skilled in the art, in practical applications, a strategy for determining a semantic recognition result of a point may be selected as needed, and the method for selecting a semantic recognition result of a three-dimensional point from a candidate result set is not limited in the present embodiment.
It should be noted that, as will be understood by those skilled in the art, the scale factor of the first network model and the scale factor of the second network model may be set according to the sizes of the color image data set and the depth data set and the complexity of the scene. For example, when there are enough color image data to support recognizing objects at different viewing angles, the confidence of the recognition result of the first network model is generally higher, so the scale factor of the first network model may be set to 0.7 and the scale factor of the second network model to 0.3. For another example, when the color image data are sparse but the depth data are of higher quality, so that the generated point cloud is of better quality and can better restore the actual object, the confidence of the recognition result of the second network model may be higher, so the scale factor of the first network model may be set to 0.4 and the scale factor of the second network model to 0.6. This embodiment does not limit the specific values of the scale factors of the first network model and the second network model.
It is worth mentioning that setting scale factors for the first network model and the second network model allows developers to assign different scale factors to the two recognition methods according to the interference level and accuracy of the image-data-based segmentation recognition method and the 3D-point-cloud-based segmentation recognition method in different scenes. The terminal can thus favor the segmentation recognition method better suited to the scene in which it is located, which further ensures the accuracy of the recognition result.
The above description is only for illustrative purposes and does not limit the technical aspects of the present invention.
Compared with the prior art, in the semantic recognition method provided by this embodiment, the terminal performs segmentation recognition on the first image data in two recognition modes and combines the advantages of both: the point cloud mode recognizes the scene from a three-dimensional perspective, while the image mode further recognizes the scene using the texture information of the image. The terminal determines the final recognition result by fusing the recognition results of the two modes, so that when one recognition method fails to recognize or mistakenly recognizes an object in the scene where the terminal is located, the other recognition method can compensate for it, making the final recognition result more accurate.
A second embodiment of the present invention relates to a semantic recognition method. The embodiment is further improved on the basis of the first embodiment, and the specific improvements are as follows: after the first fusion recognition result is determined, the first fusion recognition result is subjected to inter-frame fusion with a semantic recognition result based on other image data.
Specifically, as shown in fig. 2, in the present embodiment, the method for inter-frame fusion includes the following steps:
step 201: and acquiring a first fusion recognition result and a second fusion recognition result of the second image data.
Specifically, the first fused recognition result is a fused recognition result of the first semantic recognition result and the second semantic recognition result of the first image data. The second image data is image data of a frame previous to the first image data. The second fusion recognition result of the second image data is a fusion recognition result of the first fusion recognition result of the second image data and the final recognition result of the image data of the previous frame of the second image data.
Step 202: and determining a second fusion recognition result of the first image data according to the first fusion recognition result of the first image data and the second fusion recognition result of the second image data.
Specifically, the second fusion recognition result is the final recognition result of the first image data. When performing semantic recognition on the Nth frame image, the terminal refines the first fusion recognition result of the Nth frame image by referring to the semantic recognition results of the previous N-1 frames, so as to further improve the accuracy of the final recognition result of the Nth frame image.
The following exemplifies a method for obtaining the second fusion recognition result of the nth frame image by the terminal with reference to the semantic recognition result of the previous N-1 frame image.
Method 1: the terminal determines a candidate fusion recognition result set of each first three-dimensional point in the first fusion recognition result of the first image data according to the first fusion recognition result of the first image data and the second fusion recognition result of the second image data; for each first three-dimensional point of the first fusion recognition result of the first image data, selects one candidate fusion recognition result from the candidate fusion recognition result set of the first three-dimensional point as the fusion recognition result of the first three-dimensional point; and determines the second fusion recognition result of the first image data according to the fusion recognition result of each first three-dimensional point.
In one example, if the number of candidate fusion recognition results in the candidate fusion recognition result set is greater than 1, the terminal selects the candidate fusion recognition result with the highest confidence as the fusion recognition result of the first three-dimensional point according to the confidence of each candidate fusion recognition result.
In the inter-frame fusion process, if the first fusion recognition result of the first image data indicates that a certain three-dimensional point in the point cloud data is a first object, and the second fusion recognition result of the second image data indicates that the same three-dimensional point is a second object, then the candidate fusion recognition result set of the three-dimensional point includes a first candidate fusion recognition result, indicating that the point is the first object, and a second candidate fusion recognition result, indicating that the point is the second object. The terminal judges whether the confidence of the first candidate fusion recognition result is greater than or equal to the confidence of the second candidate fusion recognition result; if yes, the fusion recognition result of the three-dimensional point is the first candidate fusion recognition result; if not, it is the second candidate fusion recognition result. If the first fusion recognition result of the first image data indicates that a certain three-dimensional point is a first object and the second fusion recognition result of the second image data also indicates that the point is the first object, the candidate fusion recognition result set includes only the first candidate fusion recognition result, which becomes the fusion recognition result of the point. If a three-dimensional point exists in the first fusion recognition result of the first image data but not in the second fusion recognition result of the second image data, its fusion recognition result is the one from the first image data. If a three-dimensional point does not exist in the first fusion recognition result of the first image data but exists in the second fusion recognition result of the second image data, its fusion recognition result is the one from the second image data.
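As a hedged sketch of Method 1 (the dictionary representation of per-point results is an assumption), the per-point rules above could read:

```python
def fuse_between_frames(current, previous):
    """
    current / previous: dicts mapping a 3-D point key (e.g. rounded coordinates)
    to (label, confidence). Returns the second fusion recognition result of the
    current frame according to the rules described above.
    """
    fused = {}
    for key, (label_c, conf_c) in current.items():
        if key in previous:
            label_p, conf_p = previous[key]
            # Point seen in both frames: keep the higher-confidence candidate,
            # preferring the current frame on ties.
            fused[key] = (label_c, conf_c) if conf_c >= conf_p else (label_p, conf_p)
        else:
            # Point present only in the current frame.
            fused[key] = (label_c, conf_c)
    for key, value in previous.items():
        # Points present only in the previous frame keep their previous result.
        fused.setdefault(key, value)
    return fused
```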
Method 2: the terminal selects different processing modes according to whether N is greater than a preset value.
When N is smaller than or equal to a preset value, the terminal determines a candidate fusion recognition result set of each first three-dimensional point in the first fusion recognition result of the first image data according to the first fusion recognition result of the first image data and the second fusion recognition result of the second image data; for each first three-dimensional point of a first fusion recognition result of the first image data, selecting one candidate fusion recognition result from the candidate fusion recognition result set as a fusion recognition result of the first three-dimensional point according to the confidence degree of the candidate fusion recognition result in the candidate fusion recognition result set of the first three-dimensional point; and determining a second fusion recognition result of the first image data according to the fusion recognition result of each first three-dimensional point. The process of determining the second fused recognition result in this case may refer to the related description of method 1.
When N is greater than the preset value, the terminal acquires optical flow information of the first image data and the second image data, and performs the following operations for each first three-dimensional point in the first fusion recognition result of the first image data: judging, according to the optical flow information, whether the second three-dimensional point of the second fusion recognition result and the first three-dimensional point belong to the same object, the second three-dimensional point having the same coordinates as the first three-dimensional point; if yes, taking the fusion recognition result of the corresponding second three-dimensional point as the fusion recognition result of the first three-dimensional point; if not, determining the fusion recognition result of the first three-dimensional point according to the confidence of the fusion recognition result of the corresponding second three-dimensional point and the confidence of the fusion recognition result of the first three-dimensional point. The terminal then determines the second fusion recognition result of the first image data according to the fusion recognition result of each first three-dimensional point.
In one example, the terminal acquires the confidence of the optical flow information before operating on each first three-dimensional point in the first fusion recognition result of the first image data; determining that the confidence of the optical flow information is greater than a threshold.
It should be noted that, in practical applications, the threshold and the preset value may be set as needed, for example, the preset value may be set to any positive integer greater than 1, such as a positive integer greater than 10, and the threshold may be set to any percentage greater than 50%, for example, may be set to 60%, 70%, and so on.
In the inter-frame fusion process, the terminal acquires the optical flow information of the first image data and the second image data together with the confidence of the optical flow information, and if the confidence of the optical flow information is greater than the threshold, obtains the second fusion recognition result of the first image data in combination with the optical flow information. Specifically, if a first three-dimensional point in the first fusion recognition result of the first image data and a second three-dimensional point in the second fusion recognition result of the second image data have the same coordinates, the optical flow information indicates that the first three-dimensional point and the second three-dimensional point belong to the same object, the fusion recognition result of the first three-dimensional point indicates that the point is a first object, and the second fusion recognition result of the second image data indicates that the point is a second object, then the terminal determines that the fusion recognition result of the point indicates the second object. If the optical flow information indicates that the first three-dimensional point and the second three-dimensional point do not belong to the same object, the fusion recognition result of the point can be determined with reference to the related content of Method 1. If the confidence of the optical flow information is less than the threshold, the second fusion recognition result of the first image data is obtained with reference to the related content of Method 1.
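A hedged sketch of this optical-flow-assisted rule, reusing fuse_between_frames from the Method 1 sketch above; the same_object predicate and the flow-confidence threshold are assumptions:

```python
def fuse_with_optical_flow(current, previous, same_object, flow_confidence,
                           flow_threshold=0.6):
    """
    current / previous : dicts {point_key: (label, confidence)} as in the previous sketch.
    same_object(key)   : hypothetical predicate derived from the optical flow, telling
                         whether the co-located points of the two frames belong to
                         the same object.
    """
    if flow_confidence < flow_threshold:
        # Flow not trustworthy: fall back to the confidence-based rule (Method 1 sketch).
        return fuse_between_frames(current, previous)
    fused = {}
    for key, (label_c, conf_c) in current.items():
        if key in previous:
            label_p, conf_p = previous[key]
            if same_object(key):
                # Consistent motion: inherit the previous frame's fusion result.
                fused[key] = (label_p, conf_p)
            else:
                # Inconsistent motion: keep the higher-confidence candidate.
                fused[key] = (label_c, conf_c) if conf_c >= conf_p else (label_p, conf_p)
        else:
            fused[key] = (label_c, conf_c)
    for key, value in previous.items():
        fused.setdefault(key, value)
    return fused
```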
It is worth mentioning that by analyzing the optical flow information, under the condition of considering the consistency of the first image data and the second image data, the first fusion recognition result of the first image data is perfected by combining the second fusion recognition result of the second image data, so as to obtain a more accurate final recognition result.
In one example, the terminal may determine optical flow information for the first image data and the second image data through a third network model. The terminal obtains optical flow information and confidence of the optical flow information according to the first image data, the second image data and the third network model; wherein the parameters in the third network model are determined based on the third training image data, the fourth training image data, and the optical flow information of the third training image data and the fourth training image data.
The third network model may be a deep learning-based network model or a network model based on other principles, and the specific type of the third network model is not limited in this embodiment.
In one example, the third network model is a convolutional neural network model of the form F = CNN(θ, I1, I2), where θ denotes the parameters obtained by training that adjust the convolutional neural network so that it can predict optical flow, and I1 and I2 are the color images of two adjacent frames.
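As an illustrative stand-in for such a model (the text does not name a specific network, and the confidence output mentioned above is not covered by this sketch), torchvision's RAFT implementation can be driven with two adjacent color frames:

```python
import torch
from torchvision.models.optical_flow import raft_small

# RAFT (small) used here only as a stand-in for the third network model.
model = raft_small(weights=None)
model.eval()

# Two dummy adjacent color frames, values in [-1, 1], height/width divisible by 8.
i1 = torch.rand(1, 3, 480, 640) * 2 - 1
i2 = torch.rand(1, 3, 480, 640) * 2 - 1

with torch.no_grad():
    flow_predictions = model(i1, i2)   # list of progressively refined flow fields
flow = flow_predictions[-1]            # 1 x 2 x H x W: per-pixel (dx, dy) motion
```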
In one example, a flow diagram of the semantic recognition method according to the present embodiment is shown in fig. 3. The terminal first recognizes the color image data and the depth data in the initial frame image data, and determines the intra-frame fusion recognition result, i.e., the first fusion recognition result, based on the first recognition result of the color image data and the second recognition result of the depth data. The terminal then determines the optical flow information of the first image data and the second image data based on the color image data in the two frames, and, based on the optical flow information, refines the first fusion recognition result of the first image data using the first fusion recognition result of the second image data to obtain a more accurate second fusion recognition result of the first image data.
The above description is only for illustrative purposes and does not limit the technical aspects of the present invention.
Compared with the prior art, the semantic recognition method provided in this embodiment can use motion estimation, motion compensation and the like to link moving objects across different frames, since optical flow information represents the motion of objects. Because optical flow information carries rich motion information, the motion of an object can be predicted from it, and the information from different frames can be combined and complemented, thereby producing a more accurate recognition result.
The steps of the above methods are divided only for clarity of description. In implementation, they may be combined into one step, or some steps may be split into multiple steps; as long as the same logical relationship is included, such variants fall within the protection scope of this patent. Adding insignificant modifications to the algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm or process also falls within the protection scope of this patent.
A third embodiment of the present invention relates to a semantic recognition apparatus, as shown in fig. 4, including: an acquisition module 401, a first identification module 402, a second identification module 403 and a determination module 404. The acquiring module 401 is configured to acquire first image data of a scene; the first identification module 402 is configured to obtain a first semantic identification result according to the color image data of the first image data and the first network model; the first network model is obtained by training according to the color image data of the first training image and the semantic recognition result of the first training image; the second identification module 403 is configured to obtain a second semantic identification result according to the depth data of the first image data and the second network model; the second network model is obtained by training according to the depth data of the second training image and the semantic recognition result of the second training image; the determining module 404 is configured to fuse the first semantic recognition result and the second semantic recognition result to obtain a first fused recognition result of the first image data.
It should be understood that this embodiment is a system example corresponding to the first embodiment, and may be implemented in cooperation with the first embodiment. The related technical details mentioned in the first embodiment are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the first embodiment.
It should be noted that each module referred to in this embodiment is a logical module. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, elements that are not closely related to solving the technical problem proposed by the present invention are not introduced in this embodiment, but this does not mean that no other elements exist in this embodiment.
A fourth embodiment of the present invention relates to a terminal, as shown in fig. 5, including: at least one processor 501; and a memory 502 communicatively coupled to the at least one processor 501; the memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501, so that the at least one processor 501 can execute the semantic recognition method according to the above embodiments.
The terminal includes: one or more processors 501 and a memory 502, with one processor 501 being an example in fig. 5. The processor 501 and the memory 502 may be connected by a bus or other means, and fig. 5 illustrates the connection by the bus as an example. Memory 502, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The processor 501 executes various functional applications of the device and data processing by running non-volatile software programs, instructions, and modules stored in the memory 502, that is, implements the above semantic recognition method.
The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store a list of options, etc. Further, the memory 502 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 502 may optionally include memory located remotely from processor 501, which may be connected to an external device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 502 and when executed by the one or more processors 501 perform the semantic recognition method of any of the method embodiments described above.
The above product can execute the method provided by the embodiments of the present application and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present application.
A fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the embodiments described above may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.