Vineet Suryan
April 05, 2023
Reading time:
Extracting precise 3D object information is one of the prime goals for comprehensive scene understanding. However, labeling errors are common in present open-source 3D perception datasets, which could have impactful consequences. To tackle this issue, we used Carlafox to automatically generate an error-free synthetic dataset for 3D perception.
Deep 3D object detectors may become confused during training due to the inherent ambiguity in ground-truth annotations of 3D bounding boxes brought on by occlusions, missing, or manual annotation errors, which lowers the detection accuracy. However, existing methods overlook such issues to some extent and treat the labels as deterministic. It is possible to create enormous datasets without cost by using a virtual simulation with known labels. According to research, using both simulated and real data helps AI models become more accurate. Our results show that simulated data can significantly reduce the amount of training on real data required to achieve satisfactory levels of accuracy.
Carlafox, a web-based CARLA visualizer, substantially demystifies the arduous task of synthetic dataset generation for 3D object detection. We use Carlafox to set up sensor configurations, create diverse weather conditions, and generate data from different maps in the KITTI format. One of the advantages of the dataset is that the open-source CARLA simulator was used to recreate the same LiDAR and camera configurations used to generate the original KITTI data.
The objective is to offer a challenging dataset to assess and enhance approaches in complicated vision tasks, such as 3D object detection. In total, the dataset has 12807 cars, 10252 pedestrians, and 11624 cyclists. The dataset contains 2D and 3D bounding box annotations of the classes: Car, Pedestrian, and Cyclist and contains both LIDAR and camera sensor data, as well as the generation of sensor calibration matrices.
Figure 1: 3D bounding boxes projected on LiDAR point cloud. |
Figure 2: 2D & 3D bounding boxes with occlusion and unique id. |
Due to its numerous applications across various industries, including robotics and autonomous driving, 3D object detection has been gaining more attention from businesses and academia. LiDAR sensors are commonly used in robotics and autonomous vehicles to collect 3D scene data as sparse and erratic point clouds, which has shown to serve as helpful cues for 3D scene perception and comprehension.
We trained quite a few LiDAR-based networks, namely PointRCNN, PVRCNN, and SFA3D, and a Multimodal(RGB + LiDAR) 3D object detection model, i.e. MVXNet on the CARLA synthetic dataset but fine-tuned only one of these i.e., SFA3D, mainly because it is faster and uses less memory without much loss in performance. Although, any other model could have performed better if optimized and tuned further than just training a baseline, as shown in the following panel.
Super Fast and Accurate 3D object detection is based on 3D LiDAR Point Clouds. The ResNet-based Keypoint Feature Pyramid Network (KFPN), builds the backbone of the detector and was proposed in RTM3D.
The model takes a bird's-eye-view (BEV) map as input. The height, intensity, and density of 3D LiDAR point clouds are used to encode the BEV map. On the other hand, it outputs a heatmap for the main center, the center offset, the heading angle, the dimensions of the object, and the z coordinate.
Figure 3: SFA3D predictions on held-out test set. |
As for the loss functions, the focal loss is used for the main center heatmap, and l1 loss for the heading angle (yaw). It employs balanced l1 loss for the z coordinate and the three dimensions (height, width, and length). We trained the model for a total of 300 epochs by setting equal weights for the aforementioned loss components using a cosine LR scheduler with an initial learning rate of 0.001 and a batch size of 32 (on two RTX 2080Ti). Refer to the following wandb panels for results with SFA3D experiments on all towns and Town01, respectively.
TensorRT enables developers to optimize inference by leveraging CUDA libraries. TensorRT supports both INT8 and FP16 post-training quantization, which greatly reduces application latency and is required for many real-time services, as well as autonomous and embedded applications.
As a first step, we convert the SFA3D PyTorch model to ONNX, and use the ONNX parser to convert ONNX model to TensorRT. We could also bypass the parser and directly convert from PyTorch to TensorRT, doing so would require us to write the SFA3D network in TensorRT network-definition API, which would be time intensive and result in negligible speed benefit but could be more efficient on an embedded device like a Jetson Nano.
In addition, we examined benchmarks across multiple frameworks like TVM and ONNX to ensure that TensorRT is the best performing. From the above results, it is clear that TensorRT aids in obtaining higher throughput on the same hardware. Furthermore, quantization to FP16 boosts performance even more. On RTX2080Ti, TensorRT may be the most efficient solution for SFA3D, but it's also possible that another framework, such as Apache TVM, performs better on a different device with the same or another network; thus, results may vary depending on the hardware.
To make it easier to compare the model's predictions with CARLA's ground truth, we incorporated the model into Carlafox and made them available in a separate Foxglove image panel. For more details on the Carlafox visualizer, please refer to this dedicated blog post.
Numerous open-source resources paved the way for us to accomplish our work. In the future, we plan to finetune the trained models with the official KITTI dataset. Because of the expenses associated with acquiring real-world data, the use of synthetic data for training machine learning models has grown in popularity in recent years. This is especially true in the case of autonomous driving due to the rigorous requirement of generalizability to a wide range of driving conditions, so we hope our findings help others in research and development.
If you have questions or ideas on how to leverage synthetic data for 3D perception, join us on our Gitter #lounge channel or leave a comment in the comment section.
15/01/2025
With VirGL, Venus, and vDRM, virglrenderer offers three different approaches to obtain access to accelerated GFX in a virtual machine. Here…
19/12/2024
In the world of deep learning optimization, two powerful tools stand out: torch.compile, PyTorch’s just-in-time (JIT) compiler, and NVIDIA’s…
08/10/2024
Having multiple developers work on pre-merge testing distributes the process and ensures that every contribution is rigorously tested before…
15/08/2024
After rigorous debugging, a new unit testing framework was added to the backend compiler for NVK. This is a walkthrough of the steps taken…
01/08/2024
We're reflecting on the steps taken as we continually seek to improve Linux kernel integration. This will include more detail about the…
27/06/2024
With each board running a mainline-first Linux software stack and tested in a CI loop with the LAVA test framework, the Farm showcased Collabora's…
Comments (0)
Add a Comment