Technical Summary
The TrailNet DNN is a novel deep neural network architecture optimized to run within the hardware limitations of the Jetson TX2. It is possible to retrain the weights of the TrailNet DNN by creating your own set of training images, but that is beyond the current scope of this documentation. Here we are interested in investigating how well the network can maintain the pose of the drone with respect to the center of the trail. The stability of the controller depends on the network producing smooth, accurate inferences about the drone's current orientation and position; to this end, the redtail team [2] modified the loss function to include cross-entropy and entropy reward terms, which penalize overconfident predictions and stabilize the outputs.
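As a sketch of the general form of that loss (the exact terms and weights are given in [2]; here p is the label distribution, q the softmax output, and λ is an assumed tuning weight):

```latex
\mathcal{L} \;=\; \underbrace{-\sum_{i} p_i \log q_i}_{\text{cross entropy}}
\;+\; \underbrace{\lambda \sum_{i} q_i \log q_i}_{\text{entropy reward}}
```

Since the second term is minimized by high-entropy (less confident) predictions, it discourages the network from saturating its softmax outputs, which in turn smooths the yaw commands derived from them.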
Once the DNN has completed its convolutions on the input image, it outputs a 1x6 vector of softmax probabilities: three for the predicted view orientation (vo: facing left, center, or right of the trail) and three for the lateral offset (lo: left of, on, or right of the trail center). If, for example, the network infers a view orientation of 0.9 (90%) left and 0.1 (10%) right, then alpha (the yaw angle) becomes negative. Neglecting the lateral offset term for now, this results in a clockwise yaw command, which turns the drone back toward the center of the trail. Beta is essentially the proportional gain of the controller, corresponding to 10 degrees in our case.
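A minimal sketch of this control law follows. The output layout and the single shared gain are assumptions for illustration (the paper [2] uses separate gains for the two terms); the sign convention matches the example above.

```python
import math

BETA = math.radians(10.0)  # proportional gain, ~10 degrees as described above

def yaw_command(dnn_output):
    """Compute the yaw correction alpha from the 1x6 TrailNet output.

    dnn_output layout (assumed): [vo_left, vo_center, vo_right,
                                  lo_left, lo_center, lo_right]
    """
    vo_left, _, vo_right = dnn_output[0:3]
    lo_left, _, lo_right = dnn_output[3:6]
    # P(right) - P(left): facing/offset left -> negative alpha -> clockwise turn
    alpha = BETA * (vo_right - vo_left) + BETA * (lo_right - lo_left)
    return alpha

# Example from the text: vo 0.9 left, 0.1 right -> negative (clockwise) yaw
print(yaw_command([0.9, 0.0, 0.1, 0.0, 1.0, 0.0]))  # -0.8 * BETA
```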
In addition to publishing a new yaw angle, the px4 controller node computes the next local waypoint based on the current orientation and position of the drone. This waypoint, together with a fixed altitude and a constant forward velocity, is used to build the final commanded 'goto' pose message, as sketched below.
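A minimal sketch of how such a 'goto' pose might be assembled with rospy. The frame name, fixed altitude, and waypoint projection distance are assumptions for illustration; the actual redtail controller node is more involved, and the constant forward velocity is commanded separately since a PoseStamped carries no velocity.

```python
import math
import rospy
from geometry_msgs.msg import PoseStamped
from tf.transformations import quaternion_from_euler

FIXED_ALT = 2.0    # assumed fixed altitude (m)
STEP_AHEAD = 1.0   # assumed distance to project the next waypoint (m)

def build_goto_pose(x, y, yaw, alpha):
    """Project a waypoint ahead of the drone along the corrected heading."""
    new_yaw = yaw + alpha
    pose = PoseStamped()
    pose.header.stamp = rospy.Time.now()
    pose.header.frame_id = "map"  # assumed frame
    pose.pose.position.x = x + STEP_AHEAD * math.cos(new_yaw)
    pose.pose.position.y = y + STEP_AHEAD * math.sin(new_yaw)
    pose.pose.position.z = FIXED_ALT
    qx, qy, qz, qw = quaternion_from_euler(0.0, 0.0, new_yaw)
    pose.pose.orientation.x = qx
    pose.pose.orientation.y = qy
    pose.pose.orientation.z = qz
    pose.pose.orientation.w = qw
    return pose
```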
The navigation state includes conditional statements that check whether the joystick node is running, allowing the published pose commands to be overridden within ROS without switching out of OFFBOARD mode on the flight controller.
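A sketch of what that override check could look like, assuming the joystick node publishes on a standard sensor_msgs/Joy topic (the topic name and staleness window are illustrative, not taken from the redtail source):

```python
import rospy
from sensor_msgs.msg import Joy

class JoystickOverride:
    """Prefer joystick commands over DNN commands when the joystick is live."""

    def __init__(self, timeout=0.5):
        self.timeout = rospy.Duration(timeout)       # assumed staleness window
        self.last_joy = None
        rospy.Subscriber("/joy", Joy, self._on_joy)  # assumed topic name

    def _on_joy(self, msg):
        self.last_joy = rospy.Time.now()

    def active(self):
        return (self.last_joy is not None
                and rospy.Time.now() - self.last_joy < self.timeout)
```

The navigation loop would then publish the joystick-derived pose instead of the DNN-derived one whenever active() returns True, with the flight controller remaining in OFFBOARD mode throughout.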
Finding a robust way to extract depth from a single image has long been a challenge in the computer vision community, because a single picture fundamentally lacks the geometric information needed to recover true depth measurements. A stereo camera has two image sensors that can be calibrated and rectified, making it possible to compute a disparity image that encodes the depth of objects in the frame. Rather than continue searching for a monocular depth solution, the redtail team created a novel semi-supervised DNN architecture [1] that infers dense disparity maps, and hence depth, from stereo image pairs. Their approach used a combination of supervised and unsupervised learning, which at the time of its development made it the first network addressing this problem to do so.
Eq. (2) ensures photometric consistency, Eq. (3) compares the estimated disparities to the sparse LIDAR data, Eq. (4) ensures that the left and right disparity maps are consistent with each other, and Eq. (5) encourages the disparity maps to be piecewise smooth.
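While Eqs. (2)-(5) themselves are given in [1], the total training loss can be sketched as a weighted sum of the four terms, with the λ weights assumed to be tuned hyperparameters:

```latex
\mathcal{L} \;=\; \lambda_{ph}\,\mathcal{L}_{\mathrm{photo}}
\;+\; \lambda_{lidar}\,\mathcal{L}_{\mathrm{lidar}}
\;+\; \lambda_{lr}\,\mathcal{L}_{\mathrm{lr}}
\;+\; \lambda_{sm}\,\mathcal{L}_{\mathrm{smooth}}
```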
The photometric consistency term provides an unsupervised training signal, while the ground-truth LIDAR data provides a supervised one. In their research, they experimented with training exclusively supervised and exclusively unsupervised networks, and the combined approach yielded the lowest error.
We have tested this network on our drone and are working to integrate the output disparity map into our GAAS project. The major advantage of this approach to computing depth is that we get a better representation of the disparity image before filtering it into a point cloud for the occupancy map. Further, the algorithm is optimized to run with the CUDA, cuDNN, and TensorRT libraries, which make use of the GPU; this leaves room for other functionality on the CPU, whereas most other disparity-generation algorithms, such as block matching (BM), are extremely CPU intensive. The classical way to generate a disparity image comparable to the one inferred by the Stereo DNN is semi-global block matching (SGBM), which is itself extremely CPU intensive, and its raw output needs filtering, yet any excessive filtering would remove major features from the point cloud and occupancy map.
The result below, using the WLS filter, is the best we achieved with respect to computational efficiency. It is extremely important that the generated disparity image is computed quickly and contains just enough disparity information about the environment to generate a point cloud for the occupancy map.
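For comparison, a minimal OpenCV baseline producing the kind of SGBM + WLS result described above (parameter values are illustrative, and the opencv-contrib-python package is required for the ximgproc module):

```python
import cv2

def sgbm_wls_disparity(left_gray, right_gray):
    """CPU baseline: semi-global block matching followed by WLS filtering."""
    left_matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=96,   # must be divisible by 16; illustrative value
        blockSize=7,
    )
    # Right-view matcher and WLS filter tuned to the left matcher's settings
    right_matcher = cv2.ximgproc.createRightMatcher(left_matcher)
    wls = cv2.ximgproc.createDisparityWLSFilter(left_matcher)
    wls.setLambda(8000.0)    # illustrative smoothing strength
    wls.setSigmaColor(1.5)

    disp_left = left_matcher.compute(left_gray, right_gray)
    disp_right = right_matcher.compute(right_gray, left_gray)
    return wls.filter(disp_left, left_gray, disparity_map_right=disp_right)
```

Tuning numDisparities and the WLS lambda trades detail against smoothness, which is exactly the balance discussed above: enough disparity information for the occupancy map, without drowning the point cloud in noise or filtering its features away.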
[1] Smolyanskiy, Nikolai, et al. “On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach.” ArXiv.org, 8 July 2020, arxiv.org/abs/1803.09719.
[2] Smolyanskiy, Nikolai, et al. “Toward Low-Flying Autonomous MAV Trail Navigation Using Deep Neural Networks for Environmental Awareness.” ArXiv.org, 22 July 2017, arxiv.org/abs/1705.02550.