The title says it all: why would you need a controller to steer a drone when you can just use a water bottle?
To show what is possible with the DJI Windows SDK and Cognitive Services, I wanted to control a drone with a physical object, in this case a water bottle.

How does it work? The concept is as follows: the drone sends video footage to a laptop. When the frames arrive at the laptop, each frame is analysed by an image recognition model and the position of the water bottle is determined. When the position of the water bottle is known, the drone will be steered to keep the water bottle in the center.
Creating such an object detection model is quite hard, especially when you want to build it from scratch. Azure Cognitive Services makes this a lot easier: its Custom Vision service allows you to create object detection models without any coding.

The end result can be seen here in a short GIF:

Drone following water bottle

Technical background

Object detection

To detect objects in the images I am using Custom Vision. This service allows you to create object detection models without coding. After you upload multiple images of a specific object and tag the object in each image, the service trains the model automatically. You can choose a quick training run or give the service more time to compute a better model. For these models (deep neural networks), more training time will likely lead to a more accurate model.

Image tagging inside custom vision service

Normally these models are also executed in the Custom Vision service. However, Custom Vision also allows the created models to be exported as ONNX models. Once exported, the models can be executed anywhere, for example in mobile apps or on web servers.

ONNX is an open format to represent neural networks.

Control application

To execute the object detection model and control the drone I created a C# UWP application.
Inside the application, the DJI SDK receives the raw footage, which is decoded using the FFmpeg library. Once an image is decoded and transformed into a VideoFrame, the Windows Machine Learning API executes the object detection model on that specific frame. The results are bounding boxes around the detected objects, which give the location and size of each detected object in the actual image. Using this location a choice is made: steer the drone right, left or keep center.
This loop continues until the command to land is given, after which the drone commences its auto land sequence.
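The decision step at the end of that loop can be sketched in a few lines. This is only an illustration, written in Python for brevity rather than the C# of the actual application. The Detection class, the function names, the 0.5 confidence threshold, the 0.1 dead zone and the choice to keep center when no bottle is found are all assumptions made for this example. It also assumes bounding box coordinates normalized to the 0..1 range and a camera that looks forward, so a bottle on the right side of the image means the drone should steer right.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    """One bounding box; coordinates are assumed normalized to 0..1."""
    left: float
    top: float
    width: float
    height: float
    confidence: float


def best_detection(detections: List[Detection],
                   min_confidence: float = 0.5) -> Optional[Detection]:
    """Return the most confident box above the threshold, or None."""
    candidates = [d for d in detections if d.confidence >= min_confidence]
    return max(candidates, key=lambda d: d.confidence, default=None)


def steering_command(detection: Optional[Detection],
                     dead_zone: float = 0.1) -> str:
    """Map the horizontal position of the box to 'left', 'right' or 'center'."""
    if detection is None:
        # Assumption: without a detected bottle the drone keeps its heading.
        return "center"
    box_center_x = detection.left + detection.width / 2
    offset = box_center_x - 0.5  # negative: bottle is left of the image center
    if offset > dead_zone:
        return "right"
    if offset < -dead_zone:
        return "left"
    return "center"


if __name__ == "__main__":
    frame_results = [
        Detection(left=0.70, top=0.40, width=0.20, height=0.30, confidence=0.8),
        Detection(left=0.10, top=0.10, width=0.10, height=0.10, confidence=0.2),
    ]
    print(steering_command(best_detection(frame_results)))
```

The dead zone keeps the drone from reacting to every small shift of the bottle around the center of the image; a wider dead zone gives calmer but less precise tracking.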
You might wonder why the Custom Vision service is not used to evaluate the model, as it would be less complex. The reason for that is twofold:

  1. The laptop's WiFi connection is used to connect to the drone, so the laptop no longer has access to the public internet.
  2. In a real-world scenario, depending on a proper internet connection is not always viable or desirable.

End result of the application:

Key learnings:

This is the first C# application I have developed, so I learned a lot along the way. Here are some of the learnings I'd like to share:

  1. Using the LearningModelDevice class you can set the execution device. Setting this to the GPU (DirectXHighPerformance) significantly speeds up the execution of the model.
  2. The application performance was not acceptable initially. A fix which really helped me was to execute the evaluation function inside a Dispatcher.RunAsync block. I am not sure at all if this is a good coding practice, but it really helped the performance.

Hope you enjoyed reading this. If you have any questions, just post a comment below.