Is Deep Learning the future of UAV vision?


A few weeks ago Chris Anderson shared a post about the effort to bring obstacle avoidance to PX4. This effort is based on Simultaneous Localization And Mapping (building a map of the world in real time) and path planning -- for the sake of conciseness I will call this approach SLAMAP.


I want to offer some thoughts on why this might not be a very promising approach, and propose an alternative that I think is more interesting.


One of the major shortfalls of SLAMAP is its inability to handle dynamic objects (objects that move).

For example, the block in the middle of this 3D map could either be a still box or a vehicle moving at full speed towards the drone. Just take a moment to think how fundamental the perception of motion is to your ability to move around or drive. Think about what it would be like to have no information about the velocity of the objects around you.


Another problem is the binary nature of the information stored in 3D maps -- a cell is either empty or solid, seen or unseen.


Look at this image. Assuming that the cloud of snow powder is a solid wall places a huge constraint on the path of the drone. If the drone were following the snowboarder from behind, it would be forced to stop suddenly.


The idea of seen and unseen is also very limiting. If a building you have previously seen is now out of line of sight, it's reasonable to assume it hasn't moved. But you cannot say the same about a person or a vehicle. Similarly, you can't assume a place is still empty just because it was empty the last time you looked. This relates to the lack of understanding of dynamic objects.
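To make the limitation concrete, here is a minimal sketch of the kind of map being criticized: each cell holds one of three states and nothing else, so there is no room for motion, probability, or material semantics. The class name, states, and coordinates are invented for illustration, not taken from any real mapping library.

```python
# A toy occupancy map where every cell is exactly one of three states.
# This is the whole vocabulary such a map has for describing the world.
UNSEEN, EMPTY, SOLID = 0, 1, 2

class VoxelMap:
    def __init__(self):
        self.cells = {}                      # (x, y, z) -> state

    def get(self, xyz):
        return self.cells.get(xyz, UNSEEN)   # never-observed cells are UNSEEN

    def mark(self, xyz, state):
        self.cells[xyz] = state

world = VoxelMap()
world.mark((0, 0, 0), SOLID)  # a still box... or a truck at full speed: the map can't say
world.mark((0, 0, 1), EMPTY)  # empty when seen -- but maybe not empty now
```

Note that a snow cloud, a wall, and a moving vehicle all collapse into the same SOLID label, which is exactly the problem described above.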


What I am arguing here is that to successfully interact with the environment, a drone needs a semantic understanding of the world (materials, physics, etc.) and the ability to handle uncertainty. SLAMAP can provide neither.


Another difficulty with SLAMAP is that getting the desired functionality out of this framework is not trivial. Path planning solves the problem of finding a feasible route from A to B, but following a target is not the same problem. Reframing the problem to track a target, while also filming beautiful, smooth footage and avoiding obstacles, is very hard.


And finally there is a very empirical argument against SLAMAP: after decades of research, it seems to have failed at finding applications outside academia. Most industrial applications in which SLAMAP is used are simple and highly controlled -- nothing like drone flying.


In short, the shortfalls of SLAMAP are:

  • Does not handle dynamic objects

  • Does not handle uncertainty

  • Has no semantic understanding of the world

  • Makes it difficult to obtain the desired behaviours

  • Empirically, it has been around for decades and hasn't been very successful.


So, is there another option? Yes there is, and it is called deep learning.


The idea behind deep learning is to drop all the handcrafted parts of a system and replace them with a single neural net. This net can then be trained with examples of how to do a task, so that it learns the desired behaviour.


So, instead of having a stereo vision block, a sparse SLAM block, a dense octomap block, a path planning block, etc., there would be a single neural net. To train it to control a drone you could use two basic methods: give it footage of a person flying one (imitation learning), or simply tell it whether it is doing a good job (reinforcement learning) -- or a mixture of the two.
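The first method (learning from pilot footage) can be sketched in a few lines. This is a deliberately tiny stand-in, assuming a linear "policy" and synthetic demonstration data in place of real images and pilot commands; the sizes, learning rate, and data are all invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: image features in, control outputs (e.g. roll/pitch/yaw/throttle) out.
n_features, n_controls = 16, 4
W = np.zeros((n_features, n_controls))   # the "policy net" (here just a linear map)

# Stand-in for "footage of a person flying": observation/control pairs.
X_demo = rng.normal(size=(200, n_features))
expert_W = rng.normal(size=(n_features, n_controls))
Y_demo = X_demo @ expert_W               # the expert's controls (synthetic)

# Imitation learning: fit the policy to the expert's actions by gradient descent on MSE.
lr = 0.05
for _ in range(1000):
    grad = X_demo.T @ (X_demo @ W - Y_demo) / len(X_demo)
    W -= lr * grad

mse = float(np.mean((X_demo @ W - Y_demo) ** 2))  # should be near zero after training
```

A real system would replace the linear map with a deep convolutional net and the synthetic pairs with logged camera frames and stick inputs, but the training loop has the same shape.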


Deep learning has proven incredibly successful in a wide variety of tasks. One of the earliest and most important is object recognition. This year deep nets outperformed humans on the ImageNet challenge -- classifying an image among 1,000 classes -- achieving a top-5 error rate of 3.5%. For perspective, humans are estimated to score around 5.5%, while the best non-deep-learning approaches get around 26%.
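For readers unfamiliar with how those ImageNet numbers are scored: a prediction counts as correct if the true class is among the model's five highest-scored classes. A minimal sketch of that metric, using random scores as a stand-in for a real model's outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
n_images, n_classes = 8, 1000            # ImageNet has 1,000 classes

# Placeholder scores: one number per class per image (a real model would produce these).
scores = rng.normal(size=(n_images, n_classes))
labels = rng.integers(0, n_classes, size=n_images)   # ground-truth class per image

top5 = np.argsort(scores, axis=1)[:, -5:]            # the 5 highest-scored classes
hits = [int(labels[i] in top5[i]) for i in range(n_images)]
top5_error = 1.0 - sum(hits) / n_images              # fraction of images missed
```

With purely random scores the error is essentially 100%; the point of the 3.5% figure is how far below chance-level (and below humans) trained deep nets get.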


The state-of-the-art speech recognition systems are also based on deep learning. See this post by Google.


Deep learning has also been used to outperform previous systems in video classification, handwriting recognition, natural language processing, human pose estimation, depth estimation from a single image, and many others.


And it has also been applied to broader tasks outside classification. One of the most famous examples is the deep net that learned to play Atari games, sometimes at a superhuman level.


And of course AlphaGo, a Go-playing program that recently beat the Go master Lee Sedol 4-1 -- a feat that many thought was decades away.


Very recently NVIDIA published a paper on an end-to-end steering system for a car. It showed a simple deep net -- so simple that I was amazed -- with only 70 hours of driving experience, running on a mobile GPU at 30 fps, performing very well at driving on all kinds of terrain and in all kinds of weather.


But aside from the empirical success of deep learning, the reason I believe it is more promising than SLAMAP is that it has the capacity to understand all the things SLAMAP cannot. All of the inherent limitations of SLAMAP I previously mentioned don't exist in a deep net.


A deep net can learn to understand a dynamic world -- to tell the difference between a truck moving at 100 mph and one at rest. It can also learn meaningful semantics: that snow powder is nothing to worry about, but that water is dangerous. And it can then learn how to use this understanding.


It might seem too good to be true. But would it really be that surprising if the methods that succeed were based on the only machine that can currently do this task -- the brain?


I am pretty sure that, with the available mobile hardware, deep learning frameworks, and the sea of freely accessible research, a decent team could in less than a year develop a better system than SLAMAP will ever lead to.


Do you agree?




  • Well, the more I read and experiment, the more I come to the conclusion that this is the only way to get a UAV flying into the unknown...  The hiking trail is probably the most impressive demonstration so far, and new demonstrations are coming every day.

    To sum up, here is what NVIDIA writes about it:

    Historically, object detection systems depended on feature-based techniques which included creating manually engineered features for each region, and then training of a shallow classifier such as SVM or logistic regression. The proliferation of powerful GPUs and availability of large datasets have made training deep neural networks (DNN) practical for object detection. DNNs address almost all of the aforementioned challenges as they have the capacity to learn more complex representations of objects in images than shallow networks and eliminate the reliance on hand-engineered features.

  • Hi Carles,

    I and my superclocked 970 GTX will get right on it - probably not.

    But I will keep my eye on this channel and as time progresses I plan on sticking my virtual toe in the water too, if not exactly diving right in.

    I still suspect that simpler deep learning augmentations can be safely added to new or existing deterministic procedures to provide beneficial enhancements before going full bore deep.

    I do realize (and in fact hope) that that approach may only be valid for a short time.

    I truly hope I live long enough to see deep learning come to full fruition and actually realize its potential.

    That is not a pessimistic outlook; I am 70 years old.



  • Hi Gary,

    The DGX-1 is the most powerful GPU-oriented computer that exists, but you can do a lot with a much cheaper machine. In fact, most people who work with deep learning use much smaller computers. With a GTX 970 you can start to do a lot of stuff. I, for example, use two 980 Tis.

    But if you need a lot of power, in a few months you will be able to use computers similar to the DGX-1 in the cloud -- even clusters of them. You can rent them just while you need them.

    But I insist that for this project the training procedure would not be the most challenging thing. That would probably be data collection and the reward procedure.

  • Hi Carles,

    I would very much like to see your follow on post.

    Although the Nvidia TK1 and TX1 may have the possibility of operating as on-board processors, I suspect what is really needed for the learning/training phase is something more like the DGX-1, which is a bit out of most DIYers' budgets, I am afraid.

    Of course this is early days.



  • Hi John,

    This is a very big project for one person, because it requires a lot more work than training a net.

    The only way I can see this being developed is if a company develops the necessary framework for a competition, so that teams at universities can focus on the juicy stuff.

    If I have time I will write another post detailing what framework would be needed to spur open research in this area.

  • Thank you Carles for an amazing post. The entire reason I started following DIYdrones was to do something new. Today. Please continue with this idea. If you have any thoughts on where a newbie could start, that would be appreciated. I have some Pi's, Duinos, and Beagles lying around looking for an application. Maybe somebody can suggest the best place on the website to start a new project here. I think using the geofence function from the autopilot might be a way to have some minimal oversight of the new experimental stuff. Thanks.

  • Hi Carles,

    Actually the example snow cloud photo you presented is an excellent example of why a simple preprogrammed approach might be superior.

    In that simple approach, the snow cloud would likely be interpreted as any other solid and thus avoided.

    A deep learning model, however, might conclude that a snow cloud is not solid and can therefore be safely flown through without consequence, and nine times out of ten, or 99 times out of 100, it might be right.

    Then there is that time where the second snow boarder is hiding in the snow cloud.

    Of course, with enough varied examples, the deep learning model would incorporate that too and avoid snow clouds, taking us back to where we started.

    Don't get me wrong, I truly believe that one day, deep learning or something very like it will rule absolutely.

    I am just a bit skeptical that today is that day.

    BTW I coined statistical recurrence estimate, because, in practice, that is what any "learning" based model is.

    To "learn" you observe a number of incidences of the desired action and draw inferences from the data presented.

    Note I said inferences, because no matter how many repetitions you make, you cannot draw a deductive conclusion.

    And inferences by their nature are always at best a statistical probability / possibility.

    That is why, at this point in time, I was suggesting that a combination of deterministic and deep learning methods might be superior, especially in situations where safety is concerned.

    (Much better to err on the conservative side.)

    Best Regards,


  • That is a good point; one of the advantages of hand-crafted algorithms is that we can transfer our knowledge into them.

    That is especially useful for events that might not be seen during training. What you can do is manually supervise and punish actions that you know can have rare but catastrophic consequences. You can also augment the training data to the point where the net has more experience than any human.

    If you get a network to perform at a good enough level and sell drones with it, you can have thousands of drones helping to polish the edge cases. All that data would quickly amount to more experience than any human could have.

  • Hi Hugues,

    It is true that usually neural networks for object recognition are designed to output a probability distribution over all the classes -- although it can also be done other ways.

    But that is not a weakness; in fact it is very useful as a measure of certainty. And keep in mind that the latest nets outperform humans at visual tasks like object recognition and cancer screening -- and probably would at tank detection.

    And again, neural nets are almost always deterministic. You might be thinking of the training procedure -- which is in fact stochastic.

    I don't know what you mean by "statistical recurrence estimate", I have never heard of it, and a quick google search doesn't show anything either.

    To give you an idea of the networks that I am considering: they would output a distribution of expected rewards over possible actions. The hardest part would be designing a reward system to teach the net.

    And I want to acknowledge that the point you raise about nets not being 100% trustworthy is valid. There is no way to be sure they will behave properly in new environments, but you can say the same thing about a hand-crafted algorithm. You can only be sure your algorithm works when you have tested it; therefore you can't be sure it will work well in untested environments.

    The thing is that everywhere a deep learning algorithm is applied, it outperforms hand crafted approaches on unseen environments. 

    The only way I see a mix of deep learning and SLAM is if limitations emerge when training the net. Then you could try to compensate for these limitations by feeding the net data from other algorithms.

    But the important thing here is that we should take a learning-first approach.
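    The "distribution of expected rewards over possible actions" I mention above could look roughly like this. It is only a hedged sketch: the action set, layer sizes, and random weights are invented placeholders standing in for a trained net.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical discrete action set for the drone.
ACTIONS = ["left", "right", "up", "down", "forward", "hover"]

# A tiny two-layer net: state vector in, one expected-reward estimate per action out.
n_state = 12
W1 = rng.normal(scale=0.1, size=(n_state, 32))
W2 = rng.normal(scale=0.1, size=(32, len(ACTIONS)))

def expected_rewards(state):
    hidden = np.maximum(0.0, state @ W1)   # ReLU hidden layer
    return hidden @ W2                     # one scalar per action

def act(state):
    # The controller simply takes the action with the highest expected reward.
    return ACTIONS[int(np.argmax(expected_rewards(state)))]

state = rng.normal(size=n_state)           # placeholder for the drone's perception
q = expected_rewards(state)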

  • @Carles, neural networks output a probability that they recognized a pattern. It is a statistical recurrence estimate, as Gary also mentions. The fact that it is non-deterministic is very relevant in drone applications, where you want to be absolutely sure of your drone's behaviour. You do not want unpredictability in its behaviour.

    Years ago I developed a neural net application to automatically recognize tanks (yes, it was for the army). I can tell you that the probability of getting it right was far, far from 100%. Much too dangerous to let a robot gun decide alone, with its neural net, whether a tank is friend or foe...

    The best approach is indeed to combine neural net "intelligence" with other deterministic sensors, so you get the best of both worlds.
