Very nice use of the Microsoft Azure cloud AI APIs to make an AR.Drone do some pretty magical stuff. Excerpt from the full O'Reilly post, which has the full details and code:
Figure 3. Flying the drone in my “lab.” Source: Lukas Biewald.

The Azure Face API is powerful and simple to use. You can upload pictures of your friends and it will identify them. It will also guess age and gender, both of which I found to be surprisingly accurate. The latency is around 200 milliseconds, and it costs $1.50 per 1,000 predictions, which feels completely reasonable for this application. See below for my code that sends an image and does face recognition.
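That code is in the full post; as a stand-in, here is a minimal sketch of a detect call, assuming the Face API v1.0 REST endpoint (the region host, the FACE_API_KEY variable, and drone.png are placeholders, and identifying specific friends additionally goes through the API's person-group training and identify endpoints):

    // Sketch of a Face API detect call using the `request` package.
    // The endpoint region, FACE_API_KEY, and drone.png are placeholders.
    var fs      = require('fs');
    var request = require('request');

    function detectFaces(imagePath, callback) {
      fs.readFile(imagePath, function (err, imageData) {
        if (err) return callback(err);
        request.post({
          url: 'https://westus.api.cognitive.microsoft.com/face/v1.0/detect',
          qs: { returnFaceAttributes: 'age,gender' },
          headers: {
            'Ocp-Apim-Subscription-Key': process.env.FACE_API_KEY,
            'Content-Type': 'application/octet-stream'
          },
          body: imageData
        }, function (err, res, body) {
          if (err) return callback(err);
          // Each result carries a faceRectangle plus the requested attributes.
          callback(null, JSON.parse(body));
        });
      });
    }

    detectFaces('drone.png', function (err, faces) {
      if (!err) console.log(faces);
    });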
I used the excellent ImageMagick library to annotate the faces in my PNGs. There are a lot of possible extensions at this point; for example, there is an emotion API that can determine the emotion of faces.
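The annotation code isn't in this excerpt either; a sketch of the idea, assuming the imagemagick npm package and the faceRectangle fields returned by the detect call above (file names and the red stroke are arbitrary choices):

    // Sketch: draw a box around each detected face with ImageMagick's
    // convert, via the `imagemagick` npm package.
    var im = require('imagemagick');

    function annotateFaces(inputPng, outputPng, faces, callback) {
      var args = [inputPng, '-stroke', 'red', '-fill', 'none'];
      faces.forEach(function (face) {
        var r = face.faceRectangle;
        args.push('-draw', 'rectangle ' + r.left + ',' + r.top + ' ' +
                  (r.left + r.width) + ',' + (r.top + r.height));
      });
      args.push(outputPng);
      im.convert(args, callback); // shells out to ImageMagick's convert
    }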
Running speech recognition to drive the drone

The trickiest part about doing speech recognition was not the speech recognition itself, but streaming audio from a webpage to my local server in the format Microsoft’s Speech API wants, so that ends up being the bulk of the code. Once you’ve got the audio saved with one channel and the right sample frequency, the API works great and is extremely easy to use. It costs $4 per 1,000 requests, so for hobby applications, it’s basically free.
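Once the audio is in shape, the recognition step is a single POST. The sketch below is a best-effort reconstruction of the Bing Speech REST API of that era, which has since been retired, so treat the URL and query parameters as assumptions; `token` is a bearer token from the Cognitive Services issueToken endpoint:

    // Sketch of the recognition POST. Endpoint and query parameters are
    // assumptions based on the old Bing Speech REST API.
    var fs      = require('fs');
    var request = require('request');

    function recognize(wavPath, token, callback) {
      request.post({
        url: 'https://speech.platform.bing.com/recognize',
        qs: {
          version: '3.0',
          locale: 'en-US',
          scenarios: 'ulm',
          format: 'json',
          requestid: 'b2c95ede-97eb-4c88-81e4-80f32d6aee54' // should be a fresh GUID per request
        },
        headers: {
          'Authorization': 'Bearer ' + token,
          'Content-Type': 'audio/wav; codec="audio/pcm"; samplerate=16000'
        },
        body: fs.readFileSync(wavPath)
      }, function (err, res, body) {
        if (err) return callback(err);
        var parsed = JSON.parse(body);
        // The old response format put the transcription in results[0].name.
        callback(null, parsed.results && parsed.results[0] && parsed.results[0].name);
      });
    }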
RecordRTC is a great library, and it’s a good starting point for doing client-side web audio recording. On the client side, we can add code to save the audio file:
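A sketch of that client-side code, using RecordRTC's real API (startRecording, stopRecording, getBlob); the /upload route and the three-second recording window are placeholders:

    // Record a few seconds of microphone audio and post it to the local server.
    navigator.mediaDevices.getUserMedia({ audio: true }).then(function (stream) {
      var recorder = RecordRTC(stream, { type: 'audio' });
      recorder.startRecording();

      setTimeout(function () {
        recorder.stopRecording(function () {
          var form = new FormData();
          form.append('audio', recorder.getBlob(), 'command.wav');
          fetch('/upload', { method: 'POST', body: form }); // placeholder route
        });
      }, 3000); // three seconds is an arbitrary window
    });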
I used the FFmpeg utility to downsample the audio and combine it into one channel for uploading to Microsoft:
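A sketch of that step from Node (the FFmpeg flags are real: -ac 1 collapses to one channel and -ar 16000 resamples to 16 kHz; file names are placeholders):

    // Downsample to 16 kHz mono WAV for the Speech API.
    var exec = require('child_process').exec;

    exec('ffmpeg -y -i command.wav -ac 1 -ar 16000 downsampled.wav',
         function (err) {
           if (err) throw err;
           // downsampled.wav is now ready to upload.
         });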
While we’re at it, we might as well use Microsoft’s text-to-speech API so the drone can talk back to us!
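A minimal sketch of a synthesis request, assuming the Bing Speech text-to-speech endpoint of that era (the synthesize URL, SSML voice name, and output-format header are best-effort reconstructions; `token` is the same bearer token used for recognition):

    // Sketch of a text-to-speech request against the old Bing Speech API.
    var fs      = require('fs');
    var request = require('request');

    function speak(text, token, callback) {
      var ssml =
        '<speak version="1.0" xml:lang="en-US">' +
          '<voice xml:lang="en-US" name="Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)">' +
          text +
          '</voice>' +
        '</speak>';
      request.post({
        url: 'https://speech.platform.bing.com/synthesize',
        headers: {
          'Authorization': 'Bearer ' + token,
          'Content-Type': 'application/ssml+xml',
          'X-Microsoft-OutputFormat': 'riff-16khz-16bit-mono-pcm'
        },
        body: ssml,
        encoding: null // keep the audio response as a binary Buffer
      }, function (err, res, audio) {
        if (err) return callback(err);
        fs.writeFile('reply.wav', audio, callback); // play it back to the pilot
      });
    }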
Autonomous search paths

I used the ardrone-autonomy library to map out autonomous search paths for my drone. After crashing my drone into the furniture and houseplants one too many times in my living room, my wife nicely suggested I move my project to my garage, where there is less to break, but there isn’t much room to maneuver (see Figure 3).
When I get a bigger lab space, I’ll work more on smart searching algorithms, but for now I’ll just have my drone take off and rotate, looking for my friends and enemies:
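A sketch of that takeoff-and-rotate loop using node-ar-drone's client API (createClient, takeoff, after, clockwise, stop, and land are the library's real calls; the timings are arbitrary, and the actual face search would hook into the video stream):

    // Take off, spin slowly while the camera scans for faces, then land.
    var arDrone = require('ar-drone');
    var client  = arDrone.createClient();

    client.takeoff();
    client
      .after(3000, function () {
        this.clockwise(0.5); // rotate in place, scanning
      })
      .after(10000, function () {
        this.stop();
        this.land();
      });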
Comments
Voice control technology never matured due to problems with semantics and lexicology.
Years ago, the Opera web browser implemented a gesture-control plugin that let you control an open web page with facial gestures.
Do we control web browsing with face gestures?
No.
A keyboard, mouse, joystick, or scroll wheel provides a 10-100x faster communication channel than plain verbal communication.
Voice control is ok for controlling your lift to select the floor: 3rd floor, open, close, stop...
Voice control should be avoided for cars, drones, and boats, since in an emergency, verbal communication isn't backed by a non-verbal channel that would let the stupid machine act quickly enough to prevent the crash.
Face recognition algorithms adopted by security monitoring cams have never worked well.
It's not smart to ask the drone to find Chris and fly to his face.
It looks like the good old guys at Microsoft and Intel just retired, and the kids have started playing with new technologies.
Did you mean to ask how much Microsoft pays for blog comments? Or is the astroturfing conspiracy even more twisted and sinister than I can imagine?! Either way, diydrones.com comments aren't worth much, unfortunately.
Which brings me to Amazon's Alexa Skills Kit, which offers a very similar service of intent classification and entity extraction. I compared the two services early this year, so things may have changed, but I thought Amazon's had a nicer syntax for defining your intents and entities, while Microsoft's was more flexible because you could give it text input, while Amazon's only accepted audio.
How much does Microsoft charge per 1000 blog comments?
It doesn't look like the code is actually available, unless I missed it.
Microsoft also has a very easy to use intent classification and entity extraction service, LUIS, which is perfect for taking text from speech recognition, possibly containing recognition errors, and finding the best match against a collection of intents. E.g. if you give it 10 examples of how someone might tell the drone to look for a person, and then you give it as input the text "look board chris" (where the speech recognizer misheard "board" instead of "for"), it can classify the input as a request to look for a person, and it can even pull out the person's name. It's a great combination of robustness to errors and flexibility.
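To make that concrete, here's a sketch of what a LUIS query looks like (assuming the v2.0 REST endpoint shape; LUIS_APP_ID, LUIS_KEY, and the LookForPerson intent name are placeholders for whatever you define in your own LUIS app):

    // Sketch of a LUIS query. App ID, key, and intent name are placeholders.
    var request = require('request');

    request.get({
      url: 'https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/' +
           process.env.LUIS_APP_ID,
      qs: { 'subscription-key': process.env.LUIS_KEY, q: 'look board chris' },
      json: true
    }, function (err, res, body) {
      if (err) throw err;
      console.log(body.topScoringIntent); // e.g. { intent: 'LookForPerson', score: 0.97 }
      console.log(body.entities);         // e.g. [{ entity: 'chris', type: 'Person' }]
    });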
Also relevant, my post about a speech interface to a drone that also used node-ar-drone: http://diydrones.com/profiles/blogs/voice-controlled-drone