In the last post, just two days ago, I talked about the fundamental matrix and the homography, which allow 2D images to be warped in such a way that they overlap. That technique works a bit better if you take the photos from the same viewpoint (more like a tripod looking around), because there will be less perspective distortion.

In this post, I'll discuss in a bit more detail how 3D reconstructions are made. Using some photos from the same dataset as before, it will become apparent what good features are and how these eventually result in good or bad data. I'll try to upload them at maximum resolution so you can zoom in a bit this time. Warning: they're pretty big, 8000x3000. I hope the upload keeps the original size.

This picture shows the inliers for the fundamental matrix solution of this image pair. In the previous post I discussed how algorithms like SIFT and SURF recognize features on the basis of local gradients. In this image you should be able to see exactly how these 200 features are recognized and what a good feature is. As you can see, areas with poor local gradients aren't matched easily; my algorithm prunes these very early. Features that are really excellent and unique are the shadows on the ground. That's because they lie on flat areas, so their gradients don't change, and the sun through the leaves creates irregular shapes that are good matches at full scale, but also at the increasing scales these feature recognition algorithms work through. The irregular rooftop is also a great source of features. It's easy to see some larger areas that probably did match, but don't stand out as very strong keypoints.

What does this mean? It's important to select the right time of day for taking pictures! Hard cast shadows under a strong sun may cause local gradients to disappear; a very low sun with soft shadows may not give enough light for a suitable shutter time; and with the sun right above your head the shadows may be minimal. What defines good shadows for perfect 3D reconstructions would be a great area for research.

So how did I get from this image to the one above, where the points are triangulated in 3D? From the fundamental matrix we can find the essential matrix, provided the camera is calibrated (focal length and image center point). From there, the camera projection matrix P is derived for each image. Together, those projection matrices describe the rotation and translation from one camera to the other in 3D space. The input used to derive them is a list of normalized 2D coordinates from both images (our pixel matches!). Normalized means: the camera calibration has been divided out and radial distortion removed.
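
As a rough illustration of the step from fundamental matrix to camera pose, here is a minimal numpy sketch. It is not necessarily how the pipeline behind these images does it. It assumes the convention `x2^T F x1 = 0` in pixel coordinates and 3x3 calibration matrices `K1` and `K2`; the function names are made up for this example.

```python
import numpy as np

def essential_from_fundamental(F, K1, K2):
    """E = K2^T F K1, assuming x2^T F x1 = 0 holds in pixel coordinates."""
    E = K2.T @ F @ K1
    # Enforce the essential-matrix structure: two equal singular values, one zero.
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

def decompose_essential(E):
    """Return the four (R, t) candidates for the second camera, P2 = [R | t].

    The first camera is assumed to be P1 = [I | 0].
    """
    U, _, Vt = np.linalg.svd(E)
    # Flip signs where needed so that the rotations are proper (det = +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R_a = U @ W @ Vt
    R_b = U @ W.T @ Vt
    t = U[:, 2]  # unit length: the true baseline length cannot be recovered
    return [(R_a, t), (R_a, -t), (R_b, t), (R_b, -t)]
```

Only one of the four candidates places the triangulated points in front of both cameras, so in practice you triangulate a few matches with each candidate and keep the one that passes that test.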

From here, things become simple. Knowing the orientation of the cameras in 3D space (in this simple "2 camera virtual world"), we can triangulate the matched 2D points in each image, each of which is a projection of a 3D point, to derive the X, Y, Z coordinates of that 3D point itself.
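
The triangulation itself can be sketched with the standard linear (DLT) method. This is a minimal example under the assumption that `P1` and `P2` are 3x4 projection matrices and `x1`, `x2` are the matched normalized 2D coordinates; `triangulate_point` is a name chosen for this illustration, and real pipelines usually refine the result further.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear triangulation of one match; returns the 3D point (X, Y, Z)."""
    # Each view contributes two linear equations in the homogeneous 3D point.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With noisy matches the two rays never intersect exactly, so this gives the point that best satisfies the four equations in a least-squares sense.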

It's important to mention here that the actual position of that 3D point in this system depends directly on the distance between the cameras. Unfortunately we don't know that yet, so the scale at this moment is arbitrary. For applications like stereo vision this distance *is* known, which is why such rigs can derive pretty accurate depth information from triangulated points. In our case, we could scale the translation part of our P matrix according to our GPS 'guesses'.
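
The scale ambiguity is easy to check numerically: scaling the translation and all 3D points by the same factor leaves every projection unchanged. The sketch below assumes the first camera sits at the origin, so the camera separation equals the length of `t`; `project` and `rescale_to_baseline` are hypothetical helper names, and `baseline` stands for whatever distance you trust (for example one derived from GPS positions).

```python
import numpy as np

def project(R, t, points):
    """Project Nx3 points into a camera [R | t]; returns Nx2 normalized coordinates."""
    cam = points @ R.T + t
    return cam[:, :2] / cam[:, 2:3]

def rescale_to_baseline(t, points, baseline):
    """Scale translation and points so the camera separation equals `baseline`."""
    s = baseline / np.linalg.norm(t)
    return s * t, s * points
```

Because the images look identical before and after rescaling, nothing in the two photos alone can tell you which scale is the right one.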

This is only a two-view solution. Through a process called "registration", though, it's possible to incorporate more cameras into this simple 3D space, because many of these cameras are linked to each other through the analysis of other image pairs. You'd apply the same process over and over again, every time creating more and more points and, of course, filtering out similar points. Each 3D point in such algorithms usually carries a list of the cameras that see it and the index of the corresponding 2D point in each, which is useful for refining the solution.
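
One possible shape for that per-point bookkeeping is sketched below. The `Track` class and `add_point` function are illustrative names, not taken from any particular library, and the merge rule (two points are "the same" if they share a camera and keypoint index) is just one simple assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    """A 3D point plus the 2D keypoints that observe it."""
    xyz: tuple
    observations: dict = field(default_factory=dict)  # camera id -> keypoint index

def add_point(tracks, xyz, observations):
    """Add a triangulated point, merging it into an existing track when it
    shares a (camera, keypoint) observation with one."""
    for track in tracks:
        if any(track.observations.get(cam) == kp for cam, kp in observations.items()):
            track.observations.update(observations)
            return track
    track = Track(xyz, dict(observations))
    tracks.append(track)
    return track
```

For example, a point seen as keypoint 5 in camera 0 and keypoint 9 in camera 1, followed by a point seen as keypoint 9 in camera 1 and keypoint 3 in camera 2, ends up as a single track observed by three cameras.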

About refining... what I did not discuss here is bundle adjustment, smoothing and point filtering. You can see outliers in my solution, and there's probably a bit of point jitter on surfaces that should be planar. There are techniques to remove unwanted points automatically (mostly statistical analysis methods) and to smooth the point cloud, as well as larger-scale adjustments that sweep through your entire solution to reduce the overall error and derive the best fit by manipulating the constraints you imposed during construction. For example, you could relax the constraints for a camera whose GPS fix had a very poor HDOP, whereas one with a good HDOP/VDOP at the time could get a more stringent constraint. In the process of finding a better solution, this leads to a model that doesn't so much converge to the best "average", but leans towards measurements known to be correct and puts little emphasis on possibly bad information that merely looks good.
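
To give an idea of what such a statistical filter can look like, here is a minimal sketch of a common approach: drop points whose average distance to their nearest neighbours is unusually large. The function name and the defaults for `k` and `std_ratio` are assumptions for this example, and the brute-force distance matrix is only practical for small clouds.

```python
import numpy as np

def remove_statistical_outliers(points, k=8, std_ratio=2.0):
    """Return (filtered Nx3 points, boolean keep mask)."""
    # Pairwise distances between all points (O(N^2) memory).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Sort each row; column 0 is the zero distance to the point itself.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    mean_knn = knn.mean(axis=1)
    # Keep points whose neighbourhood distance is not far above the average.
    threshold = mean_knn.mean() + std_ratio * mean_knn.std()
    keep = mean_knn <= threshold
    return points[keep], keep
```

For larger clouds you would swap the distance matrix for a spatial index such as a k-d tree, but the idea stays the same.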

## Comments

Awesome stuff--thank you for doing this--one question--do you have a camera recommendation?

Yep, this is good info. Thanks!

@Gary Good idea!

This is really valuable information. Thanks for sharing Gerard!

Keep it coming Gerard, this is great stuff that could sit on Randy's new wiki page as well: http://planner.ardupilot.com/wiki/common-3d-mapping/