### A two-view 3D reconstruction


In the last post, just two days ago, I talked about the fundamental matrix and a homography that allowed 2D images to be warped so that they overlap. That technique works better if you take the photos from the same viewpoint (like a camera on a tripod looking around), because then there is little or no parallax between the photos.

In this post, I'll go a bit further into how 3D reconstructions are made. Using some photos from the same dataset as before, it will become apparent what good features are and how they eventually result in good or bad data. I'll try to upload the images at maximum resolution so you can zoom in a bit this time. Warning: they're pretty big, 8000x3000. I hope the upload keeps the original size.

So how did I get from this image to the one above, where the points are triangulated in 3D? If the camera is calibrated (focal length and image center point known), the essential matrix can be found from the fundamental matrix. From the essential matrix, a camera projection matrix P is derived for each image. Together, these projection matrices describe the rotation and translation from one camera to the other in 3D space. The input for all of this is a list of normalized 2D coordinates from both images (our pixel matches!). Normalized means that the camera intrinsics (focal length and image center) have been factored out of the pixel coordinates, ideally with the radial distortion removed as well.
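A minimal sketch of that chain in NumPy, assuming the convention `x2^T F x1 = 0` for pixel coordinates and 3x3 calibration matrices `K1` and `K2`. The function names are made up for illustration, and this is not necessarily how the images in this post were produced.

```python
import numpy as np

def essential_from_fundamental(F, K1, K2):
    """E = K2^T F K1, assuming x2^T F x1 = 0 holds for pixel coordinates."""
    return K2.T @ F @ K1

def normalize_points(pts_px, K):
    """Map Nx2 pixel coordinates to normalized image coordinates using K^-1.
    Lens distortion is NOT handled here; undistort the points first if needed."""
    pts_h = np.column_stack([pts_px, np.ones(len(pts_px))])
    return (np.linalg.inv(K) @ pts_h.T).T[:, :2]

def decompose_essential(E):
    """Return the four candidate (R, t) pairs encoded by E.
    t has unit length because the overall scale cannot be recovered."""
    U, _, Vt = np.linalg.svd(E)
    # Flip signs where needed so that the results are proper rotations (det = +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

Only one of the four candidates puts the triangulated points in front of both cameras, so in practice you triangulate a few matches with each candidate and keep the one that passes that test. With the first camera fixed at the origin, the two projection matrices in normalized coordinates are then `[I | 0]` and `[R | t]`.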

From here, things start to become simple. Knowing the position and orientation of the cameras in 3D space (in this simple "2 camera virtual world"), we can triangulate each pair of matched 2D points, which are basically two projections of the same 3D point, to derive the X,Y,Z coordinates of that 3D point itself.
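One common way to do this is linear (DLT) triangulation. The sketch below assumes NumPy, two 3x4 projection matrices and one matched point pair expressed in the same coordinates as those matrices. `triangulate_point` is an illustrative name, and real pipelines usually refine this linear estimate further.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 projection matrices.
    x1, x2: the matching 2D points (x, y) in image 1 and image 2.
    Returns the 3D point as a length-3 array.
    """
    # Each view contributes two linear equations in the homogeneous point X.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The least-squares solution is the right singular vector that belongs
    # to the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With noisy matches the two rays never intersect exactly, which is why this solves for the best fit instead of a true intersection.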

It's important to mention here that the actual position of that 3D point in this system depends directly on the distance between the cameras (the baseline). Unfortunately we don't know that yet, so the scale at this moment is arbitrary. For applications like stereo vision this distance *is* known, which is why such machines can derive pretty accurate depth information from triangulated points. In our case, we could scale the translation in our P matrix according to our GPS 'guesses'.
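To make the scale ambiguity concrete: multiplying both the camera translation and the 3D points by the same factor leaves every 2D projection unchanged, so the images alone cannot tell you the scale. The sketch below assumes NumPy; `known_baseline` is a hypothetical camera-to-camera distance, for instance estimated from two GPS fixes.

```python
import numpy as np

def project(R, t, X):
    """Project 3D point X into a camera with rotation R and translation t
    (normalized image coordinates)."""
    x = R @ X + t
    return x[:2] / x[2]

def rescale_reconstruction(points, t, known_baseline):
    """Scale Nx3 points and the translation t so that the camera distance
    equals known_baseline (hypothetical value, e.g. derived from GPS)."""
    s = known_baseline / np.linalg.norm(t)
    return points * s, t * s
```

For example, `project(R, t, X)` and `project(R, 5 * t, 5 * X)` give the same 2D point, which is exactly why the baseline has to come from somewhere else.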

This is only a two-view solution. Through a process called "registration", though, it's possible to incorporate more cameras into this simple 3D space, because many of these cameras are linked to each other through matches found in other image pairs. You'd apply the same process over and over again, each time creating more points and of course filtering out duplicates. In such algorithms, each 3D point usually carries a list of the cameras that see it, together with the index of the corresponding 2D point in each of those images, which is useful for refining the solution.
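That per-point list could look something like the sketch below. `Track` and `PointCloud` are hypothetical names for illustration, not taken from any particular library, and the merging logic is deliberately naive.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    """One 3D point plus the 2D observations that produced it."""
    xyz: tuple
    # camera id -> index of the matching 2D keypoint in that camera's image
    observations: dict = field(default_factory=dict)

class PointCloud:
    def __init__(self):
        self.tracks = []
        self._lookup = {}  # (camera id, keypoint index) -> index into self.tracks

    def add_match(self, cam_a, kp_a, cam_b, kp_b, xyz):
        """Record a triangulated match. If either keypoint already belongs
        to a track, extend that track instead of creating a duplicate point.
        (The case where both keypoints belong to different tracks is ignored.)"""
        key_a, key_b = (cam_a, kp_a), (cam_b, kp_b)
        idx = self._lookup.get(key_a, self._lookup.get(key_b))
        if idx is None:
            idx = len(self.tracks)
            self.tracks.append(Track(xyz=xyz))
        track = self.tracks[idx]
        track.observations[cam_a] = kp_a
        track.observations[cam_b] = kp_b
        self._lookup[key_a] = idx
        self._lookup[key_b] = idx
        return idx
```

When a third camera matches a keypoint that is already in a track, the existing 3D point simply gains another observation, and a later refinement step can use all of those observations together.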

About refining... what I did not discuss here is bundle adjustment, smoothing and point filtering. You can see outliers in my solution, and there's probably a bit of point jitter on surfaces that should be planar. There are techniques to remove unwanted points automatically (mostly statistical analysis methods) and to smooth the point cloud, and there are larger scale adjustments that sweep through your entire solution to reduce the overall error and derive the best fit by manipulating the constraints that you imposed during the construction. For example, you could relax the constraints for a camera whose GPS fix had a very poor HDOP (horizontal dilution of precision), whereas one with a good HDOP/VDOP at the time could get a more stringent constraint. In the process of finding a better solution, this eventually leads to a model that doesn't converge so much to the best "average", but one that leans more towards known correct measurements and puts less emphasis on possibly bad information that merely looks good.
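As an example of the statistical kind of point filtering, here is a rough sketch that drops points whose average distance to their nearest neighbours is unusually large. It assumes NumPy and a small cloud, since the brute-force distance matrix needs O(N²) memory. The parameters `k` and `std_ratio` are illustrative defaults, not tuned values.

```python
import numpy as np

def remove_statistical_outliers(points, k=8, std_ratio=2.0):
    """Filter an Nx3 point cloud by mean distance to the k nearest neighbours.

    A point is dropped when its mean neighbour distance exceeds the global
    mean of that quantity by more than std_ratio standard deviations.
    Returns (filtered_points, keep_mask).
    """
    # Brute-force pairwise distances; fine for a few thousand points.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # After sorting each row, column 0 is the zero distance to the point itself.
    knn = np.sort(dists, axis=1)[:, 1:k + 1]
    mean_knn = knn.mean(axis=1)
    threshold = mean_knn.mean() + std_ratio * mean_knn.std()
    keep = mean_knn <= threshold
    return points[keep], keep
```

For larger clouds you would swap the distance matrix for a spatial index such as a k-d tree. Bundle adjustment itself is a much bigger topic and is not covered by this sketch.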