Project Documentation
## Object dimension measurement by its image

Many everyday tasks involve measuring the dimensions of objects of some kind, so dimension measurement has a broad range of applications, from fashion to space. As machines get better at perceiving the world, it becomes important for them to determine the dimensions of the objects around them as well. For this purpose, technologies such as robots typically use expensive sensors or a complex camera set-up that requires calibration with a known reference object.

This project takes a Deep Learning approach with two simple cameras (even webcams), eliminating the need for a reference object or expensive sensors in applications that require measuring both the dimensions of and the distance to arbitrary objects in the real world.

*Note: For this case, all the dimensions will be in centimeters and millimeters, but you can easily change the unit of measurement in the program.*

Here's an overview of the process that we are going to look at:

At first, we will have to determine the distance between the camera and the object. This is usually done using Ultrasonic/LIDAR/IR sensors, but in this scenario we will eliminate the need for them by using a pair of identical cameras to take pictures of the object. The cameras must be placed/fixed at a known distance `m` from each other. The camera setup is illustrated in the following figure:

The images from these cameras will be fed into a YOLO v2 model to detect the objects within the image and draw a bounding box around each of them. Then we will use the following formula to determine the distance from the cameras to the object in the image: `d = m/(1 - a/b)`

Where,

d -> Distance from the camera to the object

m -> Distance between the two cameras

a -> Pixel height of the object in the image from the first camera

b -> Pixel height of the same object in the image from the second camera
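Under these definitions, the formula can be sketched as a small Python helper (the function and argument names here are illustrative, not part of the project's code):

```python
def distance_to_object(m_cm: float, a_px: float, b_px: float) -> float:
    """Distance from the first camera to the object, in the same unit as m_cm.

    m_cm: distance between the two cameras (e.g. in cm)
    a_px: pixel height of the object in the first (farther) camera's image
    b_px: pixel height of the same object in the second (closer) camera's image
    """
    if b_px <= a_px:
        raise ValueError("the closer camera must see a taller image (b > a)")
    return m_cm / (1.0 - a_px / b_px)

# Example: cameras 5 cm apart; the object appears 180 px tall in the first
# image and 200 px tall in the second: d = 5 / (1 - 180/200) = ~50 cm.
print(distance_to_object(5.0, 180.0, 200.0))
```

Note that the closer camera must always produce the taller image; if `a >= b`, the geometry (and the formula) breaks down.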

Feel free to check out the derivation of this formula below.

The bounding box from the YOLO model gives the positions of the edges required by this formula. The height of the object in the image (in pixels) is then measured from the bounding box. After that, the following formula gives the real-world height of the object (in cm): `Ho = (Hi * d)/f`
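For instance, if the detector returns a box as `(x1, y1, x2, y2)` corner coordinates (one common convention; YOLO implementations differ, so treat this as an assumption), the pixel height is just the vertical extent of the box:

```python
def pixel_height(box):
    """Pixel height of a detection box given as (x1, y1, x2, y2) corners."""
    x1, y1, x2, y2 = box
    return abs(y2 - y1)

print(pixel_height((40, 30, 120, 230)))  # → 200
```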

The real-world width of the object is then obtained in the same way, using the bounding box's pixel width in place of its pixel height.

Feel free to check out the derivation of these formulas below.

The following two pictures were taken by a OnePlus phone moved forward by 5 cm, instead of having two cameras placed 5 cm apart, as I did not have access to two webcams. Downloading these images for testing might not give the exact results, so the same images are provided in the `pics/` directory of the GitHub repository for this project. Feel free to use them.

The original dimensions of the photographed phone (a Redmi Note 4), including its case, were `16 cm x 9 cm`, and this model determined them as `15.5 cm x 8.5 cm`. Feel free to verify the numbers. But be careful if you are using a single camera and moving it instead of placing two cameras; for this, please look into the `README.md` file in the GitHub repository for this project.

- There will be a Django-based web app for you to try out the model using a web-based UI.
- There will also be a video with a Raspberry Pi setup, demonstrating the use of two fixed webcams. All the code will be made open-source so that you can build other projects on top of this technology.
- The model right now only provides the real-world 2D dimensions of the objects. 3D dimension estimation will be incorporated too. Feel free to look into this paper that shows how we can get a 3D bounding box estimate.

To begin with, let's focus on the camera towards the left of the diagram. The camera contains a lens whose image is produced on a CMOS sensor placed at the focal length `f` of the lens. Initially, the camera captures the object of height `h` at `position 0` (the initial position), and the height of that object's image produced on the CMOS sensor will be `a`. The angle between the actual object and the principal axis of the lens (let's call it `θ1`) will be the same as the angle between the image produced on the CMOS sensor and the principal axis.

Then, when we move the camera by distance `m` towards the object (or use the second camera placed at a distance `m` from the first one), although the object stays the same size, it appears bigger in the image produced on the CMOS sensor. For demonstration purposes, we can look at it the other way around, where the object appears to move towards the camera by the same distance `m` and appears with height `b` on the CMOS sensor. (If I tried to move the lens in the diagram, it would complicate the rays near the lens and make it difficult to understand.) So, the object that appears with height `b` on the CMOS sensor subtends the same angle `θ2` with the principal axis as the actual object does in front of the camera.

With simple high-school trigonometry, we can determine that `tan(θ1)` on the CMOS sensor side will be `a/f`, where `a` is the height of the object's image on the CMOS sensor and `f` is the focal length of the lens (where the CMOS sensor is also placed). So, `tan(θ1) = a/f` between the CMOS sensor and the lens (inside the camera). Outside the camera (between the lens and the object), `tan(θ1)` will be `h/d`, where `h` is the real height of the object (we don't actually need to know `h`; it is only used for deriving the final formula and will be eliminated in the later stages of this derivation). So, `tan(θ1) = h/d` on the side between the lens and the actual object.

From the above, it is evident that `tan(θ1) = a/f` as well as `tan(θ1) = h/d`. This can be represented as `a/f = tan(θ1) = h/d`, which further implies that `a/f = h/d` (let's call this *equation 1*).

In a similar way, for the object at `position 1` (from the image captured by the camera in front), on the side between the CMOS sensor and the lens, `tan(θ2)` will be `b/f`, where `b` is the height of the image produced on the CMOS sensor and `f` is the focal length of the lens used in the camera. So, `tan(θ2) = b/f` on the CMOS sensor side. And outside the camera (between the lens and the object), `tan(θ2)` will be `h/(d-m)`, where `h` is the height of the object, `d` is the distance between the object and the lens, and `m` is the distance that the lens (in the derivation illustration, the object) has moved. So, `tan(θ2) = h/(d-m)` outside the camera.

Therefore, `tan(θ2) = b/f` and `tan(θ2) = h/(d-m)` as well. Since `tan(θ2)` is common, we can represent these two equations as `b/f = tan(θ2) = h/(d-m)`, which further implies that `b/f = h/(d-m)` (let's call this *equation 2*).

When we divide `equation 1` by `equation 2`, we get:

(a/f)/(b/f) = (h/d)/(h/(d-m))

which can be simplified further as,

a/b = (d - m) / d

a/b = 1 - m/d

m/d = 1 - a/b

And finally,

d = m/(1 - a/b)

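As a quick sanity check on this derivation, we can pick arbitrary values for `h`, `d`, `m`, and `f` (the numbers below are made up), generate `a` and `b` from equations 1 and 2, and confirm that the final formula recovers `d`:

```python
f = 0.4   # focal length of the lens
h = 16.0  # real height of the object (eliminated by the formula)
d = 50.0  # true distance from the first camera to the object
m = 5.0   # distance between the two camera positions

# Equations 1 and 2: image heights produced on the CMOS sensor.
a = f * h / d        # from a/f = h/d
b = f * h / (d - m)  # from b/f = h/(d - m)

# The derived formula should recover the true distance d.
d_estimated = m / (1.0 - a / b)
print(d_estimated)  # ≈ 50.0, matching d
```

Note that `h` and `f` cancel out entirely, which is exactly why no reference object or focal-length knowledge is needed for the distance step.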

The distance found in the previous step will be used further in calculating the real height and width of the object using the lens magnification formula `Hi/Ho = Di/Do`, where `Hi` is the height of the image (in pixels, given by YOLO's bounding box), `Ho` is the real height of the object that we have to find, `Di` is the distance between the lens and the image of the object (in this case, the focal length `f`, as the image is formed on the CMOS sensor placed at the focal length of the lens), and `Do` is the distance between the object and the lens (determined as `d` in the previous step). So, this formula can be rewritten as:

```
Hi/Ho = f/d
```

which can be further written in terms of `Ho` (the real height of the object) as:

```
Ho = (Hi * d)/f
```

Or in other words,

```
Real-world height of the object = (Height of the image x real-world distance to the object)/(Focal length of the lens of the camera used)
```

The same formula can be used to find the width of the object too.
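Putting the pieces together, here is a minimal sketch of the dimension step (names are illustrative; it assumes `f` is expressed in the same pixel units as the bounding-box sizes, so that the pixel units cancel and the result comes out in cm):

```python
def real_world_size(hi_px: float, wi_px: float, d_cm: float, f_px: float):
    """Real-world height and width (cm) of an object, via Hi/Ho = f/d.

    hi_px, wi_px: bounding-box height and width in pixels
    d_cm: distance to the object (from the stereo formula), in cm
    f_px: focal length of the camera expressed in pixels
    """
    ho_cm = hi_px * d_cm / f_px  # Ho = (Hi * d)/f
    wo_cm = wi_px * d_cm / f_px  # same formula applied to the width
    return ho_cm, wo_cm

# Example: a 200 x 112 px bounding box, object 50 cm away, focal length 625 px.
h_cm, w_cm = real_world_size(200, 112, 50, 625)
print(h_cm, w_cm)  # → 16.0 8.96
```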