Many everyday tasks involve measuring the dimensions of objects, so dimension measurement has a broad range of applications, from fashion to space. As machines get better at perceiving the world, it becomes important for them to determine the dimensions of the objects around them as well. For this purpose, many systems such as robots use expensive sensors or a complex camera set-up that requires calibration with a known reference object.
This project uses a Deep Learning approach with two simple cameras (even webcams will do), eliminating the need for a reference object or expensive sensors in applications that require measuring both the dimensions of, and the distance to, complex objects in the real world.
Note: In this case, all dimensions will be in centimeters and millimeters, but you can easily change the unit of measurement in the program.
Here's an overview of the process that we are going to look at:
First, we have to determine the distance from the camera to the object. This is usually done with ultrasonic/LIDAR/IR sensors, but in this scenario we will eliminate the need for them by using a pair of similar cameras to take pictures of the object. The cameras must be placed/fixed at a known distance 'm' from each other. The camera setup is illustrated in the following figure:
The images from these cameras will be fed into a YOLO v2 model for detecting the objects within each image and drawing bounding boxes around them. Then, we will use the following formula for determining the distance from the first camera to the object in the image:

d = (b * m) / (b - a)

where:
d -> Distance from the first camera to the object
m -> Distance between the two cameras
a -> Pixel height of the object in the image from the first camera
b -> Pixel height of the same object in the image from the second camera
Feel free to check out the derivation of this formula below.
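As a quick sketch of this step in Python (the function name `distance_to_object` is mine, not from the project; `a`, `b`, and `m` follow the legend above, and the sample numbers are made up):

```python
def distance_to_object(a: float, b: float, m: float) -> float:
    """Distance from the first camera to the object.

    a -- pixel height of the object in the first camera's image
    b -- pixel height of the same object in the second (closer) camera's image
    m -- distance between the two cameras; the result is in the same unit
    """
    if b <= a:
        raise ValueError("the closer camera should see a taller image (b > a)")
    return (b * m) / (b - a)

# Hypothetical example: the object is 100 px tall in the first image and
# 125 px tall in the second, with the cameras 5 cm apart.
print(distance_to_object(a=100, b=125, m=5))  # 25.0 (cm)
```

Note that the denominator (b - a) shrinks as the object gets farther away relative to m, so small pixel errors in the bounding boxes translate into large distance errors for distant objects.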
The bounding box from the YOLO model gives the positions of the edges required by this formula. Then, the height of the object in the image (in pixels) is measured from the bounding box. After that, use the following formula to obtain the real-world height of the object (in cm):

Real-world height = (pixel height * d) / f
And then the real-world width of the object in the image is measured by:

Real-world width = (pixel width * d) / f
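The size step can be sketched the same way. One caveat (my assumption, not stated explicitly above): for a pixel-valued height to be divided by `f` consistently, the focal length must also be expressed in pixel units (OpenCV's camera matrix reports it this way); if `f` is in millimeters, convert the pixel height to sensor millimeters first using the pixel pitch.

```python
def real_dimension(pixels: float, d: float, f_px: float) -> float:
    """Real-world height or width of the object.

    pixels -- bounding-box height (or width) in pixels
    d      -- distance from the camera to the object (e.g. in cm)
    f_px   -- focal length expressed in pixels
    Returns the size in the same unit as d.
    """
    return (pixels * d) / f_px

# Hypothetical numbers: a 320 px tall box, 25 cm away, f = 500 px.
print(real_dimension(pixels=320, d=25, f_px=500))  # 16.0 (cm)
```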
The following two pictures were taken with a OnePlus phone moved forward by 5 cm, instead of with two cameras placed 5 cm apart, as I did not have access to two webcams. Downloading these images from this page might not give the exact results, so the same images are provided in the pics/ directory of the GitHub repository for this project. Feel free to use them.
The original dimensions of this phone (Redmi Note 4), along with the case, were 16 cm x 9 cm, and this model determined them as 15.5 cm x 8.5 cm. Feel free to verify the numbers, but be careful if you are using a single camera and moving it instead of placing two cameras.
For this, please look into the
README.md file in the GitHub repository for this project.
To begin with, let's focus on the camera towards the left of the diagram. The camera contains a lens whose image is produced on a CMOS sensor kept at the focal length of the lens, f. Initially, the camera captures the object of height h at position 0 (the initial position), and the height of that object's image produced on the CMOS sensor will be a. The angle between the actual object and the principal axis of the lens (let's call it θ1) will be the same as the angle between the image produced on the CMOS sensor and the principal axis.
Then, when we move the camera by a distance m towards the object (or use the second camera placed at a distance m from the first one), although the object is the same size, it appears bigger in the image produced on the CMOS sensor. For demonstration purposes, we can look at it the other way around, where the object appears to move towards the camera by the same distance m and appears with height b on the CMOS sensor. (If I tried to move the lens in the diagram, it would complicate the rays near the lens and make the diagram difficult to understand.) So, the object that appears with height b on the CMOS sensor will subtend the same angle θ2 with the principal axis as the actual object does in front of the camera.
With simple high-school trigonometry, we can determine that tan(θ1) on the CMOS sensor side will be a/f, where a is the height of the object's image on the CMOS sensor and f is the focal length of the lens (where the CMOS sensor is also placed). So, tan(θ1) = a/f on the CMOS sensor and lens side (inside the camera). Outside the camera (between the lens and the object), tan(θ1) will be h/d, where h is the real height of the object (we don't actually need to know this; it is only used for deriving the final formula and will be eliminated in the later stages of this derivation). So, tan(θ1) = h/d on the side between the lens and the actual object.
From the above, it is evident that
tan(θ1) = a/f as well as
tan(θ1) = h/d. This can be represented as:
a/f = tan(θ1) = h/d. And it further implies that
a/f = h/d (let's call this as equation 1).
In a similar way, for the object at position 1 (from the image captured by the camera in front), on the side between the CMOS sensor and the lens, tan(θ2) will be b/f, where b is the height of the image produced on the CMOS sensor and f is the focal length of the lens used in the camera. So, tan(θ2) = b/f on the CMOS sensor side. And outside the camera (between the lens and the object), tan(θ2) will be h/(d-m), where h is the height of the object, d is the distance between the object and the lens, and m is the distance that the lens (in the derivation illustration, the object) has moved. So, tan(θ2) = h/(d-m) on the outside of the camera.
Therefore, it is evident that
tan(θ2) = b/f and
tan(θ2) = h/(d-m) as well. So, we can represent these two equations as
b/f = tan(θ2) = h/(d-m) as
tan(θ2) is common. And further, it can be implied that
b/f = h/(d-m) (let's call this as equation 2).
When we divide equation 1 by equation 2, we get:

(a/f) / (b/f) = (h/d) / (h/(d-m))

which simplifies to:

a/b = (d-m)/d

Rearranging for d gives the distance formula used earlier:

d = (b * m) / (b - a)
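A quick numeric sanity check of this algebra (a standalone sketch; all values are made up): pick ground-truth values for h, d, m, and f, generate a and b from equations 1 and 2, and confirm that the final formula recovers d.

```python
# Arbitrary (hypothetical) ground-truth values.
h = 16.0   # real object height, cm
d = 25.0   # true distance from the first camera, cm
m = 5.0    # camera separation, cm
f = 0.5    # focal length, cm

# Image heights from equation 1 (a/f = h/d) and equation 2 (b/f = h/(d - m)).
a = f * h / d
b = f * h / (d - m)

# The derived formula should recover the original distance d.
recovered = (b * m) / (b - a)
print(recovered)  # 25.0, up to floating-point rounding
```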
The distance found in the previous step will then be used to calculate the real height and width of the object using the lens magnification formula
Hi/Ho = Di/Do, where
'Hi' is the height of the image (in pixels, given by YOLO's bounding box),
'Ho' is the real height of the object that we have to find,
'Di' is the distance between the lens to the image of the object that is formed (in this case, it is the focal length
'f', as the image is formed on the CMOS sensor of the camera that is placed at the focal length of the lens), and
'Do' is the distance between the object and the lens (that is determined as
'd' in the previous step). So, this formula can be re-written as:
Hi/Ho = f/d
which can be further written in terms of
'Ho' (real height of the object) as:
Ho = (Hi * d)/f
Or in other words,
Real-world height of the object = (Height of the image x real-world distance to the object)/(Focal length of the lens of the camera used)
The same formula can be used to find the width of the object too.
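Putting both steps together in one sketch (`measure_object` is an illustrative helper of mine; the focal length is assumed to be in the same pixel units as the bounding-box heights, and all numbers are hypothetical):

```python
def measure_object(a_px: float, b_px: float, m: float, f_px: float) -> tuple:
    """Return (distance, real_height) from two bounding-box heights.

    a_px -- bounding-box height in the first camera's image (pixels)
    b_px -- bounding-box height in the second, closer camera's image (pixels)
    m    -- separation between the two cameras (cm)
    f_px -- focal length expressed in pixels
    """
    d = (b_px * m) / (b_px - a_px)   # distance from the first camera
    height = (a_px * d) / f_px       # Ho = (Hi * d) / f for the first image
    return d, height

# Hypothetical bounding boxes: 100 px and 125 px tall, cameras 5 cm apart.
print(measure_object(a_px=100, b_px=125, m=5, f_px=500))  # (25.0, 5.0)
```

The same helper applies to width: pass the bounding-box widths as `a_px`/`b_px` (or reuse the distance already computed from the heights).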