Computer vision: how it works

Looking at a picture, almost without thinking, we can say what we see. We can tell a house from a tree or a mountain. We understand which object is closer to us and which is farther away. We recognise that the house’s walls are red and the tree’s leaves are green. We can say with confidence that the picture is a landscape and not a portrait or still life. And we come to all these conclusions in a matter of seconds.
There are many tasks that computers perform better than humans. They are much quicker at performing calculations. Yet the seemingly simple task of finding a house or a mountain in a picture can leave a machine stumped. Why does this happen?
People learn to recognise objects – that is, find and distinguish them from others – all their lives. They see houses, trees and mountains countless times: not only in real life but also in pictures, photos and films. They know how various objects look from different angles and in different lighting.
Machines were created to work with numbers. The need for them to have vision arose relatively recently. To identify licence plate numbers, read bar codes on packages in a supermarket, analyse surveillance camera recordings, find faces in photographs, teach robots how to spot and bypass obstacles – to perform all these tasks a computer needs to “see” and interpret what it sees. The range of methods allowing a computer to be taught to retrieve information from an image, whether a picture or a video recording, constitutes what is termed computer vision.

House hunting

For a computer to find, say, a house in a picture, it has to be taught to do so. And that requires compiling a training set of images. This training set has to be large enough, because a machine cannot learn from just a couple of examples, and it has to be representative, meaning it reflects the nature of the data to be worked with. It must also include both positive examples (“there’s a house in this picture”) and negative ones (“there’s no house in this picture”).
After the training set has been put together, machine learning comes into play. During training, the computer analyses the images from the set, determines which features or combinations of features indicate that a picture contains a house, and calculates their importance. If the training has been successful, which is confirmed by testing, the machine can apply its acquired knowledge in practice – that is, find a house in any picture.
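To make this step concrete, here is a minimal sketch of training and testing such a model, assuming each image has already been turned into a fixed-length feature vector (how that is done is covered in the next section). The file names and the choice of scikit-learn are purely illustrative, not part of any particular system.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# X: one feature vector per image; y: 1 = "there's a house", 0 = "no house"
X = np.load("image_features.npy")   # hypothetical pre-computed features
y = np.load("image_labels.npy")     # hypothetical labels

# Hold out part of the set to check whether the training was successful
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

classifier = LinearSVC()
classifier.fit(X_train, y_train)            # the machine studies the examples
print("accuracy:", classifier.score(X_test, y_test))  # confirmed by testing
```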

Image analysis

For a person, it’s easy to separate what’s important in a picture from what isn’t. For a computer, this is much more complicated: where people see images, machines work with numbers. To a computer, a picture is a set of pixels, each of which has its own brightness and colour values. To translate the contents of a picture into a format a machine can work with, the image is processed using special algorithms.
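A small illustration of what a picture looks like to a machine: a grid of numbers. The libraries (Pillow and NumPy) and the file name are just examples.

```python
from PIL import Image
import numpy as np

img = Image.open("landscape.jpg")    # hypothetical input picture
pixels = np.asarray(img)             # shape: (height, width, 3) for an RGB image

print(pixels.shape)   # e.g. (480, 640, 3)
print(pixels[0, 0])   # red, green and blue values of the top-left pixel
```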
First, the potentially important parts of the picture are identified – those that might be objects, as well as their borders. There are several ways to do this. One of them is the Difference of Gaussians algorithm, or DoG, in which the image is subjected to Gaussian blurring a few times, each time using a different blurring radius. The results are then compared. This allows the most contrasting fragments to be identified – for example, bright spots or outlines.
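The Difference of Gaussians idea can be sketched in a few lines: blur the image with two different radii and subtract the results, which leaves the high-contrast fragments. OpenCV and the specific radii below are illustrative choices.

```python
import cv2

gray = cv2.imread("landscape.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input

blur_small = cv2.GaussianBlur(gray, (0, 0), sigmaX=1.0)   # small blurring radius
blur_large = cv2.GaussianBlur(gray, (0, 0), sigmaX=2.0)   # larger blurring radius

# Bright where the image changes sharply: spots, edges, outlines
dog = cv2.subtract(blur_small, blur_large)
cv2.imwrite("dog_response.png", dog)
```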
Next, this information has to be expressed in numbers. The numerical data describing a picture fragment is called a descriptor. With the help of descriptors, image fragments can be compared quickly, fully and precisely, without using the actual fragments. There are a number of algorithms that can be used to identify the key areas of a picture and obtain their descriptors – for example SIFT, SURF and HOG.
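As a sketch of this step, here is how key points and their descriptors can be obtained with SIFT, one of the algorithms named above, using its OpenCV implementation (the file name is again illustrative).

```python
import cv2

gray = cv2.imread("landscape.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# Each key point gets a 128-number descriptor describing its neighbourhood
print(len(keypoints), "key points")
print(descriptors.shape)   # e.g. (1500, 128)
```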
Since a descriptor is a numerical description of data, comparing images – one of computer vision’s most important tasks – comes down to comparing numbers. This can still demand significant computational resources, so descriptors are divided into groups, or clusters: similar descriptors from different images end up in the same cluster. The operation of dividing descriptors into clusters is called clustering.
After clustering, the descriptor itself no longer needs to be looked at; all that matters is the number of the cluster it was sorted into. Going from a descriptor to its cluster number is called “quantisation”, while the cluster number itself is a “quantised descriptor”. Quantisation significantly reduces the volume of data that needs to be processed.
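Both steps can be sketched together: descriptors from many images are clustered with k-means, and each descriptor of a new image is then replaced by the number of its cluster, its quantised form. The library, the number of clusters and the file names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

all_descriptors = np.load("descriptors_from_training_images.npy")  # hypothetical

# Clustering: group similar descriptors into, say, 1000 clusters
kmeans = MiniBatchKMeans(n_clusters=1000, random_state=0)
kmeans.fit(all_descriptors)

# Quantisation: a new image's descriptors become a list of cluster numbers
new_descriptors = np.load("descriptors_from_new_image.npy")         # hypothetical
quantised = kmeans.predict(new_descriptors)
print(quantised[:10])   # the first ten cluster numbers, far smaller than the descriptors
```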
A computer uses quantised descriptors to recognise objects or compare images. For object recognition, quantised descriptors are used to train a classifier – an algorithm that separates images “with a house” from images “without a house”. To compare images, a computer builds sets of quantised descriptors for different pictures and then draws a conclusion about how similar the pictures, or their individual fragments, are. Duplicate image filtering and content-based image search on Yandex are based on this method.
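One common way to compare two pictures from their quantised descriptors, sketched below under the same assumptions as above, is to reduce each picture to a histogram counting how often every cluster number occurs and then compare the histograms. The function names and the use of cosine similarity are illustrative choices, not a description of Yandex’s implementation.

```python
import numpy as np

def visual_word_histogram(quantised_descriptors, n_clusters=1000):
    """Count how many of an image's descriptors fell into each cluster."""
    hist = np.bincount(quantised_descriptors, minlength=n_clusters).astype(float)
    return hist / (hist.sum() + 1e-9)   # normalise so image size doesn't matter

def similarity(hist_a, hist_b):
    """Cosine similarity between two histograms: values near 1.0 mean very alike."""
    return float(np.dot(hist_a, hist_b) /
                 (np.linalg.norm(hist_a) * np.linalg.norm(hist_b) + 1e-9))

# hist_a = visual_word_histogram(quantised_a)
# hist_b = visual_word_histogram(quantised_b)
# print(similarity(hist_a, hist_b))
```

The same histograms can also serve as the feature vectors used to train the classifier described earlier.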
This is just one of many approaches to analysing images. There are others: artificial neural networks, for instance, are increasingly used in image recognition, and they can identify the relevant classification criteria on their own while learning. Narrow, specialised fields use their own methods to work with images – for example, to read barcodes.
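A compact sketch of the neural-network approach, assuming PyTorch and a tiny made-up network: instead of hand-picked descriptors, a convolutional network learns its own criteria directly from the pixels. Every shape and layer size here is an arbitrary example.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 2),        # two outputs: "house" / "no house"
)

images = torch.randn(8, 3, 64, 64)     # a batch of 64x64 RGB pictures (random stand-in)
labels = torch.randint(0, 2, (8,))     # hypothetical labels

loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()                        # the network adjusts its own criteria
```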

Applications of computer vision

In the ability to recognise, computers still cannot compete with humans. Machines excel only at specific tasks, such as recognising numbers or machine-printed text. Successfully differentiating cats from dogs (in real life, that is, and not in a laboratory) is still very difficult for a computer. That’s why the system used in Yandex.Images, for instance, primarily searches for “cat” or “dog” by analysing not the actual images but rather the text that accompanies them.
In certain cases, however, computer vision can be a powerful aid. One such case is processing human faces. It involves two related but different tasks: detection and recognition.
Often it’s sufficient to simply find (that is, detect) a face in a photograph, without identifying whose face it is. The “Portraits” filter on our image search service, Yandex.Images, works like this. The search query [формула 1] (formula 1) will mostly yield pictures of racing cars, but if the user specifies an interest in “Portraits”, Yandex.Images will show pictures of the competitors.
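A rough sketch of the detection task, finding faces without saying whose they are, using the Haar-cascade detector bundled with OpenCV. This is only an illustration of the idea, not the method any particular service uses.

```python
import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

print(len(faces), "face(s) found")
for (x, y, w, h) in faces:
    print("face at", x, y, "size", w, "x", h)
```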
Sometimes it’s not enough just to find people; their faces have to be recognised: “This is Vladimir”. Our photo sharing service, Yandex.Fotki, has this function. The system automatically tags people in new photographs if a few photos of the same people, tagged by hand, have already been uploaded to the service. If the system has access to 10 photos in which Vladimir is identified, it’s not difficult to recognise him in the 11th. If Vladimir does not want to be recognised, he can block identification of himself in photos.
One of the most promising areas for the application of computer vision is augmented reality: technology that overlays virtual elements, such as text prompts, on a picture of the real world. An example is mobile apps that give a user information about a building when it is captured in the camera of a phone or tablet. Although programs, services and devices that use augmented reality already exist, the technology is still nascent.