Duplicate Images and Yandex Image Search

Sometimes a picture really is worth a thousand words, and that is exactly when an image search engine comes in handy. For someone who just wants to know what a [fennec fox] is, looking at a picture makes more sense than reading about the shape of the animal’s ears or the length of its tail. An image complements almost any search result and is essential to queries like [Rothko], which on Yandex retrieve not only information about the artist, but also images of his masterpieces.

Yandex retrieves images using text that refers to the image in some way: a tag, a text description, an HTML title, a page title, or a link to the image from another website. Of the billions of images available through Yandex, only half are unique. The remaining half are so-called duplicate images, those that do not differ at all or differ only insignificantly.

Yandex.Images, a specialized image search service, categorizes duplicate copies it finds into four classes:

Exact duplicates: identical copies of an image that do not differ in any way.

Thumbnail duplicates: image copies that differ only in size, e.g. the full-size image of a picture on an art gallery website and a thumbnail of the same picture in the navigation menu of that website.

Semi-duplicates: watermarked, slightly colour-enhanced, framed or cropped copies of the same image.

Enhanced semi-duplicates: considerably colour-enhanced, altered or fragmented copies of the same image.

Every image available on the web has, on average, three duplicate copies. To filter out identical images in its search results, Yandex.Images sorts them into groups before processing.

Grouping duplicate images

For a computer to recognize images and find duplicates, visual content has to be translated into a digital form that a computer can work with. Yandex.Images has a dedicated system to do this. The system, whose programs find images on the web and process them, extracts the necessary information about each image, such as size, colour and format (JPG, PNG, etc.), and creates a signature, a digital description of the image.

To create an image signature, the system chooses a meaningful fragment of the image, shrinks this fragment to 16x16 pixels and assigns to each of the 256 pixels a number matching its brightness. The resulting number sequence is the signature of the image.
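The signature step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fragment is already chosen and already converted to grayscale, and block averaging is used for shrinking, which the article does not specify. The function name brightness_signature is hypothetical.

```python
def brightness_signature(pixels, size=16):
    """Shrink a grayscale fragment to size x size and return its brightness values.

    `pixels` is a list of rows, each a list of brightness values (0-255).
    Block averaging is an assumed resampling method, not a documented one.
    """
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("fragment must be at least size x size pixels")
    signature = []
    for row in range(size):
        top, bottom = row * height // size, (row + 1) * height // size
        for col in range(size):
            left, right = col * width // size, (col + 1) * width // size
            # Average every source pixel that falls into this output cell.
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            signature.append(sum(block) // len(block))
    return tuple(signature)


# A 32x32 fragment whose brightness grows from left to right.
fragment = [[x * 8 for x in range(32)] for _ in range(32)]
signature = brightness_signature(fragment)
print(len(signature))  # 256
```

Returning a tuple keeps the signature hashable, so it can serve directly as a dictionary key when images are later grouped by signature.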

Images with identical fragments (and, consequently, identical signatures) are grouped together. The images within these groups are then sorted into smaller groups of more closely matching images: those sharing at least two fragments. The most closely matching images are classed as potential duplicates. The program marks the image areas that include all of the matching fragments, shrinks these areas down to about 60x60 pixels, digitizes them and compares the resulting signatures. Images with matching signatures are classed as duplicates.
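A minimal sketch of the first two grouping passes, assuming each image has already been reduced to a set of fragment signatures. Short strings stand in for the real 256-number signatures, and the function name candidate_duplicates is hypothetical. The final 60x60 comparison is not shown.

```python
from collections import defaultdict
from itertools import combinations


def candidate_duplicates(fragment_signatures, min_shared=2):
    """Return pairs of image ids sharing at least `min_shared` fragment signatures.

    `fragment_signatures` maps an image id to the set of its fragment signatures.
    """
    # Pass 1: bucket images by fragment signature.
    buckets = defaultdict(set)
    for image_id, signatures in fragment_signatures.items():
        for signature in signatures:
            buckets[signature].add(image_id)

    # Pass 2: count how many fragment signatures each pair of images shares.
    shared = defaultdict(int)
    for image_ids in buckets.values():
        for pair in combinations(sorted(image_ids), 2):
            shared[pair] += 1

    return sorted(pair for pair, count in shared.items() if count >= min_shared)


images = {
    "gallery.jpg": {"a1", "b2", "c3"},
    "thumb.jpg": {"a1", "b2"},
    "other.jpg": {"a1", "z9"},
}
pairs = candidate_duplicates(images)
print(pairs)  # [('gallery.jpg', 'thumb.jpg')]
```

Bucketing first means only images that already share a signature are ever compared with each other, which is one plausible reason grouping of this kind scales to very large collections.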

This method of grouping allows Yandex.Images to process over 2 billion images very quickly.

Using duplicate images

Most images on the web have text descriptions, and this is exactly what the Yandex.Images search technology uses to perform searches. Copies of the same image on different websites are likely to have different text descriptions. When grouping images, the service reads and compares the descriptions of all duplicates to find common or similar text fragments and to filter out odd images that merely happen to have a matching text description. Matching text descriptions with visual content allows the service to boost the relevance of search results.

Let’s say a photo of a long blue car has forty duplicate copies. Fifteen of these duplicates are tagged “car”, ten have a “blue car” tag, five are tagged “green car” and the remaining ten are tagged “long”. The ratio of the number of duplicates that use a tag to the total number of duplicates shows how well this tag matches the image it describes:

[car] — 0.75 (30 images out of 40);
[blue] — 0.25 (10 images out of 40);
[long] — 0.25 (10 images out of 40);
[green] — 0.125 (5 images out of 40).

If 30 out of 40 images of a car have the tag ‘car’, this text description most likely matches what it describes. The image will be quite relevant to a user looking for [car] on Yandex.Images. A duplicate copy with a text description other than ‘car’ might also be relevant to the search query [car], and even to queries like [blue car] or [long car], since the degree of match between the content of the car image and its description is still quite high, even if not complete. If the text descriptions of some duplicates in the search engine’s database are mutually exclusive, like ‘green car’ and ‘blue car’ for essentially identical images, the service returns the image whose description occurs more often.
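The arithmetic in this example can be reproduced with a short Python sketch. The function names tag_ratios and preferred are hypothetical, and splitting descriptions into words on whitespace is an assumption made for illustration.

```python
from collections import Counter


def tag_ratios(descriptions):
    """Return, for each word, the share of duplicates whose description contains it."""
    counts = Counter()
    for description in descriptions:
        # Count each word once per duplicate, however often it repeats.
        counts.update(set(description.lower().split()))
    total = len(descriptions)
    return {word: count / total for word, count in counts.items()}


def preferred(ratios, conflicting_words):
    """Among mutually exclusive words, pick the one that occurs most often."""
    return max(conflicting_words, key=lambda word: ratios.get(word, 0.0))


descriptions = (["car"] * 15 + ["blue car"] * 10
                + ["green car"] * 5 + ["long"] * 10)
ratios = tag_ratios(descriptions)
print(ratios["car"], ratios["blue"], ratios["long"], ratios["green"])
# 0.75 0.25 0.25 0.125
print(preferred(ratios, ["green", "blue"]))  # blue
```

Here 'blue' wins over 'green' because it appears in 10 of the 40 descriptions rather than 5.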

Besides being instrumental in providing relevant search results within milliseconds, grouping images also enriches the search experience for users. After clicking on an image in the Yandex.Images search results, the user is directed to a page that shows a list of copies of this image and a link to a page with every copy of the image the search engine could find. The list of copies allows the user to choose an image of the right size, or to find out which websites host the image or which photo bank it can be purchased from.

The duplicate detection technology also helps identify adult content on websites, and this information is used to refine search results in family search or in moderated search. Yandex’s signature database contains numerical characteristics for images hosted on adult content websites. Using a special algorithm, the search engine checks a newly indexed website for adult content by testing whether the signatures of the images it hosts match the signatures of adult content images in its database. This mechanism allows users who enable Family search to filter controversial websites and images out of their search results.
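The signature check can be pictured roughly as follows. The article does not describe the actual algorithm, so this sketch simply measures what share of a site's image signatures appear in a set of known signatures. The function name adult_match_share, the placeholder signatures and the 0.5 threshold are all assumptions.

```python
def adult_match_share(site_signatures, known_adult_signatures):
    """Return the share of a site's image signatures found in the known set."""
    if not site_signatures:
        return 0.0
    matches = sum(1 for signature in site_signatures
                  if signature in known_adult_signatures)
    return matches / len(site_signatures)


# Placeholder strings stand in for real numeric signatures.
known = {"sig-a", "sig-b", "sig-c"}
site = ["sig-a", "sig-x", "sig-b", "sig-y"]

share = adult_match_share(site, known)
flagged = share >= 0.5  # hypothetical threshold, not taken from the article
print(share, flagged)  # 0.5 True
```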

Most copied images

A single image, such as a photo of a popular cell phone model, may have tens of thousands of copies. Photos of celebrities and consumer products are the most likely to be copied.