MatrixNet: New Level of Search Quality
Search Quality
The job of a search engine is, first and foremost, to provide answers to users’ queries. In response to each query, a search engine returns links to web pages it finds in its index – a database of web pages known to this particular search engine. Thus, an answer to a user’s query comes in the form of search results – a list of hyperlinks to web pages whose content matches the query.
This is how it works:
These days, it is hard to find a search query that returns fewer than a dozen results. Most searches retrieve links to millions of web pages. The number of answers potentially matching any given search query keeps growing along with the rapid growth of the internet. It doesn’t make much sense to show the user every potentially matching page that exists – a person would have to browse through dozens of resources before anything useful came up. Instead, a search engine ranks the search results, placing the most relevant of them on top.
Looking at these search results, the user may feel quite satisfied, not really satisfied, or not satisfied at all. This subjective feeling of getting (or not getting) what one was searching for is what describes the quality of search from the user’s point of view – is this information useful for me? The trick is to describe and measure all these subjective attitudes and to take everyone into account. The quality of search depends on how well search results are ranked. Ranking means sorting search results in a way that meets users’ expectations.
Machine learning
It’s impossible to build a perfect algorithm that would come up with the best possible result for every possible query. Yandex’s search engine processes more than 100,000,000 queries every day. Almost half of these queries are unique. To deal with this load of questions successfully, a search engine has to be able to make decisions based on previous experience, that is, it has to learn.
Machine learning is essential not only in search technology. Speech or text recognition, for instance, is also impossible without a machine being able to learn. The term ‘machine learning’, coined in the ‘50s, basically means the effort to make a computer perform tasks that come naturally to humans but are difficult to break down into algorithmic patterns ‘understandable’ by machines. A machine that can learn is a machine that can make its own decisions based on input algorithms, empirical data and experience.
Decision making, however, is a human quality, which a machine cannot really master. What it can do, though, is learn to create and apply a rule that helps to decide whether a particular web page is a good answer to a user’s question or not.
This rule is based on properties of web pages and users’ queries. Some of these properties, like the number of links leading to a particular page, are static – they describe a web page only. Others, like whether a web page has words matching a search query, how many and where on the page, are dynamic – they describe both a web page and a search query. There are also properties specific only to search queries, such as geolocation. For a search engine, this means that to give a good answer to a user’s question it has to factor in where this question has come from.
These quantifiable properties of web pages and search queries are called ranking factors. These factors are key in performing exact searches and making the decision on which results are the most relevant. For a search engine to return relevant results for a user’s query, it needs to consider a multitude of such factors.
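The three kinds of properties described above can be pictured as one record of numbers computed for each page and query pair. The following is only a minimal sketch: the `Page` and `Query` classes, the factor names and the example URL are assumptions made for illustration, not Yandex's actual factors.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    inbound_links: int  # known for the page alone


@dataclass
class Query:
    text: str
    region: str  # where the question came from


def ranking_factors(page: Page, query: Query) -> dict:
    """Collect static, dynamic and query-specific factors into one record."""
    page_words = page.text.lower().split()
    query_words = query.text.lower().split()
    matches = sum(page_words.count(word) for word in query_words)
    return {
        # static: describes the page only
        "inbound_links": page.inbound_links,
        # dynamic: describe the page together with the query
        "query_word_matches": matches,
        "match_in_first_words": any(w in page_words[:5] for w in query_words),
        # query-specific: describes the query only
        "query_region": query.region,
    }


# Hypothetical page and query, invented for this example.
page = Page("example.org/apples", "Apple varieties and apple recipes", 120)
factors = ranking_factors(page, Query("apple", "Moscow"))
```

A real system would compute far more factors than this; the point is only that each one is a quantifiable value a formula can later combine.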
Three types of ranking factors:
To approximate users’ expectations, a search engine requires sample user queries and matching results that users have already judged satisfactory. Assessors – people who decide whether a particular web page offers a ‘good’ response to a certain search query – provide these evaluations. A number of search responses, together with the corresponding queries, make up a learning sample from which a search engine ‘learns to find’ dependencies between these web pages and their properties. To represent real users’ search patterns truthfully, a learning sample has to include all kinds of search queries in the same proportion as they occur in real life.
After a search engine has found dependencies between web pages in the learning sample and their properties, it can choose the best ranking formula for the results it can deliver for a specific user’s query and place the most relevant of them above all the rest.
Think of teaching a machine how to pick the most delicious apples. First, assessors take a bite of each apple in a ‘tasting crate’ and put all tasty apples to the right and all sour apples to the left. This crate contains all sorts of apples in the same proportion as they are likely to grow in the garden. A machine cannot taste apples, but it can analyze their properties, like size, color, sugar content, firmness, presence or absence of a leaf. The tasting crate is a learning sample, which allows the machine to learn to select the apples with the winning combination of properties: size, color, sweetness and firmness. Errors are unavoidable, though. For instance, if a machine does not have any information about insect larvae, the best apples it has selected might hide a worm. To minimize the probability of error, a machine needs to consider as many of the apples’ properties as possible.
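The apple analogy can be turned into a toy learner. The sketch below uses a nearest-centroid rule, which is a stand-in chosen for brevity and not the method Yandex uses; the two properties (sugar content and firmness) and all the numbers are invented.

```python
def centroid(rows):
    """Average each property over a group of apples."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def train(crate):
    """Learn one 'typical apple' per label from the assessors' tasting crate."""
    tasty = [props for props, label in crate if label == "tasty"]
    sour = [props for props, label in crate if label == "sour"]
    return {"tasty": centroid(tasty), "sour": centroid(sour)}


def predict(model, props):
    """Label a new apple by the closest typical apple."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: squared_distance(model[label], props))


# (sugar content in %, firmness on a 0-10 scale): invented learning sample
crate = [
    ((14.0, 7.0), "tasty"),
    ((13.0, 8.0), "tasty"),
    ((15.0, 6.5), "tasty"),
    ((9.0, 8.5), "sour"),
    ((8.0, 9.0), "sour"),
    ((10.0, 7.5), "sour"),
]
model = train(crate)
sweet_guess = predict(model, (13.5, 7.0))
sour_guess = predict(model, (8.5, 8.0))
```

The machine never tastes anything: it only compares measurable properties with those of apples the assessors have already judged, which is why a property missing from the data (such as a hidden worm) cannot influence its choice.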
MatrixNet
Machine learning has been used in search technologies since the early 2000s. Different search systems use different models. One of the problems in machine learning is overfitting. An algorithm that overfits its data is like a sophomore medical student who diagnoses himself with every possible symptom he has read about in his manual. Not having been exposed to real practice yet, he makes up causes for the natural things he observes.
When a computer uses a large number of factors (properties of web pages and search queries, in our case) on a relatively small learning sample (‘good’ results as estimated by assessors), it begins to find dependencies that do not exist. For example, a learning sample might accidentally include two different pages that share the same particular combination of factors: both are 2 KB in size, have a purple background and feature text that starts with “A”. And, by sheer chance, both pages happen to be relevant to the search query [apple]. A computer may deem this accidental combination of factors to be essential for a search result to be relevant to the search query [apple]. At the same time, all web pages offering really relevant and useful information about apples, but lacking this particular combination of factors, will be considered less important.
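The purple-background example can be reproduced with a deliberately naive learner. This is a sketch under assumptions: the learner, the factor names and the sample pages are all made up to show the failure mode, and no real search engine learns this way.

```python
def learn_overfit_rule(sample):
    """Keep every factor value shared by all relevant pages, however accidental."""
    relevant = [factors for factors, is_relevant in sample if is_relevant]
    shared = dict(relevant[0])
    for factors in relevant[1:]:
        shared = {k: v for k, v in shared.items() if factors.get(k) == v}
    return shared


def matches(rule, factors):
    """A page passes only if it reproduces every factor value in the rule."""
    return all(factors.get(k) == v for k, v in rule.items())


# Tiny learning sample for the query [apple]: (factors, assessor's verdict)
sample = [
    ({"size_kb": 2, "background": "purple", "first_letter": "A",
      "mentions_apple": True}, True),
    ({"size_kb": 2, "background": "purple", "first_letter": "A",
      "mentions_apple": True}, True),
    ({"size_kb": 35, "background": "white", "first_letter": "T",
      "mentions_apple": False}, False),
]
rule = learn_overfit_rule(sample)

# A genuinely useful page about apples that lacks the accidental combination.
new_page = {"size_kb": 40, "background": "white", "first_letter": "G",
            "mentions_apple": True}
overfit_verdict = matches(rule, new_page)
```

With only two relevant examples, the rule treats page size, background color and first letter as if they mattered, so the useful new page is rejected. More examples or a method that resists such coincidences is needed to avoid this.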
In 2009 Yandex launched MatrixNet, a new method of machine learning. A key feature of this method is its resistance to overfitting, which allows Yandex’s search engine to take into account a very large number of factors when it decides how relevant search results are. At the same time, the search system does not need more samples of search results to learn how to tell the ‘good’ from the ‘not so good’. This safeguards the system from making mistakes by finding dependencies that do not exist.
MatrixNet makes it possible to generate a very long and complex ranking formula, which considers a multitude of various factors and their combinations. Alternative machine learning methods either produce simpler formulas using a smaller number of factors or require a larger learning sample. MatrixNet builds a formula based on tens of thousands of factors, which significantly increases the relevance of search results.
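To picture what "a formula that considers factors and their combinations" might look like, here is a deliberately tiny sketch. The source does not describe the internal form of the MatrixNet formula, so the representation below (a weighted sum of products of factors) and every name and weight in it are assumptions for illustration only.

```python
from math import prod

# Each term is (weight, names of the factors multiplied together).
# One-name terms use a single factor; longer tuples are combinations.
FORMULA = [
    (0.5, ("text_match",)),
    (0.2, ("link_score",)),
    (0.8, ("text_match", "link_score")),
    (0.3, ("text_match", "same_region")),
]


def relevance(factors, formula=FORMULA):
    """Sum the weighted products of the factors named in each term."""
    return sum(
        weight * prod(factors[name] for name in names)
        for weight, names in formula
    )


page = {"text_match": 0.5, "link_score": 0.5, "same_region": 1.0}
page_score = relevance(page)
```

A formula of the kind described in the text would have vastly more terms than these four; the sketch only shows that combinations let two factors matter more together than either does alone.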
Another important feature of MatrixNet is that it allows a ranking formula to be customized for a specific class of search queries. Tweaking the ranking algorithm for, say, music searches will not undermine the quality of ranking for other types of queries. A ranking algorithm is like complex machinery with dozens of buttons, switches, levers and gauges. Commonly, any single turn of any single switch in a mechanism will result in a global change in the whole machine. MatrixNet, however, allows specific parameters to be adjusted for specific classes of queries without causing a major overhaul of the whole system.
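One way to picture such isolated tuning is a shared base formula plus corrections that apply only to one class of queries. This is a hypothetical sketch: the source does not say how MatrixNet achieves the isolation, and the weights, factor names and query classes here are invented.

```python
# Weights shared by every query class.
BASE_WEIGHTS = {"text_match": 1.0, "link_score": 0.5}

# Extra terms that apply only to one class of queries.
CLASS_ADJUSTMENTS = {"music": {"has_audio": 0.7}}


def class_score(factors, query_class):
    """Base score plus the corrections registered for this query class."""
    total = sum(w * factors.get(name, 0.0) for name, w in BASE_WEIGHTS.items())
    for name, w in CLASS_ADJUSTMENTS.get(query_class, {}).items():
        total += w * factors.get(name, 0.0)
    return total


page = {"text_match": 0.8, "link_score": 0.4, "has_audio": 1.0}
news_before = class_score(page, "news")
CLASS_ADJUSTMENTS["music"]["has_audio"] = 1.5  # retune music searches only
news_after = class_score(page, "news")
music_after = class_score(page, "music")
```

Because the music correction lives in its own table, changing it leaves the score for every other class exactly where it was.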
Change of a single parameter in different ranking formulas:
In addition, MatrixNet can automatically choose sensitivity for specific ranges of ranking factors. It’s like trying to hear someone whisper on an airfield. Figuratively speaking, MatrixNet can hear both the whisper and the sound of planes landing or taking off.
Ranking
For each user’s query, a search engine has to evaluate properties of millions of pages, assess their relevancy and rank them accordingly, with the most relevant on top. Scanning each page in succession would either require a huge number of servers (that could deal with all those pages very quickly) or take a lot of time – but a searcher cannot wait. MatrixNet solves this problem, as it allows web pages to be checked against a very large number of ranking factors without increasing processing power.
In response to each query, more than a thousand servers simultaneously perform a search. Each server searches within its own part of the index to produce a list of the best results. This list is guaranteed to include the web pages most relevant to this query.
The next step is to produce one final list of top results based on all those lists of the most relevant pages produced by each server. These results are then ranked using that long and complicated MatrixNet formula, which makes it possible to consider a multitude of ranking factors and their combinations. Thus, the most relevant websites find their way to the top of search results, and the user receives an answer to their question almost instantly.
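The two steps above (per-server shortlists, then one final ranking) can be sketched as follows. The split into a `cheap_score` used on each server and a `full_score` used for the final ordering is an assumption, as are the page fields; the source only says that each server returns its best results and that the merged list is ranked with the MatrixNet formula. The loop below visits the shards one after another, whereas the servers in the text work simultaneously.

```python
import heapq


def shard_top_k(shard, cheap_score, k):
    """Each server returns the k best pages from its own part of the index."""
    return heapq.nlargest(k, shard, key=cheap_score)


def search(shards, cheap_score, full_score, k):
    """Merge the per-shard shortlists and rank them with the full formula."""
    candidates = []
    for shard in shards:
        candidates.extend(shard_top_k(shard, cheap_score, k))
    return sorted(candidates, key=full_score, reverse=True)[:k]


def cheap_score(page):
    return page["text_match"]


def full_score(page):
    t, links = page["text_match"], page["links"]
    return t + links + t * links  # stand-in for the long ranking formula


# Two hypothetical index shards with invented factor values.
shards = [
    [{"url": "a", "text_match": 0.9, "links": 0.1},
     {"url": "b", "text_match": 0.2, "links": 0.9},
     {"url": "c", "text_match": 0.1, "links": 0.1}],
    [{"url": "d", "text_match": 0.8, "links": 0.8},
     {"url": "e", "text_match": 0.3, "links": 0.2}],
]
results = search(shards, cheap_score, full_score, k=2)
```

Only the shortlisted candidates ever reach the expensive formula, which is what keeps the final ranking fast even when the index holds millions of pages.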
This is approximately how ranking works:
============
Yandex Traffic Jam Technology Overview
How Yandex.Traffic Works
Yandex.Traffic shows a picture of the current traffic conditions in a city. It gathers information from different sources, analyses this data, and plots the results on the city’s map on Yandex.Maps. For the larger cities, where traffic jams are a serious problem rather than a small inconvenience, the service also calculates the average level of traffic congestion on a scale from 0 to 10. To put an accurate and current picture of the city’s traffic conditions on the map, Yandex.Traffic goes a long way.
Sourcing data
Imagine getting into a small traffic accident – no victims, just a couple of scratches. Your bad luck has blocked two out of three lanes on a major city road. The drivers in these two lanes now have to drive around your scratched vehicle, while the drivers in the third lane have to let these cars into the unobstructed lane. Some of the drivers in this lane use the Yandex.Maps application on their mobile devices, which sends information about the movements of their cars to the analytical center at Yandex.Traffic. As these drivers approach the spot of your accident, they slow down, and their mobile devices start signaling a possible traffic jam to the service.
To take part in the common effort of gathering traffic information, a motorist needs an internet-connected mobile device (cellphone, smartphone, PDA) with a GPS function (built in or using an external receiver). Once the user has downloaded the Yandex.Maps mobile application and activated the “Send traffic information” option, the device sends its geographic coordinates, direction and speed to Yandex.Traffic’s automated analytical system every three minutes. All information is non-personal, which means that nothing in it could reveal any specific information about the user or their car. Then, using all available data, the automated analyzer creates a track – an integrated route for each vehicle, together with the speed at which the vehicle travels along this route. Along with contributions from private users, traffic information is also supplied by Yandex partners – companies whose fleet vehicles operate in the city on a regular basis.
In addition to sending their coordinates to Yandex.Traffic, drivers can also notify the service about traffic accidents or road works. Having spotted your accident, for instance, a conscientious driver can mark it on the map with a small, round accident icon in their mobile Yandex.Maps application. There are icons for marking road works as well.
Another data source for Yandex.Traffic in Moscow is road cameras placed along the major roads, large junctions and road intersections (some of these cameras can be viewed live on Yandex.Maps). Operators visually monitoring traffic via these cameras evaluate current conditions based on the speed and intensity of traffic on the visible section of the road. While they would mark the part of the road before the spot of your accident “green” – free flowing traffic – they would have to change the color to “yellow” – slow traffic flow – where the traffic approaches the place at which cars have to make it past your immobile vehicle. And if the road police do not hurry, operators will have to use “red” to signal a traffic jam. The major highways, such as MKAD and Moscow’s Third Ring Road, have special camera detectors that can identify cars, calculate traffic density and rate traffic conditions automatically.
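A track of the kind described above can be pictured as a time-ordered list of position reports from which a speed is derived for each leg. The sketch below is an assumption about how such a calculation might look, not Yandex's implementation: it uses the standard haversine formula for distance, and the coordinates are invented, spaced 180 seconds apart to match the three-minute interval mentioned in the text.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def track_speeds_kmh(pings):
    """pings: list of (timestamp_s, lat, lon) sorted by time.

    Returns the average speed in km/h on each leg between consecutive pings.
    """
    speeds = []
    for (t1, lat1, lon1), (t2, lat2, lon2) in zip(pings, pings[1:]):
        distance_m = haversine_m(lat1, lon1, lat2, lon2)
        speeds.append(distance_m / (t2 - t1) * 3.6)
    return speeds


# Invented reports from one device: roughly 1.5 km, then roughly 170 m.
pings = [
    (0, 55.7500, 37.6000),
    (180, 55.7635, 37.6000),
    (360, 55.7650, 37.6000),
]
speeds = track_speeds_kmh(pings)
```

In this made-up track the car covers the first leg at about 30 km/h and then slows to walking pace, which is the kind of change that hints at a jam forming.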
Tracks Processing Technology
To make a track for a moving car, Yandex.Traffic needs a series of geographic coordinates of this car, provided by the driver’s GPS device and sent to the service by the Yandex.Maps application. The problem here is that GPS readings can be off by between one and ten meters in any direction, which may “move” your car onto a sidewalk or the rooftop of the nearest building. GPS coordinates provided by the user are therefore matched against the city’s electronic map, which accurately displays buildings, parks, roads with all markings and other urban facilities. This detailed mapping allows the system to correct the course of a car based on the real physical layout, even if the GPS coordinates say that the car is on the wrong side of the road or has cut through a building instead of turning the corner and following the road markings.
Another important issue is to understand how useful the speed information received from the driver is, as it may or may not truthfully reflect the real situation on the road. A driver sending information about their movements via the Yandex.Maps application may have slowed down or stopped not because of the general traffic conditions, but because they were not sure whether to turn, or wanted to grab some milk from a corner shop. If all other cars on the same route that send information to Yandex.Traffic proceed as normal, the system ignores the rogue track when it evaluates the general traffic intensity. That is why the number of users matters. The more drivers use Yandex.Maps to send information to Yandex.Traffic, the more accurate the picture of the real road situation is.
Based on a number of reliable tracks, the system gives a particular segment of the road on the map a “green”, “yellow” or “red” color depending on traffic density on the corresponding section of the road, the same way as the operators monitoring the cameras do.
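The correction described above is often called map matching. Its simplest form, sketched below, projects a noisy point onto the nearest road segment. This is an illustration under assumptions: coordinates are treated as flat x/y meters in a local projection, roads are plain line segments, and the source does not say which matching method Yandex.Traffic actually uses.

```python
def snap_to_segment(point, start, end):
    """Closest point to `point` on the segment from `start` to `end`."""
    (px, py), (ax, ay), (bx, by) = point, start, end
    dx, dy = bx - ax, by - ay
    length_squared = dx * dx + dy * dy
    if length_squared == 0:
        return start  # degenerate segment: both ends coincide
    # Position of the projection along the segment, clamped to its ends.
    t = ((px - ax) * dx + (py - ay) * dy) / length_squared
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def snap_to_road(point, segments):
    """Snap a GPS fix to the nearest of several road segments."""
    def squared_distance(candidate):
        return (candidate[0] - point[0]) ** 2 + (candidate[1] - point[1]) ** 2
    return min(
        (snap_to_segment(point, a, b) for a, b in segments),
        key=squared_distance,
    )


# An invented street corner: 100 m east, then 100 m north.
roads = [((0, 0), (100, 0)), ((100, 0), (100, 100))]
on_first_street = snap_to_road((40, 7), roads)    # 7 m off, on the "sidewalk"
on_second_street = snap_to_road((95, 50), roads)  # 5 m off, inside a "building"
```

A production system would also use the direction of travel and the previous fixes to avoid snapping to the wrong street at a junction; nearest-segment projection alone is only the first step.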
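Ignoring rogue tracks and coloring a segment can be sketched as below. Everything numeric here is an assumption: the source gives neither the outlier rule nor the color thresholds, and it speaks of traffic density where this sketch uses speed alone. The function, its parameters and the 60 km/h free-flow speed are hypothetical.

```python
from statistics import median_low


def segment_color(speeds_kmh, free_flow_kmh=60.0, tolerance=0.5):
    """Color one road segment from the speeds of the tracks that crossed it.

    Tracks further than `tolerance` (as a fraction) from the typical speed
    are treated as rogue and ignored.
    """
    typical = median_low(speeds_kmh)  # always one of the reported speeds
    reliable = [v for v in speeds_kmh if abs(v - typical) <= tolerance * typical]
    mean_speed = sum(reliable) / len(reliable)
    ratio = mean_speed / free_flow_kmh
    if ratio >= 0.7:
        return "green"
    if ratio >= 0.35:
        return "yellow"
    return "red"


# One driver stopped for milk (3 km/h) while everyone else moved freely.
free_road = segment_color([52, 55, 58, 3, 56])
slow_road = segment_color([30, 28, 33])
jammed_road = segment_color([12, 15, 10, 14])
```

The single slow track on the free road does not change its color, while a segment where every track is slow is marked as jammed. With only one or two tracks no such cross-check is possible, which is why the number of contributing drivers matters.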
Merging Data
The next step is to bring all available information together. Every two minutes a program pieces together, like a jigsaw puzzle, all information from cameras and mobile Yandex.Maps users and maps the results on the Traffic Jams layer, both in the mobile application and on Yandex.Maps online. If the information for the same section of the road from different sources does not match, the program chooses the most reliable, that is, the most probable given the number of tracks and their recency.
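Choosing between conflicting sources can be sketched as a scoring problem. The scoring rule below (track count discounted by age with a five-minute half-life) is purely an assumption, as are the report fields and the idea of giving a camera report an equivalent track count; the source only says the choice depends on the number of tracks and their recency.

```python
def reliability(report, now_s, half_life_s=300.0):
    """More tracks and fresher data both make a report more reliable."""
    age_s = now_s - report["timestamp_s"]
    return report["tracks"] * 0.5 ** (age_s / half_life_s)


def merge(reports, now_s):
    """Pick, for each road segment, the color from the most reliable report."""
    best = {}
    for report in reports:
        current = best.get(report["segment"])
        if current is None or reliability(report, now_s) > reliability(current, now_s):
            best[report["segment"]] = report
    return {segment: report["color"] for segment, report in best.items()}


# Invented reports; segment names and counts are hypothetical.
reports = [
    {"segment": "A", "source": "mobile", "color": "red", "tracks": 12, "timestamp_s": 940},
    {"segment": "A", "source": "camera", "color": "yellow", "tracks": 5, "timestamp_s": 1000},
    {"segment": "B", "source": "mobile", "color": "green", "tracks": 3, "timestamp_s": 400},
    {"segment": "B", "source": "camera", "color": "yellow", "tracks": 2, "timestamp_s": 1000},
]
layer = merge(reports, now_s=1000)
```

On segment A the many recent mobile tracks outweigh the camera, while on segment B the ten-minute-old mobile data loses to the fresh camera report.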
Point scale
For Moscow, Saint Petersburg, Ekaterinburg and Kiev, where traffic jams have almost turned into a natural disaster, Yandex.Traffic offers a ten-point scale for levels of traffic congestion, with zero points meaning free flowing traffic and 10 points signaling a ‘complete standstill’. This scale helps drivers to instantly estimate roughly how much time they are likely to waste in a traffic jam. So, an average of seven points for Kiev tells a driver that the trip will take approximately twice as long as when the traffic is free flowing.
For each of the cities, the scale is localized – what is a slight sluggishness in Moscow is a big problem in some other city. A congestion level of six points in Saint Petersburg will make a local driver waste as much time as a motorist in Moscow would spend traveling at five points.
The congestion level scale is based on reference time, which is the time it takes for a car to drive through a standard route, which covers all main roads and avenues, without breaking the law. Considering the general traffic congestion at the moment, the aggregator program calculates the difference between the reference time and the time calculated from the speed information it receives from each particular driver. Using the time difference for every single route, the program calculates the mean, which translates into points of traffic congestion level for the whole city.
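The calculation above can be sketched as follows. The source does not publish the conversion from time difference to points, so the threshold tables are invented; they are chosen so that a trip taking twice the reference time scores seven points in the first table, echoing the Kiev example. Using the relative rather than absolute delay is also an assumption.

```python
def congestion_points(route_times, thresholds):
    """Convert route delays into a 0-10 congestion level.

    route_times: list of (reference_minutes, actual_minutes) per route.
    thresholds: ten ascending relative delays; the level is the number of
    thresholds the mean delay has reached.
    """
    delays = [(actual - reference) / reference for reference, actual in route_times]
    mean_delay = sum(delays) / len(delays)
    return sum(1 for threshold in thresholds if mean_delay >= threshold)


# Hypothetical localized scales: the same delay maps to different points.
CITY_ONE = [0.05, 0.15, 0.3, 0.45, 0.6, 0.8, 1.0, 1.3, 1.7, 2.5]
CITY_TWO = [0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.8, 1.0, 1.4, 2.0]

free_flow = congestion_points([(30, 30), (20, 21)], CITY_ONE)
twice_as_long = congestion_points([(30, 60), (20, 40)], CITY_ONE)
same_delay_elsewhere = congestion_points([(30, 60), (20, 40)], CITY_TWO)
```

Keeping a separate threshold table per city is one simple way to express the localization described in the text: the measured delay is the same, but what it means to a local driver differs.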
Related links
Download Yandex.Maps for mobile (in Russian)
Download report “Traffic in Moscow, Saint Petersburg, Ekaterinburg and Kiev, 2008” (in Russian) (pdf, 1.2 MB)
Download report “Traffic in Moscow, winter 2007” (in Russian) (pdf, 0.8 MB)
Download report “Traffic in Moscow, summer 2007” (in Russian) (pdf, 1.3 MB)
Yandex.Traffic promo (in Russian)
Yandex.Traffic video (in Russian)
============
Yandex also offers a broad range of services that are free to our users and that enable them to find relevant and objective information quickly and easily and to communicate and connect over the internet. The table below provides an overview of the principal services we currently offer to our users:
Our server application for indexing and searching files in various popular formats on web servers, local networks and database management systems.
Our specialized toolbar for the user’s browser, offering features such as fast and convenient search and access to personalized Yandex services. Yandex.Bar was installed more than 300,000 times per month in the first quarter of 2008.
Our desktop application for indexing and searching files on a personal computer and on local networks, enabling full-text search of various file types, such as Microsoft Word and Adobe Acrobat as well as the most popular email programs in Russia.
Our server-side spam filtering solution designed for corporate users and internet service providers. As with Yandex.Mail, our software for servers performs comprehensive analyses of thousands of e-mail properties, measuring their significance, and ensuring precise filtering.
Our online notification service, which alerts a user on his or her desktop when a new message has been delivered to his or her Yandex.Mail account.