On the difficulty of pleasing everyone

Have you ever recommended a movie you loved to someone and got surprized when this person didn't like your recommendation? "I guess we don't have the same tastes for movies!" - you might have heard. Giving good recommendations for books, movies, clothes or sports equipment is very difficult especially if you know little about the person.

Online algorithm-based recommender systems face the same challenge. In our previous blog post we talked about how Netflix relies on their in-house movie recommender system to help viewers discover movies they will like. Most of the time, recommended movies should make sense to you as a viewer, but this is not always the case.

Here is a comment from one of Netflix' own blog posts:

bla

And a more radical one:
bla

A cursory search on the web will reveal that some users have similar complaints about Youtube, Goodreads, and others.

So, what's going on here? Are these people just trolling? Well, maybe, but there is a different plausible explanation:

It's hard to please everyone

Product recommender systems such as the ones used by Netflix, Goodreads, Youtube, or the one we built at TasteHit are complex pieces of software generating recommendations in real time for millions of people across hundreds of countries every day. These systems are designed to serve recommendations individually to each visitor based on his profile, whether the visitor logged in or is anonymous.

These systems "work well" for most people, most of the time (more on this later). However, it is enough for the system to work poorly for even a tiny fraction of the users for these users to get frustrated.

via GIPHY

The source of the problem

A potential problem lies in how performance is tracked. When you suggest a movie to a friend, your goal is clear: make the best possible recommendation for your friend. The objective for online recommender systems is less clear: even though recommendations are served individually, the performance is measured across all users, and this makes things more complicated. One can imagine different possible goals, e.g.:

  • to have the best possible average performance (over all users), or
  • to have the best possible worst-case performance. No user should get very bad recommendations.

Ideally, one would want to have a system that works equally well for all users. Unfortunately, this is not possible. One therefore often tries to find a good trade-off between the above two examples. However, it is difficult to ensure that no sub-set of users gets bad recommendations. Yves Raimond and Justin Basilico from Netflix say so themselves:

The objective is to build recommendation algorithms that work equally well for all of our members; no matter where they live or what language they speak. But with so many members in so many countries speaking so many languages, a challenge we now face is how to even figure out when an algorithm is sub-optimal for some subset of our members.

In other words, the system might perform very poorly for some users, but it is difficult to figure out for which users this is the case, and why. This is an interesting problem. Let me explain in more detail.

How recommender systems work

Algorithms that recommend products to users often rely on interaction data between users and items. This interaction data can be represented as a matrix:

The rows of the matrix represent individual users and columns represent individual products. An online shop which has a catalog of 1,000 products and 100,000 unique visitors in the last month can therefore be represented as a matrix of size 100,000 x 1,000. This might not be the best representation, as this matrix tends to be huge (and therefore difficult to process), but the exact representation does not matter for our discussion.

Every time a visitor interacts with a product (sees a product page, or watches/rates a movie), the corresponding cell in the matrix can be filled with a score, say "1" if the user interacted with the product in any way. If there has been no interaction between a user and an item, the corresponding cell in the matrix does not have a value.

Most visitors on online shops interact with only one or two products on the site. Here is the distribution of the number of page views per visitor at one of our customers:

This means that most rows will have just one cell with a known value ("1" in our case). Hence the matrix of interaction data is very sparse: most values of the matrix are unknown. The problem the product recommendation algorithm tries to solve is to guess the unknown values. In other words:

"What score rui would a user u give to a product i ?"

This question is interesting because the predicted value of rui gives us an indication of how likely a user u is to interact with an item i in the future. High values represent high probabilities of future interaction, and vice versa. To make, say, ten product recommendations to a given user, we need to find the ten products (columns) that have the highest predicted rui value.

But how are the missing rui predicted?

In the science of recommender systems, the approach of using the behavior of all visitors to predict the behavior of one individual visitor is called collaborative filtering. More precisely, since we are talking about guessing the missing values of a sparse matrix, this is also a matrix completion problem.

Collaborative filtering is one approach to the problem. There are other techniques (such as content-based filtering) and other challenges related to the type of data which is analyzed (e.g. differences in popularity of products, importance of context page on which the recommendation is given, etc...) that are interesting to talk about. I will keep that for another blog post.

These methods have a number of internal parameters that have to be optimized (or trained), using the matrix of interaction data (the rui values) as input. This training procedure attempts to optimize the result on a pre-defined goal, or metric, such as:

  • the average click-through rate of recommendations, or
  • the number of products added to cart after a click on a recommendation, or
  • the root-mean-square error (RMSE), which was used for example during the famous Netflix prize.

Note that these metrics all attempt to optimize the average performance of the system. An online shop usually monitors his conversion ratio (percentage of visitors converted into paying customers), which is an average metric across all of the shop. Giving good recommendations to most visitors is "good enough": it will most likely have a positive effect on conversion. But what about the visitors who don't like the recommendations they see?

Let us assume that a system was optimized with the goal of having a good average performance with a metric such as the RMSE. The performance of the system, and therefore the perceived quality of recommendations, will likely be very different for each user. To train the system to work equally well for all users, we would have to:

  • figure out for which users the system does not work well, and
  • design a new metric that attempts to prevent the system from performing poorly on this group of users.

These two steps are hard to automate, since it is unlikely that there is a well-defined group (such as "all users in South America") for which the system does not perform well. But putting this difficulty aside, we would then optimize the system again, this time on our newly defined metric. The likely result is that we would now observe that the system performs poorly on a new and different subset of users. We could then repeat the whole procedure ad infinitum without ever completely solving the problem.

Exploration vs. Exploitation

Another important consideration regarding the effectiveness of recommendations for individual users is the so-called exploration vs. exploitation trade-off.

Recommender systems can be thought of as making hypotheses on user's tastes, such as "I bet Bob will like X". If Bob indeed buys/watches/clicks on X, the recommender system's hypothesis is confirmed. Otherwise, perhaps the recommender system needs to make a new hypothesis regarding Bob's tastes.

The recommender system is in full exploitation mode when it makes no new hypotheses regarding Bob's tastes. It exploits the knowledge on Bob it already has (i.e. it uses all available knowledge to make informed predictions for the missing rui). In this mode, Bob is unlikely to be surprised by any recommendations he sees. If he watched a lot of action movies, it's likely he will see recommendations for more action movies.

via KDNuggets

In contrast, the system is in full exploration mode if the system tries out a lot of new and risky hypotheses (i.e. it makes wild, perhaps random, predictions for the missing rui). As an extreme example, the system might try to recommend romance movies to Bob, even though he previously only watched action movies. After all, it is possible that Bob likes both action and romance movies. The exploration mode is important because it is helpful in exploring and discovering new facets of a user's tastes.

A delicate balance has to be found between the exploration and exploitation modes for the system to be perceived as useful. If too much emphasis is put on the exploitation mode, the system will be perceived as too predictable and users might complain that the system did not make them discover new things. If too much emphasis is put on the exploration mode, the system will give the appearance of not working at all. Striking the right balance is difficult enough in itself, but an additional difficulty lies in the fact that each user probably prefers a different trade-off between the two modes.

Solution: Putting users in control

We talked about two apparently insurmountable problems in making a perfect recommender system that is 100% automatic. A potential solution to the problem is to create a system that is not 100% automatic, but provides some level of control to the users.

Some of our widgets at TasteHit, such as the Smart Menu do exactly that: recommendations are automatic by default, but the user has various ways of interacting with, and modifying the recommendations he receives, such as settings filters on various attributes of the items, such as the price, the brand, or the color.

The Smart Menu

Conceptually, it is a mix between a search bar (manual operation) and a personalized product feed (automatic operation), with the aim of getting the best of both worlds. These types of widgets are particularly effective in mobile environments where screens are small and bandwidth is scarce.

In a system that is 100% automatic, there is no way for the users to tell the system that it is wrong, which can be very frustrating (see the two comments regarding Netflix' system at the top of this article). Giving users the ability to easily modify the recommendations is therefore helpful in avoiding user frustration.

An added benefit of a system that is not 100% automatic, but in which the users interact with the system itself, is that more interaction data is collected. In fact, this type of interaction data is likely to be of higher quality than simple product page visits, because it provides explicit feedback on the quality of the recommendations. The collection of additional, high-quality data is useful in making the automatic recommendations better.

Conclusion

Fully automatic recommender systems are unlikely to always provide recommendations that are perceived as useful. One reason for this is that it is very difficult to build a data-driven algorithm that works well in all situations. Systems that are used by a large number of users, or have a large number of items are particularly at risk of being perceived as "wrong" or "useless" by some users even if the performance is good on a global scale and brings value to the site using it.

It is important to understand that users like to interact with recommendations by picking content, setting filters and discarding products they don't like. Giving some amount of control back to the user is a good solution for making useful to the user, while learning more on his tastes.