
Ranking Problems

In many domains, data scientists are asked not only to predict which class or classes an example belongs to, but also to rank the classes according to how likely they are for that example.

Classification: Order of predictions doesn’t matter
Ranking: Order of predictions does matter

This is often the case because, in the real world, resources are limited.

This means that whoever uses the predictions your model makes has limited time and limited space, so they will likely prioritize. Some domains where this effect is particularly noticeable:

• Search engines: Predict which documents match a query on a search engine.
• Tag suggestion for Tweets: Predict which tags should be assigned to a tweet.
• Image label prediction: Predict what labels should be suggested for an uploaded picture.

If your machine learning model produces a real-valued score for each of the possible classes, you can turn a classification problem into a ranking problem.

In other words, if you predict scores for a set of examples and you have a ground truth, you can order your predictions from highest to lowest and compare them with the ground truth:

• Search engines: Do relevant documents appear near the top of the list or down at the bottom?
• Tag suggestion for Tweets: Are the correct tags predicted with a higher score or not?
• Image label prediction: Does your system correctly give more weight to correct labels?
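The idea of ordering scored predictions and lining them up against the ground truth can be sketched as follows. This is a minimal illustration, not the post's own code: the function name `rank_by_score`, the document names and the scores are all hypothetical.

```python
def rank_by_score(scores, relevant):
    """Sort labels from highest to lowest score and flag the relevant ones.

    scores:   dict mapping a label (e.g. a document id) to a model score
    relevant: set of labels that are correct according to the ground truth
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(label, score, label in relevant) for label, score in ranked]


# Hypothetical model scores and ground truth, for illustration only.
scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.5, "doc_d": 0.7}
relevant = {"doc_b", "doc_c"}

ranking = rank_by_score(scores, relevant)
for position, (label, score, is_relevant) in enumerate(ranking, start=1):
    print(position, label, score, "relevant" if is_relevant else "not relevant")
```

The list of relevant / not relevant flags in ranked order is all that the rank-based metrics in the following sections need as input.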

In the following sections, we will go over many ways to evaluate ranked predictions with respect to actual values, or ground truth.

Sample dataset (Ground Truth)

We will use the following dummy dataset to illustrate examples in this post:

Precision @k

Precision @k is simply precision evaluated only up to the k-th prediction, i.e.:

Precision @k = (number of relevant items among the top k predictions) / k

Example: Precision @1 = 1/1 = 1, Precision @4 = 3/4 = 0.75, Precision @8 = Precision = 4/8 = 0.5
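A minimal sketch of Precision @k in Python follows. The `relevance` list is a hypothetical ranking (1 = relevant, 0 = not relevant, best score first), chosen only so that it reproduces the figures quoted in the example; it is not necessarily the post's actual dataset. Dividing by k even when fewer than k predictions exist is one common convention and is an assumption here.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k predictions that are relevant.

    relevance: list of 0/1 flags in ranked order (highest score first)
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(relevance[:k]) / k


# Hypothetical ranking: 8 predictions, 4 of them relevant.
relevance = [1, 0, 1, 1, 0, 0, 0, 1]

print(precision_at_k(relevance, 1))  # 1.0
print(precision_at_k(relevance, 4))  # 0.75
print(precision_at_k(relevance, 8))  # 0.5
```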

Recall @k

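Recall @k is simply Recall evaluated only up to the k-th prediction, i.e.:

Recall @k = (number of relevant items among the top k predictions) / (total number of relevant items)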

Example: Recall @1 = 1/4 = 0.25, Recall @4 = 3/4 = 0.75, Recall @8 = Recall = 4/4 = 1
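Recall @k can be sketched the same way. As before, `relevance` is a hypothetical ranking picked to match the numbers in the example, and the optional `total_relevant` parameter is an assumption added for the case where some relevant items never appear in the ranked list.

```python
def recall_at_k(relevance, k, total_relevant=None):
    """Fraction of all relevant items that appear in the top-k predictions.

    relevance:      list of 0/1 flags in ranked order (highest score first)
    total_relevant: number of relevant items in the ground truth; defaults
                    to the number of relevant items present in the list
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if total_relevant is None:
        total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # convention chosen here: no relevant items means recall 0
    return sum(relevance[:k]) / total_relevant


# Hypothetical ranking: 8 predictions, 4 of them relevant.
relevance = [1, 0, 1, 1, 0, 0, 0, 1]

print(recall_at_k(relevance, 1))  # 0.25
print(recall_at_k(relevance, 4))  # 0.75
print(recall_at_k(relevance, 8))  # 1.0
```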

F1 @k

F1 @k is a rank-based metric that can be summarized as follows: “What F1-score do I get if I only consider the top k predictions my model outputs?”

Example: F1 @1 = 2 × (1/1 × 1/4) / (1/1 + 1/4) = 0.4
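F1 @k is the harmonic mean of Precision @k and Recall @k, which can be sketched as below. The `relevance` list is again a hypothetical ranking that reproduces the example's F1 @1, and returning 0 when both precision and recall are 0 is an assumed convention.

```python
def f1_at_k(relevance, k):
    """Harmonic mean of precision and recall over the top-k predictions.

    relevance: list of 0/1 flags in ranked order (highest score first)
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(relevance[:k])
    total_relevant = sum(relevance)
    if hits == 0 or total_relevant == 0:
        return 0.0  # precision and recall are both 0 (or recall is undefined)
    precision = hits / k
    recall = hits / total_relevant
    return 2 * precision * recall / (precision + recall)


# Hypothetical ranking: 8 predictions, 4 of them relevant.
relevance = [1, 0, 1, 1, 0, 0, 0, 1]

print(round(f1_at_k(relevance, 1), 4))  # 0.4
print(round(f1_at_k(relevance, 4), 4))  # 0.75
```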

AP (Average Precision)

AP is a metric that tells you how many of the relevant documents are concentrated in the highest ranked predictions.

For each threshold level k, take the difference between the Recall at the current level and the Recall at the previous level, and multiply it by the Precision at the current level. Then sum the contributions over all levels:

AP = sum over k of (Recall @k - Recall @(k-1)) × Precision @k, with Recall @0 = 0
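The recall-difference formulation described above can be sketched in Python as follows. The `relevance` list is a hypothetical ranking (1 = relevant, 0 = not relevant, best score first), so the resulting value illustrates the computation rather than any figure from the post.

```python
def average_precision(relevance):
    """Sum over k of (Recall@k - Recall@(k-1)) * Precision@k.

    relevance: list of 0/1 flags in ranked order (highest score first)
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # convention chosen here: no relevant items means AP 0

    ap = 0.0
    hits = 0
    previous_recall = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        precision = hits / k
        recall = hits / total_relevant
        # Recall only changes at positions holding a relevant item,
        # so non-relevant positions contribute nothing to the sum.
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Hypothetical ranking: relevant items at positions 1, 3, 4 and 8.
relevance = [1, 0, 1, 1, 0, 0, 0, 1]

print(round(average_precision(relevance), 4))  # 0.7292
```

Because recall rises by 1 / (number of relevant items) at each relevant position and stays flat elsewhere, this is the same as averaging Precision @k over the positions where a relevant item appears: (1/1 + 2/3 + 3/4 + 4/8) / 4 for the ranking above.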


MAP (Mean Average Precision)

AP (Average Precision) is a metric that tells you how a single sorted prediction compares with the ground truth. E.g. AP would tell you how correct a single ranking of documents is, with respect to a single query.

AP: Informs you how correct a model’s ranked predictions are for a single example
MAP: Informs you how correct a model’s ranked predictions are, on average, over a whole validation dataset.
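MAP is then just the mean of the per-example AP values. The sketch below assumes each example (for instance, each query) is given as its own list of 0/1 relevance flags in ranked order; the two rankings and the name `mean_average_precision` are hypothetical.

```python
def average_precision(relevance):
    """Mean of Precision@k over the positions k that hold a relevant item."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # convention chosen here: no relevant items means AP 0
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / total_relevant


def mean_average_precision(rankings):
    """Average the AP of every ranked prediction list in the dataset."""
    if not rankings:
        raise ValueError("at least one ranking is required")
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical validation set with two queries.
rankings = [
    [1, 0, 1, 1, 0, 0, 0, 1],  # AP = 35/48
    [0, 1],                    # AP = 1/2
]

print(round(mean_average_precision(rankings), 4))  # 0.6146
```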

Source: https://queirozf.com/entries/evaluation-metrics-for-ranking-problems-introduction-and-examples#ranking-problems

Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting.