PyTorch-BigGraph Evaluation

During training, the average loss is reported for each edge bucket at each pass. Evaluation metrics can be computed on held-out data during or after training to measure the quality of trained embeddings.

Offline evaluation

The torchbiggraph_eval command will perform an offline evaluation of trained PBG embeddings on a validation dataset. This dataset should contain held-out data not included in the training dataset. It is invoked in the same way as the training command and takes the same arguments.
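For example, an invocation might look like the following (the config name and paths are placeholders; as with training, -p/--param overrides can be used to point the config at the validation edges). This assumes the validation edges have already been partitioned into the same format, with the same partitioning scheme, as the training edges:

    torchbiggraph_eval \
      my_config.py \
      -p edge_paths=data/edges_valid_partitioned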

It is generally advisable to have two versions of the config file, one for training and one for evaluation, with the same parameters except for the edge paths, in order to evaluate a separate (and often smaller) set of edges. (It’s also possible to use a single config file and have it produce different output based on environment variables or other context). Training-specific config parameters (e.g., the learning rate, loss function, …) will be ignored during evaluation.
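As a minimal sketch of the single-config-file approach: a PBG config module defines a get_torchbiggraph_config function returning a dict, so it can pick its edge_paths from an environment variable. All names and values below, other than the config keys themselves, are placeholders:

    import os

    def get_torchbiggraph_config():
        # Hypothetical layout: train edges by default, validation edges
        # when the EVAL_EDGES environment variable is set.
        edge_paths = (
            ["data/edges_valid_partitioned"]
            if os.environ.get("EVAL_EDGES")
            else ["data/edges_train_partitioned"]
        )
        return dict(
            entity_path="data/entities",
            edge_paths=edge_paths,
            checkpoint_path="model",
            entities={"all": {"num_partitions": 1}},
            relations=[
                {"name": "follows", "lhs": "all", "rhs": "all", "operator": "none"}
            ],
            dimension=400,
            # Training-specific parameters like these are simply ignored
            # by torchbiggraph_eval.
            num_epochs=10,
            lr=0.1,
        )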

The metrics are first reported on each bucket, and a global average is computed at the end. (If multiple edge paths are in use, metrics are computed separately for each of them but still ultimately averaged).

Many metrics are statistics based on the “ranks” of the edges of the validation set. The rank of a positive edge is determined by the rank of its score against the scores of a certain number of negative edges. A rank of 1 is the “best” outcome as it means that the positive edge had a higher score than all the negatives. Higher values are “worse” as they indicate that the positive didn’t stand out.
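As a toy illustration (assuming the common convention that a tie with a negative counts against the positive):

    # Rank of one positive edge among its sampled negatives.
    pos_score = 8.2
    neg_scores = [9.1, 4.5, 8.2, 0.7, 3.3]

    # Rank 1 means the positive outscored every negative; each negative
    # scoring at least as high pushes the rank up by one.
    rank = 1 + sum(s >= pos_score for s in neg_scores)
    print(rank)  # 3: beaten by 9.1, tied with 8.2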

It may happen that some of the negative samples used in the rank computation are in fact other positive samples, which are expected to have a high score and may thus cause adverse effects on the rank. This effect is especially visible on smaller graphs, in particular when all other entities are used to construct the negatives. To fix it, and to match what is typically done in the literature, a so-called “filtered” rank is used in the FB15k demo script (and there only), where positive samples are filtered out when computing the rank of an edge. It is hard to scale this technique to large graphs, and thus it is not enabled globally. However, filtering is less important on large graphs as it’s less likely to see a training edge among the sampled negatives.
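A sketch of the filtering step itself (the edge representation and names here are purely illustrative):

    # "Filtered" rank: drop sampled negatives that are in fact known
    # positives before ranking, so true edges don't depress the rank.
    known_positives = {("a", "likes", "c"), ("a", "likes", "d")}

    def filtered_rank(pos_score, scored_negatives):
        kept = [score for edge, score in scored_negatives
                if edge not in known_positives]
        return 1 + sum(score >= pos_score for score in kept)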

The metrics are:

• Mean Rank: the average of the ranks of all positives (lower is better, best is 1).
• Mean Reciprocal Rank (MRR): the average of the reciprocal of the ranks of all positives (higher is better, best is 1).
• Hits@1: the fraction of positives that rank better than all their negatives, i.e., have a rank of 1 (higher is better, best is 1).
• Hits@10: the fraction of positives that rank in the top 10 among their negatives (higher is better, best is 1).
• Hits@50: the fraction of positives that rank in the top 50 among their negatives (higher is better, best is 1).
• Area Under the Curve (AUC): an estimation of the probability that a randomly chosen positive scores higher than a randomly chosen negative (any negative, not only the negatives constructed by corrupting that positive).
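
Given the ranks of all validation positives, the rank-based metrics above reduce to simple aggregations; AUC is estimated differently (from positive/negative score pairs) and is omitted from this sketch:

    # Rank-based metrics from a list of per-positive ranks.
    ranks = [1, 3, 1, 12, 70, 2]

    mean_rank = sum(ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)

    def hits_at(k):
        return sum(r <= k for r in ranks) / len(ranks)

    print(mean_rank, mrr, hits_at(1), hits_at(10), hits_at(50))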

Evaluation during training

Offline evaluation is a slow process that is intended to be run after training is complete to evaluate the final model on a held-out set of edges constructed by the user. However, it’s useful to be able to monitor overfitting as training progresses. PBG offers this functionality by calculating the same metrics as the offline evaluation, before and after each pass, on a small set of training edges. These stats are printed to the logs.

The metrics are computed on a set of edges that is held out automatically from the training set. To be more explicit: using this feature means that training happens on fewer edges, as some are excluded and reserved for this evaluation. The holdout fraction is controlled by the eval_fraction config parameter (setting it to zero thus disables this feature). The evaluations before and after each training iteration happen on the same set of edges, thus are comparable. Moreover, the evaluations for the same edge chunk, edge path and bucket at different epochs also use the same set of edges.
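Roughly speaking (the exact rounding is an implementation detail), the holdout works out as follows:

    # With eval_fraction = 0.05, a bucket of 1,000,000 edges trains on
    # about 950,000 edges and evaluates on the same ~50,000 every epoch.
    num_edges = 1_000_000
    eval_fraction = 0.05
    num_eval = int(num_edges * eval_fraction)   # 50_000
    num_train = num_edges - num_eval            # 950_000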

Evaluation metrics are computed both before and after training each edge bucket because it provides insight into whether the partitioned training is working. If the partitioned training is converging, then the gap between the “before” and “after” statistics should go to zero over time. On the other hand, if the partitioned training is causing the model to overfit on each edge bucket (thus decreasing performance for other edge buckets) then there will be a persistent gap between the “before” and “after” statistics.

It’s possible to use a different number of same-batch and uniform negative samples for these evaluations than for training, by tuning the eval_num_batch_negs and eval_num_uniform_negs config parameters.
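For instance (a config fragment; the values are illustrative, not recommendations):

    def get_torchbiggraph_config():
        return dict(
            # ... entity, relation and path settings as usual ...
            eval_fraction=0.05,
            eval_num_batch_negs=50,      # same-batch negatives per positive in eval
            eval_num_uniform_negs=1000,  # uniformly sampled negatives in eval
        )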

