PyTorch-BigGraph: Loss Calculation

The training process aims at finding the embeddings for the entities so that the scores of the positive edges are higher than the scores of the negative edges. When unpacking what this means, three different aspects come into play:

• One must first determine which edges are to be considered as positive and negative samples.
• Then, once the scores of all the samples have been determined, one must decide how to aggregate them in a single loss value.
• Finally, one must decide how to go about optimizing that loss.

This chapter will dive into each of these issues.

Negative sampling

The edges provided in the input data are known to be positives but, as PBG operates under the open-world assumption, the edges that are not in the input are not necessarily negatives. However, as PBG is designed to perform on large sparse graphs, it relies on the approximation that any random edge is a negative with very high probability.

The goal of sampling negatives is to produce a set of negative edges for each positive edge of a batch. Usual downstream applications (ranking, classification, …) are interested in comparing the score of an edge (𝑥,𝑟,𝑦1) with the score of an edge (𝑥,𝑟,𝑦2). Therefore, PBG produces negative samples for a given positive edge by corrupting the entity on one of its sides, keeping the other side and the relation type intact. This makes the sampling more suited to the task.

For performance reasons, the set of entities used to corrupt the positive edges in order to produce the negative samples may be shared across several positive edges. The way this usually happens is that positive edges are split into “chunks”, a single set of entities is sampled for each chunk, and all edges in that chunk are corrupted using that set of entities.
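
To make this concrete, here is a minimal NumPy sketch (not PBG's actual code) of the chunking idea: a batch of positive edges is split into chunks, one set of entities is sampled per chunk, and that set corrupts the right-hand side of every edge in the chunk. All sizes and names below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy batch: each row is a positive edge (lhs, rel, rhs), with
# entity indices drawn from a single partition of 1000 entities.
num_entities = 1000
positives = np.stack([
    rng.integers(0, num_entities, 50),  # left-hand side entity
    rng.integers(0, 4, 50),             # relation type
    rng.integers(0, num_entities, 50),  # right-hand side entity
], axis=1)

chunk_size = 10      # positive edges per chunk
negs_per_chunk = 5   # entities sampled once and shared by the whole chunk

for start in range(0, len(positives), chunk_size):
    chunk = positives[start:start + chunk_size]
    # One shared set of candidate entities per chunk ...
    candidates = rng.integers(0, num_entities, size=negs_per_chunk)
    # ... used to corrupt the right-hand side of every edge in the chunk,
    # keeping the left-hand side and the relation type intact.
    lhs = np.repeat(chunk[:, 0:1], negs_per_chunk, axis=1)
    rel = np.repeat(chunk[:, 1:2], negs_per_chunk, axis=1)
    rhs = np.broadcast_to(candidates, (len(chunk), negs_per_chunk))
    negatives = np.stack([lhs, rel, rhs], axis=2)  # shape (chunk, negs, 3)
```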

PBG supports several ways of sampling negatives:

All negatives

The most straightforward way is to use, for each positive edge, all its possible negatives. What this means is that for a positive (𝑥,𝑟,𝑦) (where 𝑥 and 𝑦 are the left- and right-hand side entities respectively and 𝑟 is the relation type), its negatives will be (𝑥′,𝑟,𝑦) for all 𝑥′ of the same entity type as 𝑥 and (𝑥,𝑟,𝑦′) for all 𝑦′ of the same entity type as 𝑦. (Due to technical reasons this is in fact restricted to only the 𝑥′ in the same partition as 𝑥, and similarly for 𝑦′, as negative sampling always operates within the current bucket.)

As one can imagine, this method generates a lot of negatives and thus doesn’t scale to graphs of any significant size. It should not be used in practice, and is provided in PBG mainly for “academic” reasons. It is mainly useful to get more accurate results during evaluation on small graphs.

This method is activated on a per-relation basis, by turning on the all_negs config flag. When it’s enabled, this mode takes precedence and overrides any other mode.
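
For example, a relation entry in the config might look like the following partial sketch (only the relations list is shown; the relation name and operator are placeholders, and the rest of the configuration is omitted):

```python
# Partial config sketch: enable exhaustive negatives for one relation type.
# All other keys (entities, entity_path, edge_paths, dimension, ...) are
# omitted here for brevity.
relations = [
    {
        "name": "follows",   # hypothetical relation name
        "lhs": "user",       # hypothetical entity types
        "rhs": "user",
        "operator": "none",
        "all_negs": True,    # use every possible negative for this relation
    },
]
```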

Same-batch negatives

This negative sampling method produces negatives for a given positive edge of a batch by sampling from the other edges of the same batch. This is done by first splitting the batch into so-called chunks (beware that the name “chunks” is overloaded, and these chunks are different than the edge chunks explained in Batch preparation). Then the set of negatives for a positive edge (𝑥,𝑟,𝑦) contains the edges (𝑥′,𝑟,𝑦) for all entities 𝑥′ that are on left-hand side of another edge in the chunk, and the edges (𝑥,𝑟,𝑦′) with 𝑦′ satisfying the same condition for the right-hand side.

For a single positive edge, this means that the entities used to construct its negatives are sampled from the current partition proportionally to their degree, a.k.a., according to the data distribution. This helps counteract the effects of a very skewed distribution of degrees, which might cause the embeddings to just capture that distribution.

The size of the chunks is controlled by the global num_batch_negs parameter. To disable this sampling mode, set that parameter to zero.
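
The following rough sketch illustrates the idea (the real implementation works on batched tensors rather than a Python loop): for each positive edge of a chunk, the candidate entities are simply the ones appearing on the same side of the other edges in that chunk.

```python
import numpy as np

# Hypothetical chunk of positive edges (lhs, rel, rhs) taken from one batch.
chunk = np.array([
    [ 3, 0, 17],
    [ 8, 0, 42],
    [ 3, 0,  5],
    [11, 0, 17],
])

for i, (lhs, rel, rhs) in enumerate(chunk):
    # Right-hand-side negatives: reuse the rhs entities of the *other* edges
    # in the same chunk, so candidates follow the data distribution.
    other_rhs = np.delete(chunk[:, 2], i)
    rhs_negatives = [(lhs, rel, y_neg) for y_neg in other_rhs]
    # Left-hand-side negatives are built symmetrically from the other lhs.
    other_lhs = np.delete(chunk[:, 0], i)
    lhs_negatives = [(x_neg, rel, rhs) for x_neg in other_lhs]
```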

Uniformly-sampled negatives

This last method is perhaps the most natural approximation of the “all negatives” method that scales to arbitrarily large graphs. Instead of using all the entities on either side to produce negative edges (thus having the number of negatives scale linearly with the size of the graph), a fixed given number of these entities is sampled uniformly with replacement. Thus the set of negatives remains of constant size no matter how large the graph is. As with the “all negatives” method, the sampling here is restricted to the entities that have the same type and that belong to the same partition as the entity of the positive edge.

This method interacts with the same-batch method, as all the edges in a chunk receive the same set of uniformly sampled negatives. This caveat means that the uniform negatives of two different positives are independent and uncorrelated only if they belong to different chunks.

This method is controlled by the num_uniform_negs parameter, which controls how many negatives are sampled for each chunk. If num_batch_negs is zero, the batches will be split into chunks of size num_uniform_negs.
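
A partial config sketch combining the two parameters might look as follows; the numeric values are arbitrary and the rest of the configuration is omitted:

```python
# Partial config sketch: each chunk gets same-batch negatives plus a shared
# set of uniformly sampled negatives. Setting num_batch_negs to 0 disables
# same-batch sampling, and chunks then have size num_uniform_negs.
config = dict(
    num_batch_negs=500,    # chunk size / same-batch negatives per positive
    num_uniform_negs=500,  # uniformly sampled negatives shared per chunk
    # ... entities, relations, paths, dimension, etc. omitted ...
)
```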

Loss functions

Once positive and negative samples have been determined and their scores have been computed by the model, the scores’ suitability for a certain application must be assessed, which is done by aggregating them into a single real value, the loss. What loss function is most appropriate depends on what operations the embeddings will be used for.

In all cases, the loss function’s input will be a series of scores for positive samples and, for each of them, a set of scores for corresponding negative samples. For simplicity, suppose all these sets are of the same size (if they are not, they can be padded with “negative infinity” values, as these are the “ideal” scores for negative edges and should thus induce no loss).

Ranking loss

The ranking loss compares each positive score with each of its corresponding negatives. For each such pair, no loss is introduced if the positive score is greater than the negative one by at least a given margin. Otherwise the incurred loss is the amount by which that inequality is violated. This is the hinge loss on the difference between the positive and negative score. Formally, for a margin 𝑚, a positive score 𝑠𝑖 and a negative score 𝑡𝑖,𝑗, the loss is max(0,𝑚−𝑠𝑖+𝑡𝑖,𝑗). The total loss is the sum of the losses over all pairs of positive and negative scores, i.e., over all 𝑖 and 𝑗.

This loss function is chosen by setting the loss_fn parameter to ranking, and the target margin is specified by the margin parameter.

This loss function is suitable when the setting requires ranking some entities by how likely they are to be related to another given entity.
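
As a reference, here is a small PyTorch sketch of this formula (not PBG's implementation), using the “negative infinity” padding convention described earlier:

```python
import torch

def ranking_loss(pos_scores, neg_scores, margin=0.1):
    """Hinge loss on the positive/negative score differences.

    pos_scores: (num_pos,) scores of the positive edges.
    neg_scores: (num_pos, num_neg) scores of the corresponding negatives,
        padded with -inf where a positive has fewer negatives.
    """
    # max(0, m - s_i + t_ij), summed over all pairs (i, j).
    diffs = margin - pos_scores.unsqueeze(1) + neg_scores
    return torch.clamp_min(diffs, 0).sum()

pos = torch.tensor([2.0, 0.5])
neg = torch.tensor([[1.5, -0.3], [0.9, float("-inf")]])
print(ranking_loss(pos, neg, margin=0.1))  # tensor(0.5000)
```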

Logistic loss

The logistic loss instead interprets the scores as the probabilities that the edges exist. It does so by first passing each score (whose domain is the entire real line) through the logistic function (𝑥↦1/(1+𝑒−𝑥), which maps it to a value between 0 and 1). This value is taken as the probability 𝑝 and the loss will be its binary cross entropy with the “target” probability, i.e., 1 for positive edges and 0 for negative ones. In formulas, the loss for positives is −log𝑝 whereas for negatives it’s −log(1−𝑝). The total loss due to the negatives is renormalized so that it is comparable with that of the positives.

One can see this as the cross entropy between two distributions on the values “edge exists” and “edge doesn’t exist”. One is given by the score (passed through the logistic function), the other has all the mass on “exists” for positives or all the mass on “doesn’t exist” for negatives.

This loss function is parameterless and is enabled by setting loss_fn to logistic.
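
A possible PyTorch sketch of this loss is shown below; the way the negative term is renormalized here (giving the negatives the same total weight as the positives) is an assumption about what “renormalized” means, not necessarily PBG's exact weighting:

```python
import torch
import torch.nn.functional as F

def logistic_loss(pos_scores, neg_scores):
    """Binary cross entropy on sigmoid(score) against targets 1 / 0.

    pos_scores: (num_pos,) scores of the positive edges.
    neg_scores: (num_pos, num_neg) scores of the corresponding negatives.
    """
    pos_loss = F.binary_cross_entropy_with_logits(
        pos_scores, torch.ones_like(pos_scores), reduction="sum")
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_scores, torch.zeros_like(neg_scores), reduction="sum")
    # Assumed renormalization: scale the negative term so that it carries
    # the same total weight as the positive term.
    neg_weight = pos_scores.numel() / max(neg_scores.numel(), 1)
    return pos_loss + neg_weight * neg_loss
```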

Softmax loss

The last loss function is designed for when one wants a distribution on the probabilities of some entities being related to a given entity (contrary to just wanting a ranking, as with the ranking loss). For a certain positive 𝑖, its score 𝑠𝑖 and the score 𝑡𝑖,𝑗 of all the corresponding negatives 𝑗 are first converted to probabilities by performing a softmax: 𝑝𝑖∝𝑒𝑠𝑖 and 𝑞𝑖,𝑗∝𝑒𝑡𝑖,𝑗, normalized so that they sum up to 1. Then the loss is the cross entropy between this distribution and the “target” one, i.e., the one that puts all the mass on the positive sample. So, in full, the loss for a single 𝑖 is −log𝑝𝑖, i.e., −𝑠𝑖+log∑𝑗𝑒𝑡𝑖,𝑗.

This loss is activated by setting loss_fn to softmax.
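
A compact PyTorch sketch of this loss follows; here the positive's own score is included in the normalization, consistent with the description that the probabilities sum up to 1:

```python
import torch
import torch.nn.functional as F

def softmax_loss(pos_scores, neg_scores):
    """Cross entropy against the distribution putting all mass on the positive.

    pos_scores: (num_pos,) scores of the positive edges.
    neg_scores: (num_pos, num_neg) scores of the corresponding negatives,
        padded with -inf where needed (exp(-inf) = 0, so padding adds no mass).
    """
    # Candidate scores for each positive: its own score, then its negatives.
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    # The "target" distribution puts all its mass on index 0 (the positive).
    targets = torch.zeros(len(pos_scores), dtype=torch.long)
    return F.cross_entropy(logits, targets, reduction="sum")
```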

Optimizers

The Adagrad optimization method is used to update all model parameters. Adagrad performs stochastic gradient descent with an adaptive per-parameter learning rate that is inversely proportional to the square root of the sum of squared magnitudes of all previous updates. In practice, Adagrad updates lead to an order of magnitude faster convergence for typical PBG models.

The initial learning rate for Adagrad is specified by the lr config parameter. A separate learning rate can also be set for non-embeddings using the relation_lr parameter.

Standard Adagrad requires an equal amount of memory for optimizer state as the size of the model, which is prohibitive for the large models targeted by PBG. To reduce optimizer memory usage, a modified version of Adagrad is used that uses a common learning rate for each entity embedding. The learning rate is proportional to the inverse sum of the squared gradients from each element of the embedding, divided by the dimension. Non-embedding parameters (e.g. relation operator parameters) use standard Adagrad.
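
A rough sketch of such a row-wise Adagrad update is given below (not PBG's actual optimizer code); the square root in the denominator follows the usual Adagrad convention, so treat the exact scaling as an assumption:

```python
import torch

def row_adagrad_update(embeddings, grad, state, lr=0.1, eps=1e-10):
    """Sketch of an Adagrad variant with one accumulator per embedding row.

    embeddings, grad: (num_rows, dim) tensors.
    state: (num_rows,) running accumulator, one scalar per embedding rather
        than one per element, which is what keeps the optimizer state small.
    """
    # Accumulate the squared gradient of each row, divided by the dimension.
    state += grad.pow(2).mean(dim=1)
    # Scale the step for a whole row by a single adaptive factor.
    embeddings -= lr * grad / (state.sqrt().unsqueeze(1) + eps)
    return embeddings, state
```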

Adagrad parameters are updated asynchronously across worker threads with no explicit synchronization. Asynchronous updates to the Adagrad state (the total squared gradient) appear stable, likely because each element of the state tensor only accumulates positive updates. Optimization is further stabilized by performing a short period of training with a single thread before beginning Hogwild! training, which is tuned by the hogwild_delay parameter.

In distributed training, the Adagrad state for shared parameters (e.g. relation operator parameters) is shared via the parameter server using the same asynchronous gradient update as the parameters themselves. Similar to inter-thread synchronization, these asynchronous updates are stable after an initial burn-in period because the total squared gradient strictly accumulates positive values.

Source: https://torchbiggraph.readthedocs.io/en/latest/loss_optimization.html

