PyTorch-BigGraph Data Model


PBG operates on directed multi-relation multigraphs, whose vertices are called entities. Each edge connects a source entity to a destination entity, which are respectively called its left- and right-hand side (shortened to LHS and RHS). Multiple edges between the same pair of entities are allowed. Loops, i.e., edges whose left- and right-hand sides are the same entity (A->A), are allowed as well. Note that a pair of edges A->B and B->A is not a loop; it is simply two distinct edges between the same pair of entities.

key point: directed, multi-relation, multigraph

Each entity is of a certain entity type (one and only one type per entity). Thus, the types partition all the entities into disjoint groups. Similarly, each edge also belongs to exactly one relation type. All edges of a given relation type must have all their left-hand side entities of the same entity type and, similarly, all their right-hand side entities of the same entity type (possibly a different entity type than the left-hand side one). This property means that each relation type has a left-hand side entity type and a right-hand side entity type.
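The typing rule above can be sketched in a few lines of Python. This is only an illustration of the constraint, not PBG's own API: the `RelationType` class, the `edge_is_valid` helper and the entity and relation names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationType:
    name: str
    lhs_type: str  # entity type required on the left-hand side
    rhs_type: str  # entity type required on the right-hand side


# Hypothetical schema: every entity has exactly one entity type.
entity_type_of = {"alice": "user", "bob": "user", "book": "item"}

# Hypothetical relation types, each fixing its LHS and RHS entity types.
relations = {
    "follows": RelationType("follows", "user", "user"),
    "bought": RelationType("bought", "user", "item"),
}


def edge_is_valid(lhs: str, rel: str, rhs: str) -> bool:
    """Return True if both endpoints have the types fixed by the relation type."""
    r = relations[rel]
    return entity_type_of[lhs] == r.lhs_type and entity_type_of[rhs] == r.rhs_type
```

With this schema, `edge_is_valid("alice", "bought", "book")` holds, while `edge_is_valid("alice", "follows", "book")` does not, because "follows" requires a "user" entity on its right-hand side.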

key point: entity, relation, entity type, relation type

In this graph, there are 14 entities: 5 of the red entity type, 6 of the yellow entity type and 3 of the blue entity type; there are also 12 edges: 6 of the orange relation type (between red and yellow entities), 3 of the purple relation type (between red and blue entities) and 3 of the green relation type (between yellow and blue entities).

For PBG to operate on large-scale graphs, the graph is broken up into small pieces on which training can happen in a distributed manner. First, the entities of each type are split into a certain number of subsets, called partitions. Then, for each relation type, its edges are divided into buckets: for each pair of partitions (one from the left-hand side entity type and one from the right-hand side entity type of that relation type) a bucket is created, which contains the edges of that type whose left- and right-hand side entities fall in those two partitions.
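As a rough illustration of bucketing, the sketch below groups the edges of a single relation type by the pair (LHS partition, RHS partition). The function name `bucket_edges`, the entity names and the partition assignments are assumptions made for this example; PBG's actual preprocessing is not shown here.

```python
from collections import defaultdict


def bucket_edges(edges, lhs_partition, rhs_partition):
    """Group (lhs, rhs) edges of one relation type by (LHS partition, RHS partition)."""
    buckets = defaultdict(list)
    for lhs, rhs in edges:
        key = (lhs_partition[lhs], rhs_partition[rhs])
        buckets[key].append((lhs, rhs))
    return dict(buckets)


# Hypothetical partition assignments for two entity types.
red_partition = {"r0": 0, "r1": 1, "r2": 2}
yellow_partition = {"y0": 0, "y1": 1}

# Hypothetical edges of one relation type; duplicates are allowed (multigraph).
orange_edges = [("r0", "y0"), ("r2", "y1"), ("r2", "y1"), ("r1", "y0")]

buckets = bucket_edges(orange_edges, red_partition, yellow_partition)
```

Here the bucket `(2, 1)` ends up holding both copies of the edge from `r2` to `y1`, and partition pairs with no edges simply have no entry.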

key point: partition, bucket

This graph shows a possible partition of the entities, with red having 3 partitions, yellow having 3, and blue having only one (hence blue is unpartitioned). The edges displayed are those of the orange bucket between partition 2 of the red entities and partition 1 of the yellow entities.

For technical reasons, currently all entity types that appear on the left-hand side of some relation type must be divided into the same number of partitions (except unpartitioned entity types, which have a single partition). The same must hold for all entity types that appear on the right-hand side. In NumPy terms, the partition counts of all these entity types must be broadcastable to the same value.
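The constraint can be checked with a small helper. This is a sketch under the assumption that "unpartitioned" means a partition count of 1; the function `partition_counts_ok` and its argument layout are hypothetical and not part of PBG.

```python
def partition_counts_ok(num_partitions, relations):
    """Check the partition-count constraint on each side separately.

    num_partitions: dict mapping entity type -> number of partitions.
    relations: list of (lhs_entity_type, rhs_entity_type) pairs.
    """
    for side in (0, 1):  # 0 = left-hand side, 1 = right-hand side
        counts = {num_partitions[rel[side]] for rel in relations}
        counts.discard(1)  # unpartitioned types are compatible with any count
        if len(counts) > 1:
            return False
    return True


# Partition counts matching the example above: red 3, yellow 3, blue 1.
counts = {"red": 3, "yellow": 3, "blue": 1}
rels = [("red", "yellow"), ("red", "blue"), ("yellow", "blue")]
ok = partition_counts_ok(counts, rels)
```

With these counts the check passes; changing yellow to 2 partitions would make it fail, since red (3) and yellow (2) both appear on a left-hand side.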

An entity is identified by its type, its partition and its index within the partition (indices must be contiguous, meaning that if there are 𝑁 entities in a type’s partition, their indices lie in the half-open interval [0,𝑁)). An edge is identified by its type, its bucket (i.e., the partitions of its left- and right-hand side entity types) and the indices of its left- and right-hand side entities in their respective partitions. An edge doesn’t have to specify its left- and right-hand side entity types, because they are implicit in the edge’s relation type.
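To make the (type, partition, index) identifier concrete, the sketch below assigns identifiers to the entities of one type so that the indices inside each partition are contiguous. The round-robin assignment is an arbitrary choice made for this illustration, and `assign_ids` is a hypothetical helper, not a PBG function.

```python
def assign_ids(entity_type, names, num_partitions):
    """Map each name to (entity_type, partition, index within the partition)."""
    next_index = [0] * num_partitions  # next free index in each partition
    ids = {}
    for position, name in enumerate(names):
        partition = position % num_partitions  # round-robin, illustrative only
        ids[name] = (entity_type, partition, next_index[partition])
        next_index[partition] += 1
    return ids


ids = assign_ids("red", ["a", "b", "c", "d", "e"], num_partitions=2)
```

Partition 0 receives three entities with indices 0, 1, 2 and partition 1 receives two with indices 0, 1, so each partition's indices fill a half-open interval starting at 0.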

Formally, each bucket can be identified by a pair of integers (𝑖,𝑗), where 𝑖 and 𝑗 are respectively the left- and right-hand side partitions. Inside that bucket, each edge can be identified by a triplet of integers (𝑥,𝑟,𝑦), with 𝑥 and 𝑦 representing the indices of the left- and right-hand side entities within their respective partitions and 𝑟 representing the relation type. This edge is "interpreted" by first looking up relation type 𝑟 in the configuration, and finding out that it can only have entities of type 𝑒1 on its left-hand side and of type 𝑒2 on its right-hand side. One can then determine the left-hand side entity, which is given by (𝑒1,𝑖,𝑥) (its type, its partition and its index within the partition), and, similarly, the right-hand side one, which is (𝑒2,𝑗,𝑦).
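The lookup described above can be written as a short function. This is a minimal sketch that assumes the configuration is available as a list where position 𝑟 holds the pair (𝑒1, 𝑒2) for relation type 𝑟; the name `resolve_edge` and the sample values are hypothetical.

```python
def resolve_edge(bucket, edge, relation_types):
    """Turn a bucket (i, j) and an edge (x, r, y) into two full entity identifiers."""
    i, j = bucket
    x, r, y = edge
    e1, e2 = relation_types[r]  # entity types fixed by relation type r
    return (e1, i, x), (e2, j, y)


# Hypothetical configuration: index r -> (LHS entity type, RHS entity type).
relation_types = [("red", "yellow"), ("red", "blue"), ("yellow", "blue")]

# Edge with indices x=0 and y=4, of relation type r=0, stored in bucket (2, 1).
lhs_entity, rhs_entity = resolve_edge((2, 1), (0, 0, 4), relation_types)
```

The edge itself never stores entity types or partitions: the types come from the relation type and the partitions come from the bucket the edge lives in.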


