PyTorch-BigGraph Data Model
The PBG model
PBG operates on directed multi-relation multigraphs, whose vertices are called entities. Each edge connects a source entity to a destination entity, which are respectively called its left- and right-hand side (shortened to LHS and RHS). Multiple edges between the same pair of entities are allowed. Loops, i.e., edges whose left- and right-hand sides are the same entity (A->A), are allowed as well.
key point: directed, multi-relation, multigraph
Each entity has exactly one entity type, so the types partition all the entities into disjoint groups. Similarly, each edge belongs to exactly one relation type. All edges of a given relation type must have left-hand side entities of a single entity type and right-hand side entities of a single entity type (possibly a different one than on the left-hand side). This property means that each relation type has a left-hand side entity type and a right-hand side entity type.
key point: entity, relation, entity type, relation type
In this graph, there are 14 entities: 5 of the red entity type, 6 of the yellow entity type and 3 of the blue entity type. There are also 12 edges: 6 of the orange relation type (between red and yellow entities), 3 of the purple relation type (between red and blue entities) and 3 of the green relation type (between yellow and blue entities).
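The typing rules above can be written down as a small schema. The following is a minimal sketch, not PBG's own API: the `RelationType` class and `check_edge` helper are hypothetical, the edge directions (red to yellow, red to blue, yellow to blue) are assumed because the text only says "between", and entities are assumed to be numbered from 0 within each type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationType:
    """Hypothetical description of a relation type: a name plus its two entity types."""
    name: str
    lhs: str  # entity type allowed on the left-hand side
    rhs: str  # entity type allowed on the right-hand side


# Entity counts per type, taken from the example above.
ENTITY_COUNTS = {"red": 5, "yellow": 6, "blue": 3}

# Relation types from the example; the LHS/RHS direction is an assumption.
RELATIONS = {
    "orange": RelationType("orange", lhs="red", rhs="yellow"),
    "purple": RelationType("purple", lhs="red", rhs="blue"),
    "green": RelationType("green", lhs="yellow", rhs="blue"),
}


def check_edge(lhs_index: int, relation: str, rhs_index: int) -> bool:
    """Return True if both endpoints exist in the entity types the relation requires."""
    rel = RELATIONS[relation]
    return (
        0 <= lhs_index < ENTITY_COUNTS[rel.lhs]
        and 0 <= rhs_index < ENTITY_COUNTS[rel.rhs]
    )
```

Because the relation type fixes both entity types, an edge only needs to carry two indices and a relation name.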
For PBG to operate on large-scale graphs, the graph is broken up into small pieces on which training can happen in a distributed manner. This is achieved by first splitting the entities of each type into a certain number of subsets, called partitions. Then, for each relation type, its edges are divided into buckets: for each pair of partitions (one from the left-hand side entity type and one from the right-hand side entity type of that relation type) a bucket is created, which contains the edges of that type whose left- and right-hand side entities are in those partitions.
key point: partition, bucket
This graph shows a possible partitioning of the entities, with red having 3 partitions, yellow having 3, and blue having only one (hence blue is unpartitioned). The edges displayed are those of the orange bucket between partition 2 of the red entities and partition 1 of the yellow entities.
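To make the partition and bucket idea concrete, here is a rough sketch for a single relation type. The edge list is invented for illustration, and assigning an entity to a partition by taking its id modulo the partition count is only an assumption of this sketch, not a description of how PBG assigns partitions.

```python
from collections import defaultdict


def partition_of(entity_id: int, num_partitions: int) -> int:
    """Hypothetical partition assignment: entity id modulo the partition count."""
    return entity_id % num_partitions


def bucket_edges(edges, lhs_parts: int, rhs_parts: int):
    """Group (lhs, rhs) edges of one relation type by (lhs partition, rhs partition)."""
    buckets = defaultdict(list)
    for lhs, rhs in edges:
        key = (partition_of(lhs, lhs_parts), partition_of(rhs, rhs_parts))
        buckets[key].append((lhs, rhs))
    return dict(buckets)


# Six made-up edges of one relation type, as (lhs entity id, rhs entity id).
orange_edges = [(0, 0), (1, 4), (2, 1), (3, 3), (4, 5), (2, 4)]
buckets = bucket_edges(orange_edges, lhs_parts=3, rhs_parts=3)
```

Every edge lands in exactly one bucket, so each bucket can be trained on while only the two partitions it refers to are loaded.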
Note
For technical reasons, all entity types that appear on the left-hand side of some relation type must currently be divided into the same number of partitions (except unpartitioned entity types). The same must hold for all entity types that appear on the right-hand side. In numpy-speak, it means that the numbers of partitions of all entity types must be broadcastable to the same value.
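For example, suppose a single entity type appears on both the left- and right-hand side and there is only one relation type. If that entity type is split into 2 partitions, then both sides have 2 partitions, which gives 2 x 2 = 4 buckets.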
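The broadcasting rule and the resulting bucket count can be checked with a few lines of code. This is a sketch under the reading given in the note above; `broadcast_partitions` and `num_buckets` are hypothetical helper names, not PBG functions.

```python
def broadcast_partitions(counts):
    """Return the common partition count for one side of the relations.

    A count of 1 (an unpartitioned entity type) is compatible with anything;
    all other counts on the same side must be equal.
    """
    distinct = {c for c in counts if c != 1}
    if len(distinct) > 1:
        raise ValueError(f"partition counts are not broadcastable: {sorted(distinct)}")
    return distinct.pop() if distinct else 1


def num_buckets(lhs_counts, rhs_counts) -> int:
    """Size of the (i, j) bucket grid: LHS partitions times RHS partitions."""
    return broadcast_partitions(lhs_counts) * broadcast_partitions(rhs_counts)
```

With one entity type split into 2 partitions on both sides, `num_buckets([2], [2])` gives 4. With partition counts of 3 and 3 on the left and 3 and 1 on the right, the unpartitioned type broadcasts and the grid is 3 x 3.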
An entity is identified by its type, its partition and its index within the partition (indices must be contiguous, meaning that if there are 𝑁 entities in a type's partition, their indices lie in the half-open interval [0,𝑁)). An edge is identified by its type, its bucket (i.e., the pair of partitions of its left- and right-hand side entities) and the indices of its left- and right-hand side entities within their respective partitions. An edge doesn't have to specify its left- and right-hand side entity types, because they are implicit in the edge's relation type.
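As an illustration of contiguous indices, the sketch below maps arbitrary entity names of one entity type to (partition, index) pairs. The round-robin assignment and the `assign_indices` helper are assumptions made for this example only.

```python
def assign_indices(names, num_partitions: int):
    """Map each entity name to (partition, index), with indices contiguous per partition."""
    next_index = [0] * num_partitions  # next free index in each partition
    ids = {}
    for position, name in enumerate(names):
        part = position % num_partitions  # hypothetical round-robin assignment
        ids[name] = (part, next_index[part])
        next_index[part] += 1
    return ids


ids = assign_indices(["a", "b", "c", "d", "e"], num_partitions=2)
```

Within each partition the indices come out as 0, 1, 2, ... with no gaps, which is the contiguity the text requires.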
Formally, each bucket can be identified by a pair of integers (𝑖,𝑗), where 𝑖 and 𝑗 are respectively the left- and right-hand side partitions. Inside that bucket, each edge can be identified by a triplet of integers (𝑥,𝑟,𝑦), with 𝑥 and 𝑦 representing the indices of the left- and right-hand side entities within their partitions and 𝑟 representing the relation type. This edge is "interpreted" by first looking up relation type 𝑟 in the configuration, and finding out that it can only have entities of type 𝑒1 on its left-hand side and of type 𝑒2 on its right-hand side. One can then determine the left-hand side entity, which is given by (𝑒1,𝑖,𝑥) (its type, its partition and its index within the partition), and, similarly, the right-hand side one, which is (𝑒2,𝑗,𝑦).
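The lookup described above is short enough to spell out. In this sketch the `RELATION_SIDES` table stands in for the configuration and is invented for illustration, and `Entity` and `resolve_edge` are hypothetical names. It does not treat unpartitioned entity types specially.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    entity_type: str
    partition: int
    index: int


# Stand-in for the configuration: relation id r -> (e1, e2). Values are made up.
RELATION_SIDES = {
    0: ("red", "yellow"),
    1: ("red", "blue"),
}


def resolve_edge(bucket, edge, relation_sides):
    """Turn bucket (i, j) and edge (x, r, y) into the two fully identified entities."""
    i, j = bucket
    x, r, y = edge
    e1, e2 = relation_sides[r]  # entity types are implied by the relation type
    return Entity(e1, i, x), Entity(e2, j, y)


lhs, rhs = resolve_edge(bucket=(2, 1), edge=(0, 0, 1), relation_sides=RELATION_SIDES)
```

Here `lhs` is entity 0 of partition 2 of the red type and `rhs` is entity 1 of partition 1 of the yellow type, matching the (𝑒1,𝑖,𝑥) and (𝑒2,𝑗,𝑦) forms in the text.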
Source:
https://torchbiggraph.readthedocs.io/en/latest/data_model.html
Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting.