PyTorch-BigGraph I/O Format

Entity and relation types

The list of entity types (each identified by a string), plus some information about each of them, is given in the entities dictionary in the configuration file. The list of relation types (each identified by its index in that list), plus some data such as their left- and right-hand side entity types, is given under the relations key of the configuration file.
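As a sketch of how those two keys fit together (the "user"/"item" entity types and the relation names below are invented for illustration, not part of any real dataset):

```python
# A hypothetical PBG config fragment; entity types "user"/"item" and the
# relations "bought"/"follows" are made up for illustration.
config = dict(
    entities={
        "user": {"num_partitions": 2},  # entity type "user", split into 2 partitions
        "item": {"num_partitions": 1},  # unpartitioned entity type
    },
    relations=[
        # relation type 0: edges go from a "user" to an "item"
        {"name": "bought", "lhs": "user", "rhs": "item", "operator": "none"},
        # relation type 1: user-to-user edges
        {"name": "follows", "lhs": "user", "rhs": "user", "operator": "translation"},
    ],
    entity_path="data/entities",
    edge_paths=["data/edges"],
    checkpoint_path="model/checkpoints",
)
```

Note that relation types are referenced by their index in the relations list (0, 1, …), which is what the rel dataset in the edge files stores.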

Entities

The only information that needs to be provided about entities is how many there are in each entity type's partition. This is done by putting a file named entity_count_type_part.txt, for each entity type type and each partition part, in the directory given by the entity_path config key. Each of these files must contain a single integer (as text): the number of entities in that partition.
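For instance, the count files can be produced with plain Python (the entity type "user", the directory, and the counts below are all hypothetical):

```python
import os

entity_path = "data/entities"          # value of the entity_path config key
os.makedirs(entity_path, exist_ok=True)

# Hypothetical entity type "user" with 2 partitions of 1000 and 998 entities.
counts = {("user", 0): 1000, ("user", 1): 998}

for (etype, part), count in counts.items():
    fname = os.path.join(entity_path, f"entity_count_{etype}_{part}.txt")
    with open(fname, "wt") as f:
        f.write(f"{count}")            # a single integer, written as text
```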

It is possible to provide an initial value for the embeddings by specifying the init_path configuration key: the name of a directory containing files in a format similar to the output format detailed in Checkpoint (possibly without the optimizer state dicts).

If no initial value is provided, it will be auto-generated, with each dimension sampled from a centered normal distribution whose standard deviation can be configured through the init_scale configuration key. For performance reasons, the samples for all the entities of a given type will not be independent.
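Conceptually, the auto-generated initialization is the following sketch (the init_scale value and shape are illustrative; as noted above, PBG itself does not draw fully independent samples per entity):

```python
import numpy as np

init_scale = 0.001            # standard deviation, from the init_scale config key
num_entities, dim = 1000, 400  # illustrative sizes

# Centered normal with std = init_scale. This is only the conceptual model:
# PBG shares samples across entities of a type for performance.
rng = np.random.default_rng(0)
embeddings = rng.normal(loc=0.0, scale=init_scale, size=(num_entities, dim))
```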

Edges

For each bucket there must be a file that stores all the edges, of all relation types, that fall in that bucket. Such a file is therefore identified by just two integers: the partitions of its left- and right-hand side entities. It must be named edges_lhs_rhs.h5 (where lhs and rhs are those integers), and it must be a HDF5 file containing three one-dimensional datasets of the same length, called rel, lhs and rhs. The elements at the 𝑖-th position of each of them define the 𝑖-th edge: rel identifies the relation type (and thus the left- and right-hand side entity types), while lhs and rhs give the indices of the left- and right-hand side entities within their respective partitions.
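As a sketch (bucket (0, 0) and the three edges below are invented), such a file can be written with h5py, including the format_version attribute described next:

```python
import h5py
import numpy as np

# Three made-up edges for bucket (0, 0); the file name encodes the partitions.
rel = np.array([0, 0, 1], dtype=np.int64)  # relation-type index per edge
lhs = np.array([0, 2, 1], dtype=np.int64)  # lhs entity index within its partition
rhs = np.array([1, 0, 2], dtype=np.int64)  # rhs entity index within its partition

with h5py.File("edges_0_0.h5", "w") as f:
    f.attrs["format_version"] = 1          # required on the top-level group
    f.create_dataset("rel", data=rel)
    f.create_dataset("lhs", data=lhs)
    f.create_dataset("rhs", data=rhs)
```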

To ease future updates to this format, each file must contain the format version in the format_version attribute of the top-level group. The current version is 1.

If an entity type is unpartitioned (that is, all its entities belong to the same partition), then the edges incident to these entities must still be spread uniformly across all buckets.

These files, for all buckets, must be stored in the same directory, which must be passed as the edge_paths configuration key. That key can in fact contain a list of paths, each pointing to a directory in the format described above: in that case the graph will contain the union of all their edges.

Checkpoint

The training's checkpoints are also its output, and they are written to the directory given as the checkpoint_path parameter in the configuration. Checkpoints are identified by successive positive integers, starting from 1, and all the files belonging to a given checkpoint have an extra .vversion component between their name and extension (e.g., something.v42.h5 for version 42).

The latest complete checkpoint version is stored in an additional file in the same directory, called checkpoint_version.txt, which contains a single integer: the current version.
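A sketch of how a consumer might resolve the latest checkpoint's files (the directory and version number are invented; the version file is first written so the snippet is self-contained):

```python
import os

checkpoint_path = "model/checkpoints"  # value of the checkpoint_path config key
os.makedirs(checkpoint_path, exist_ok=True)
version_file = os.path.join(checkpoint_path, "checkpoint_version.txt")

# Pretend training just completed checkpoint version 42.
with open(version_file, "wt") as f:
    f.write("42")

# A consumer reads the single integer and derives the versioned file names,
# inserting the .vversion component between name and extension.
with open(version_file, "rt") as f:
    version = int(f.read().strip())

model_file = os.path.join(checkpoint_path, f"model.v{version}.h5")
```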

Each checkpoint also contains a JSON dump of the config that was used to produce it, stored in the config.json file.

After a new checkpoint version is saved, the previous one is automatically deleted. To periodically preserve some of these versions, set the checkpoint_preservation_interval config flag to the desired period (expressed in number of epochs).

Model parameters

The model parameters are stored in a file named model.h5, a HDF5 file containing one dataset per parameter, all of which are located within the model group. Currently, the parameters that are provided are:
• model/relations/idx/operator/side/param, with the parameters of each relation's operator.
• model/entities/type/global_embedding, with the per-entity-type global embedding.

🤔 This seems to differ slightly from what is actually produced.

Each of these datasets also contains, in its state_dict_key attribute, the key under which it was stored in the model's state dict. An additional dataset may exist, optimizer/state_dict, which contains the binary blob (obtained through torch.save()) of the state dict of the model's optimizer.
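To make the layout concrete, here is a sketch that writes a tiny mock model.h5 mirroring the structure described above (the dataset paths, the state_dict_key value, and all parameter values are invented for illustration) and then lists its datasets:

```python
import h5py
import numpy as np

# Write a mock model.h5 with the documented layout (values are fake).
with h5py.File("model.h5", "w") as f:
    d = f.create_dataset("model/relations/0/operator/rhs/translation",
                         data=np.zeros(4, dtype=np.float32))
    d.attrs["state_dict_key"] = "operator.translation"  # illustrative key
    f.create_dataset("model/entities/user/global_embedding",
                     data=np.zeros(4, dtype=np.float32))

# Collect every dataset path under the "model" group.
paths = []
with h5py.File("model.h5", "r") as f:
    f["model"].visititems(
        lambda name, obj: paths.append(name)
        if isinstance(obj, h5py.Dataset) else None)
```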

🤔 An HDF5 format detail.

Finally, the top-level group of the file contains a few attributes with additional metadata. This mainly includes the format version, a JSON dump of the config, and some information about the iteration that produced the checkpoint.

Embeddings

Then, for each entity type and each of its partitions, there is a file embeddings_type_part.h5 (where type is the type's name and part is the 0-based index of the partition), which is a HDF5 file with two datasets. One two-dimensional dataset, called embeddings, contains the embeddings of the entities (in effect, the matrix of entity embeddings), with the first dimension being the number of entities and the second the dimension of the embedding.

Just like in the model parameters file, the optimizer state dict and additional metadata are also included.
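A sketch of reading the embeddings matrix back (the type name "user", partition 0, and the sizes are invented; the file is first mocked so the snippet is self-contained):

```python
import h5py
import numpy as np

# Mock an embeddings file for a hypothetical type "user", partition 0:
# 10 entities with 4-dimensional embeddings.
with h5py.File("embeddings_user_0.h5", "w") as f:
    f.create_dataset("embeddings",
                     data=np.random.default_rng(0).normal(size=(10, 4)))

# Read it back: rows index entities, columns index embedding dimensions.
with h5py.File("embeddings_user_0.h5", "r") as f:
    emb = f["embeddings"][...]

num_entities, dim = emb.shape
```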

Source: https://torchbiggraph.readthedocs.io/en/latest/input_output.html