PyTorch-BigGraph Batch Preparation

PBG Batch Preparation

This section presents how the training data is prepared and organized in batches before the loss is calculated and optimized on each of them.

Training proceeds by iterating over the edges, through various nested loops. The outermost one walks through so-called epochs. Each epoch is independent and essentially equivalent to every other one. Their goal is to repeat the inner loop until convergence. Each epoch visits all the edges exactly once. The number of epochs is specified in the num_epochs configuration parameter.

The edges are partitioned into edge sets (one for each directory of the edge_paths configuration key) and, within each epoch, the edge sets are traversed in order.
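To make this concrete, here is a hedged sketch (not PBG's actual code) of how the two configuration keys mentioned so far, num_epochs and edge_paths, drive the outer loops; the directory names and the train_on_edge_set helper are made up for illustration:

```python
# A hedged sketch, not PBG's actual code. The two configuration keys shown are
# the ones discussed above; directory names are made up for illustration.
config_fragment = {
    "num_epochs": 10,            # each epoch visits every edge exactly once
    "edge_paths": [              # one edge set per directory, traversed in order
        "data/edges_part1",
        "data/edges_part2",
    ],
}

def train_on_edge_set(edge_set_path):
    """Hypothetical placeholder for the per-edge-set work described below."""
    print(f"training on edges from {edge_set_path}")

# Illustrative outer loops: epochs on the outside, edge sets traversed in order inside.
for epoch in range(config_fragment["num_epochs"]):
    for edge_set_path in config_fragment["edge_paths"]:
        train_on_edge_set(edge_set_path)
```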

When iterating over an edge set, each of its buckets is first divided into equally sized chunks: each chunk spans a contiguous interval of edges (in the order they are stored in the files) and the number of chunks can be tweaked using the num_edge_chunks configuration key. The training first operates on all the first chunks of all buckets, then on all of their second chunks, and so on.
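A minimal sketch of the chunking step, assuming a bucket's edges are simply a list kept in on-disk order; split_into_chunks is illustrative, not PBG's implementation:

```python
# Hedged sketch: split a bucket's edges into num_edge_chunks contiguous,
# (nearly) equally sized chunks, keeping the order they are stored in on disk.
def split_into_chunks(edges, num_edge_chunks):
    chunk_size = -(-len(edges) // num_edge_chunks)  # ceiling division
    return [edges[i : i + chunk_size] for i in range(0, len(edges), chunk_size)]

# Training then visits chunk 0 of every bucket, then chunk 1 of every bucket, etc.
buckets = {(0, 0): list(range(10)), (0, 1): list(range(10, 23))}
chunked = {b: split_into_chunks(edges, num_edge_chunks=3) for b, edges in buckets.items()}
for chunk_idx in range(3):
    for bucket, chunks in chunked.items():
        print(bucket, "chunk", chunk_idx, "->", chunks[chunk_idx])
```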

Next, the algorithm iterates over the buckets. The order in which buckets are processed depends on the value of the bucket_order configuration key. In addition to a random permutation, there are methods that try to have successive buckets share a common partition: this allows for that partition to be reused, thus allowing it to be kept in memory rather than being unloaded and another one getting loaded in its place. (In distributed mode, the various trainer processes operate on the buckets at the same time, thus the iteration is managed differently).
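The "share a common partition" idea can be sketched as a greedy ordering like the one below (purely illustrative; the strategies actually available are the values accepted by the bucket_order key, listed in the PBG configuration reference):

```python
import random

# Hedged sketch of an "affinity-style" bucket order (illustrative, not PBG's
# implementation): greedily pick a next bucket sharing a partition with the
# previous one, so that partition's embeddings can stay loaded in memory.
def affinity_order(buckets, seed=0):
    rng = random.Random(seed)
    remaining = list(buckets)
    rng.shuffle(remaining)
    order = [remaining.pop()]
    while remaining:
        last = set(order[-1])
        idx = next(
            (i for i, b in enumerate(remaining) if last & set(b)),
            0,  # fall back to an arbitrary bucket if none shares a partition
        )
        order.append(remaining.pop(idx))
    return order

print(affinity_order([(0, 0), (0, 1), (1, 0), (1, 1)]))
```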

Once the trainer has fixed a given chunk and a certain bucket, its edges are finally loaded from disk. When evaluating during training, a subset of these edges is withheld (such subset is the same for all epochs). The remaining edges are immediately uniformly shuffled and then split into equal parts. These parts are distributed among a pool of processes, so that the training can proceed in parallel on all of them at the same time. These subprocesses are “Hogwild!” workers, which do not synchronize their computations or memory accesses. The number of such workers is determined by the workers parameter.
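A hedged sketch of what happens to the edges of one (chunk, bucket) pair once loaded, assuming a held-out fraction along the lines of an eval_fraction parameter (check the PBG configuration reference for the exact key); numpy is used purely for illustration:

```python
import numpy as np

# Hedged sketch of what happens to the edges of one (chunk, bucket) pair:
# withhold an evaluation subset (held fixed across epochs), uniformly shuffle
# the rest, and split it evenly among the Hogwild! workers.
def prepare_for_workers(edges, workers, eval_fraction=0.05, seed=0):
    rng = np.random.default_rng(seed)
    edges = np.asarray(edges)
    n_eval = int(len(edges) * eval_fraction)
    eval_edges, train_edges = edges[:n_eval], edges[n_eval:]  # deterministic hold-out
    train_edges = rng.permutation(train_edges)                # uniform shuffle
    return eval_edges, np.array_split(train_edges, workers)   # one share per worker

eval_edges, shares = prepare_for_workers(list(range(1000)), workers=4)
print(len(eval_edges), [len(s) for s in shares])
```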

key point: a worker here can be thought of as a process; the number of workers is the number of processes that train on the data in parallel at the same time.

The way each worker trains on its set of edges depends on whether dynamic relations are in use. The simplest scenario is if they are, in which case the edges are split into contiguous batches (each one having the size specified in the batch_size configuration key, except possibly the last one which could be smaller). Training is then performed on that batch before moving on to the next one.
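With dynamic relations, batching is just a contiguous split; a minimal sketch (batch_size corresponds to the configuration key, and the printed batches stand in for the actual loss computation and optimizer step):

```python
# Hedged sketch: with dynamic relations, a worker simply walks over its edges
# in contiguous batches of batch_size (the last batch may be smaller).
def iter_batches(edges, batch_size):
    for start in range(0, len(edges), batch_size):
        yield edges[start : start + batch_size]

for batch in iter_batches(list(range(10)), batch_size=4):
    print(batch)  # [0, 1, 2, 3], [4, 5, 6, 7], [8, 9]
```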

When dynamic relations are not in use, however, the loss can only be computed on a set of edges that are all of the same type. Thus the worker first randomly samples a relation type, with probability proportional to the number of edges of that type that are left in the pool. It then takes the first batch_size edges of that type (or fewer, if not enough of them are left), removes them from the pool and performs training on them.
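Without dynamic relations, every batch must be homogeneous in relation type; a hedged sketch of the sampling described above, where the pool maps each relation type to the edges of that type still left in the worker's share:

```python
import random

# Hedged sketch: pick a relation type with probability proportional to how many
# of its edges are left, then take (and remove) up to batch_size of those edges.
def next_homogeneous_batch(pool, batch_size, rng=random):
    types = [t for t, edges in pool.items() if edges]
    if not types:
        return None, None
    weights = [len(pool[t]) for t in types]
    rel_type = rng.choices(types, weights=weights, k=1)[0]
    batch, pool[rel_type] = pool[rel_type][:batch_size], pool[rel_type][batch_size:]
    return rel_type, batch

pool = {"likes": list(range(7)), "follows": list(range(3))}
while True:
    rel_type, batch = next_homogeneous_batch(pool, batch_size=4)
    if rel_type is None:
        break
    print(rel_type, batch)
```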

Source: https://torchbiggraph.readthedocs.io/en/latest/batch_preparation.html

