PyTorch-BigGraph Overview


PyTorch-BigGraph (PBG) is a distributed system for learning graph embeddings for large graphs, particularly big web interaction graphs with up to billions of entities and trillions of edges. PBG now supports GPU training.

key point: PBG, distributed, big graph, embedding, GPU

PBG was introduced in the paper PyTorch-BigGraph: A Large-scale Graph Embedding System, presented at the SysML conference in 2019.
Paper: https://mlsys.org/Conferences/2019/doc/2019/71.pdf

PBG trains on an input graph by ingesting its list of edges, each identified by its source and target entities and, possibly, a relation type. It outputs a feature vector (embedding) for each entity, trying to place adjacent entities close to each other in the vector space, while pushing unconnected entities apart. Therefore, entities that have a similar distribution of neighbors will end up being nearby.
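As a concrete illustration of the input described above, here is a minimal sketch of a tab-separated edge list in source/relation/target form, the kind of triple list PBG's TSV importer consumes (the entity and relation names are invented for the example):

```python
import csv
import io

# Hypothetical tab-separated edge list: source entity, relation type, target entity.
raw = (
    "alice\tworks_at\tacme\n"
    "bob\tworks_at\tacme\n"
    "acme\tbased_in\tberlin\n"
)

# Parse each line into a (source, relation, target) triple.
edges = [tuple(row) for row in csv.reader(io.StringIO(raw), delimiter="\t")]

# The set of entities that will each receive an embedding vector.
entities = sorted({e for s, _, t in edges for e in (s, t)})
```

After training, each name in `entities` maps to one feature vector, and `alice`/`bob` (which share a neighbor) end up close together.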

key point: distances between entity vectors reflect the strength of adjacency in the original graph.

It is possible to configure each relation type to calculate this “proximity score” in a different way, with the parameters (if any) learned during training. This allows the same underlying entity embeddings to be shared among multiple relation types.
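To illustrate how shared entity embeddings combine with per-relation parameters, here is a hypothetical sketch (the entity names, relation names, and operator choices are invented; PBG's actual operators and comparators are chosen in its configuration):

```python
import numpy as np

dim = 8

# One embedding per entity, shared across all relation types.
entity_emb = {
    "alice": np.ones(dim),
    "acme": np.full(dim, 0.5),
}

# Hypothetical learned per-relation parameters: each relation transforms
# the source embedding differently before the proximity score is computed.
relation_ops = {
    "works_at": ("translation", np.full(dim, 0.1)),  # add a learned vector
    "founded": ("diagonal", np.full(dim, 2.0)),      # scale by a learned vector
}

def proximity(src, rel, dst):
    """Score one edge: transform the source by the relation's operator,
    then compare with the target via a dot product."""
    kind, param = relation_ops[rel]
    h = entity_emb[src]
    h = h + param if kind == "translation" else h * param
    return float(h @ entity_emb[dst])
```

The same `entity_emb` vectors are reused for both relations; only the small per-relation parameters differ, which is what lets embeddings be shared across relation types.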

The generality and extensibility of its model allows PBG to train a number of models from the knowledge graph embedding literature, including TransE, RESCAL, DistMult and ComplEx.

key point: embedding algorithm: TransE, RESCAL, DistMult, ComplEx
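The scoring functions of these four models can be sketched in a few lines (an illustrative NumPy sketch; sign and normalization conventions vary across implementations):

```python
import numpy as np

def transe(h, r, t):
    # TransE: plausible triples satisfy h + r ≈ t; score is negative distance.
    return -float(np.linalg.norm(h + r - t))

def rescal(h, R, t):
    # RESCAL: bilinear form with a full matrix per relation.
    return float(h @ R @ t)

def distmult(h, r, t):
    # DistMult: trilinear dot product; symmetric in h and t.
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    # ComplEx: real part of a Hermitian trilinear product over complex
    # vectors; the conjugate makes asymmetric relations representable.
    return float(np.real(np.sum(h * r * np.conj(t))))
```

Note the trade-off visible in the code: DistMult cannot distinguish the direction of an edge, while ComplEx can, at the cost of complex-valued embeddings.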

PBG is designed with scale in mind, and achieves it through:

1. graph partitioning, so that the model does not have to be fully loaded into memory
2. multi-threaded computation on each machine
3. distributed execution across multiple machines (optional), all simultaneously operating on disjoint parts of the graph
4. batched negative sampling, allowing for processing >1 million edges/sec/machine with 100 negatives per edge


key point: partition, distributed, batched negative sampling
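The batched negative sampling idea above can be illustrated as follows (a hypothetical NumPy sketch, not PBG's actual code): rather than sampling 100 fresh negatives for every edge, one shared pool of negatives is scored against all positives in the batch with a single matrix multiply, which is what makes the per-edge cost so low.

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, dim, batch_size, num_neg = 1000, 16, 128, 100

# Toy embedding table and a batch of positive edges.
emb = rng.normal(size=(num_entities, dim))
src = rng.integers(0, num_entities, size=batch_size)
dst = rng.integers(0, num_entities, size=batch_size)

# One shared pool of sampled negative targets for the whole batch.
neg = rng.integers(0, num_entities, size=num_neg)

# Positive scores: one dot product per edge.
pos_scores = np.sum(emb[src] * emb[dst], axis=1)   # shape: (batch_size,)

# Negative scores: every source against every shared negative,
# computed as a single (batch_size, dim) x (dim, num_neg) matmul
# instead of batch_size * num_neg independent dot products.
neg_scores = emb[src] @ emb[neg].T                  # shape: (batch_size, num_neg)
```

A loss such as margin ranking or softmax is then applied to `pos_scores` versus `neg_scores` row by row.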

PBG is not optimized for small graphs. If your graph has fewer than 100,000 nodes, consider using KBC with the ComplEx model and N3 regularizer. KBC produces state-of-the-art embeddings for graphs that can fit on a single GPU. Compared to KBC, PyTorch-BigGraph enables learning on very large graphs whose embeddings wouldn’t fit in a single GPU or a single machine, but may not produce high-quality embeddings for small graphs without careful tuning.

Source:
https://github.com/facebookresearch/PyTorch-BigGraph


Unless otherwise stated, all posts on this blog are released under the CC BY-SA 4.0 license. Please credit the source when reposting!