知识图谱实体解析

为什么需要实体解析？

Ambiguity: Entities with the same name
Variance: Different names for the same entity

这里我们使用record/mention来表示数字世界中的一个实体

什么是实体解析？

The task of disambiguating records that correspond to real world entities across and within datasets.

实体解析的四个方面

Coreference: Unifying mentions of the same entity in a single document (text->text)
Entity Linking: Mapping mentions from text to the canonical entity in a KG (text->KG)
Deduplication: Unifying mentions within a KG into a single view (KG->KG)
Record Linkage: Matching enentites across two different KGs (KG1 -> KG2)

Coreference

举个例子，一篇文档中有许多的代词，需要将这些代词与之前提到的名词进行指代

Entity Linking

举个例子，一篇文档出现诸多实体，需要将这些实体与已有的知识图谱中的实体进行关联

Deduplication

举个例子，一个知识图谱中存在的实体可能是相同或者很相似的，这个时候可以对其进行聚类，使得一类的mention表示同一个实体。

Record Linkage

举个例子，对多个不同的知识图谱之间的相同的实体进行关联。

如何进行实体解析？

Assumption:
Each record/mention is associated with a single real world entity.
If two records/mentions are identical, then they are true matches.

Clustering

从一系列records/mentions中计算实体是一个聚类问题
ER vs (Multi-relational) Clustering

In typical clustering algorithms (k-means, LDA, etc.) number of clusters is a constant or sublinear in R.
In ER: number of clusters is linear in R, and average cluster size is a constant. Significant fraction of clusters are singletons.

Classification

对一对实体确定匹配或非匹配是一个分类问题

Imbalanced: typically O(R) matches, O(R^2) non-matches

实体解析相关模型

Blocking

目的: 为了减少匹配次数
Naive pairwise: |R|2 pairwise comparisons
example:1000 business listings each from 1,000 different cities across the
world => 1M listings in total => 1 trillion comparisons
If we use Blocking Criterion: City => 1 billion comparisons

Problem Statement:
Input: Set of records R
Output: Set of blocks/canopies

Variants:
Disjoint Blocking
Non-disjoint Blocking

Evalution Metrics:
Efficiency = num of pairs compared / total number of pairs in RxR
Recall = num of true matches compared / num of true matches in RxR
Precision = um of true matches compared / num of matches compared
Max canopy size: the size of the largest block

Blocking方式:
Blocking predicates (keys), e.g. First three characters of last name, City + State + Zip ;
该方式存在的问题: Using one or more blocking predicates may be insufficient, 也就是说根据key来blocking的话，很有可能出现两种结果，1是有些key下的block还是有很多的record, 2是NULL values may create large blocks.
解决方式： Construct blocking predicates by combining simple predicates

Matching

Use pairwise predictor to resolve co-referent mentions

Problem Statement: Given a vector of component-wise similarities for a pair of records (x,y), compute P(x and y match)

Type I Error => False Positive
Type II Error => False Negative

ML Pairwise Approaches:
Supervised machine learning algorithms:
Decision trees
Support vector machines
Ensembles of classifiers
Conditional Random Fields (CRF)
…

存在的问题:
Training set generation
Imbalanced classes – many more negatives than positives
Misclassification cost

Validation

Apply constraints like 1-1 matching, identify canonical form

Important forms of constraints:
• Exclusivity: If M1 matches with M2, then M3 cannot match with M2
• Transitivity: If M1 and M2 match, M2 and M3 match, then M1 and M3 match
• Functional Dependency: If M1 and M2 match, then M3 and M4 must match

String matching problem

Problem Statement:
Given X and Y sets of strings, Find pairs (x, y) such that both x and y refer to the same real world entity

为什么字符串不能完美匹配？

typos, OCR errors, formatting conventions, abbreviations, nick names, word order

评价字符串相似的指标

Sequence based Measures

Edit Distance/Levenshtein Distance
lev(x, y) is the minimum cost to transform x to y
Online calculator: http://planetcalc.com/1721/

Set-Based Measures

方式: Generate set of tokens from the strings and measure similarity between the sets of tokens

tokenize a string => “david smith” 3-grams

Jaccard Measure
TF/IDF Measure: distinguishing term should carry more weight

Hybrid Measures

Do the set-based thing but use a similiarity metric for each element of the set

Phonetic Similarity Measures

Knowledge Graph

本博客所有文章除特别声明外，均采用 CC BY-SA 4.0 协议，转载请注明出处！

知识图谱查询 Previous

知识图谱表示 Next