知识图谱实体解析

为什么需要实体解析?

Ambiguity: Entities with the same name
Variance: Different names for the same entity

这里我们使用record/mention来表示数字世界中的一个实体

什么是实体解析?

The task of disambiguating records that correspond to real world entities across and within datasets.

实体解析的四个方面

Coreference: Unifying mentions of the same entity in a single document (text->text)
Entity Linking: Mapping mentions from text to the canonical entity in a KG (text->KG)
Deduplication: Unifying mentions within a KG into a single view (KG->KG)
Record Linkage: Matching enentites across two different KGs (KG1 -> KG2)

Coreference

举个例子,一篇文档中有许多的代词,需要将这些代词与之前提到的名词进行指代

Entity Linking

举个例子,一篇文档出现诸多实体,需要将这些实体与已有的知识图谱中的实体进行关联

Deduplication

举个例子,一个知识图谱中存在的实体可能是相同或者很相似的,这个时候可以对其进行聚类,使得一类的mention表示同一个实体。

Record Linkage

举个例子,对多个不同的知识图谱之间的相同的实体进行关联。

如何进行实体解析?

Assumption:
Each record/mention is associated with a single real world entity.
If two records/mentions are identical, then they are true matches.

Clustering

从一系列records/mentions中计算实体是一个聚类问题
ER vs (Multi-relational) Clustering

  • In typical clustering algorithms (k-means, LDA, etc.) number of clusters is a constant or sublinear in R.
  • In ER: number of clusters is linear in R, and average cluster size is a constant. Significant fraction of clusters are singletons.

Classification

对一对实体确定匹配或非匹配是一个分类问题

Imbalanced: typically O(R) matches, O(R^2) non-matches

实体解析相关模型

Blocking

目的: 为了减少匹配次数
Naive pairwise: |R|2 pairwise comparisons
example:1000 business listings each from 1,000 different cities across the
world => 1M listings in total => 1 trillion comparisons
If we use Blocking Criterion: City => 1 billion comparisons

Problem Statement:
Input: Set of records R
Output: Set of blocks/canopies

Variants:
Disjoint Blocking
Non-disjoint Blocking

Evalution Metrics:
Efficiency = num of pairs compared / total number of pairs in RxR
Recall = num of true matches compared / num of true matches in RxR
Precision = um of true matches compared / num of matches compared
Max canopy size: the size of the largest block

Blocking方式:
Blocking predicates (keys), e.g. First three characters of last name, City + State + Zip ;
该方式存在的问题: Using one or more blocking predicates may be insufficient, 也就是说根据key来blocking的话,很有可能出现两种结果,1是有些key下的block还是有很多的record, 2是NULL values may create large blocks.
解决方式: Construct blocking predicates by combining simple predicates

Matching

Use pairwise predictor to resolve co-referent mentions

Problem Statement: Given a vector of component-wise similarities for a pair of records (x,y), compute P(x and y match)

Type I Error => False Positive
Type II Error => False Negative

ML Pairwise Approaches:
Supervised machine learning algorithms:
Decision trees
Support vector machines
Ensembles of classifiers
Conditional Random Fields (CRF)

存在的问题:
Training set generation
Imbalanced classes – many more negatives than positives
Misclassification cost

Validation

Apply constraints like 1-1 matching, identify canonical form

Important forms of constraints:
• Exclusivity: If M1 matches with M2, then M3 cannot match with M2
• Transitivity: If M1 and M2 match, M2 and M3 match, then M1 and M3 match
• Functional Dependency: If M1 and M2 match, then M3 and M4 must match

String matching problem

Problem Statement:
Given X and Y sets of strings, Find pairs (x, y) such that both x and y refer to the same real world entity

为什么字符串不能完美匹配?

typos, OCR errors, formatting conventions, abbreviations, nick names, word order

评价字符串相似的指标

Sequence based Measures

Set-Based Measures

方式: Generate set of tokens from the strings and measure similarity between the sets of tokens

tokenize a string => “david smith” 3-grams

  • Jaccard Measure
  • TF/IDF Measure: distinguishing term should carry more weight

Hybrid Measures

Do the set-based thing but use a similiarity metric for each element of the set

Phonetic Similarity Measures


本博客所有文章除特别声明外,均采用 CC BY-SA 4.0 协议 ,转载请注明出处!