Blocking and Relational Entity Resolution

前情提要

Entity Resolution 主要分成如下四个

  1. Coreference 文本与文本之间找同一个entity
  2. Entity linking 文本与KG中找对应entity -> Integrating New Candidates
  3. Deduplication 一个KG之间的聚类 -> Merging  Ambiguous Entities
  4. Record linkage 两个不同KG之间entity的对应 -> Combining KGs

Blocking

为什么需要Blocking -> reduce the number of comparisons

  • Comparing each entity with all other entities is too computationally demanding –> O(N^2)
  • If partition entities into N ”blocks”– O(N)
  • Make only within block comparisons, so if largest block is log N in size –> O(NlogN^2)

Blocking 分类

Disjoint Blocking: Each mention appears in one block.(=Set Partition)
Non-disjoint Blocking: Mentions can appear in more than one block.

Blocking一般的情形

Blocking衡量指标

  • Efficiency: Blue/Grey
  • Recall: Green/Yellow
  • Precision: Green/Blue
  • Max Canopy Size: 包含mentions个数最多的block的mentions个数

Blocking方式

Feature-based blocking keys

思想: 通过选择实体的某一个或者多个属性作为key,将包含该key的实体放在同一个block下,对每个block再进行entity resolution

例子:
First three characters of last name
City + State + Zip
Character or Token n-grams
Minimum infrequent n-grams

Clustering or sorting

  1. Sorted Neighborhood Blocking
    思想: 通过选择实体的某一个属性,根据该属性对实体进行排序,使用一个窗格,窗格内的实体划分到一个block中去

  2. Canopy Clustering
    Input: Mentions M, x is an entity
    d(x,y), a distance metric
    thresholds T1 > T2

思想:

  1. Pick a random element x from M
  2. Create new canopy Cx using mentions y s.t. d(x,y) < T1
  3. Delete all mentions y from M s.t. d(x,y) < T2
  4. Return to Step 1 if M is not empty

Hashing

思想:

  1. Each block Ci is associated with a hash key hi.
  2. Mention x is hashed to Ci if hash(x) = hi.
  3. Within a block, all pairs are compared.
  4. Each hash function results in disjoint blocks.

Blocking选择考虑因素

  • key的选择:learn the keys, or use expert knowledge/heuristics?
  • Schema awareness: what do we know about the attributes?
  • Key type: exact equality, similarity-based, or hybrid 相同的key放到一个block还是相似放一个block
  • Redundancy: entity in one or multiple blocks? Does matching in multiple blocks increase the match probability
  • Frequency limits
  • Adaptive keys based on frequency
  • Learning keys based on data

Learning to block

Using one or more blocking predicates may be insufficient => Construct blocking predicates by combining simple predicates

Collective Relational Entity Resolution

策略: Using PSL for collective KG ER

  1. Encode ER dependencies in a set of rules
  2. Use soft-logic values to capture similarities
  3. Use logic to capture the constraints

本博客所有文章除特别声明外,均采用 CC BY-SA 4.0 协议 ,转载请注明出处!