知识图谱实体解析
为什么需要实体解析?
Ambiguity: Entities with the same name
Variance: Different names for the same entity
这里我们使用record/mention来表示数字世界中的一个实体
什么是实体解析?
The task of disambiguating records that correspond to real world entities across and within datasets.
实体解析的四个方面
Coreference: Unifying mentions of the same entity in a single document (text->text)
Entity Linking: Mapping mentions from text to the canonical entity in a KG (text->KG)
Deduplication: Unifying mentions within a KG into a single view (KG->KG)
Record Linkage: Matching enentites across two different KGs (KG1 -> KG2)
Coreference
举个例子,一篇文档中有许多的代词,需要将这些代词与之前提到的名词进行指代
Entity Linking
举个例子,一篇文档出现诸多实体,需要将这些实体与已有的知识图谱中的实体进行关联
Deduplication
举个例子,一个知识图谱中存在的实体可能是相同或者很相似的,这个时候可以对其进行聚类,使得一类的mention表示同一个实体。
Record Linkage
举个例子,对多个不同的知识图谱之间的相同的实体进行关联。
如何进行实体解析?
Assumption:
Each record/mention is associated with a single real world entity.
If two records/mentions are identical, then they are true matches.
Clustering
从一系列records/mentions中计算实体是一个聚类问题
ER vs (Multi-relational) Clustering
- In typical clustering algorithms (k-means, LDA, etc.) number of clusters is a constant or sublinear in R.
- In ER: number of clusters is linear in R, and average cluster size is a constant. Significant fraction of clusters are singletons.
Classification
对一对实体确定匹配或非匹配是一个分类问题
Imbalanced: typically O(R) matches, O(R^2) non-matches
实体解析相关模型
Blocking
目的: 为了减少匹配次数
Naive pairwise: |R|2 pairwise comparisons
example:1000 business listings each from 1,000 different cities across the
world => 1M listings in total => 1 trillion comparisons
If we use Blocking Criterion: City => 1 billion comparisons
Problem Statement:
Input: Set of records R
Output: Set of blocks/canopies
Variants:
Disjoint Blocking
Non-disjoint Blocking
Evalution Metrics:
Efficiency = num of pairs compared / total number of pairs in RxR
Recall = num of true matches compared / num of true matches in RxR
Precision = um of true matches compared / num of matches compared
Max canopy size: the size of the largest block
Blocking方式:
Blocking predicates (keys), e.g. First three characters of last name, City + State + Zip ;
该方式存在的问题: Using one or more blocking predicates may be insufficient, 也就是说根据key来blocking的话,很有可能出现两种结果,1是有些key下的block还是有很多的record, 2是NULL values may create large blocks.
解决方式: Construct blocking predicates by combining simple predicates
Matching
Use pairwise predictor to resolve co-referent mentions
Problem Statement: Given a vector of component-wise similarities for a pair of records (x,y), compute P(x and y match)
Type I Error => False Positive
Type II Error => False Negative
ML Pairwise Approaches:
Supervised machine learning algorithms:
Decision trees
Support vector machines
Ensembles of classifiers
Conditional Random Fields (CRF)
…
存在的问题:
Training set generation
Imbalanced classes – many more negatives than positives
Misclassification cost
Validation
Apply constraints like 1-1 matching, identify canonical form
Important forms of constraints:
• Exclusivity: If M1 matches with M2, then M3 cannot match with M2
• Transitivity: If M1 and M2 match, M2 and M3 match, then M1 and M3 match
• Functional Dependency: If M1 and M2 match, then M3 and M4 must match
String matching problem
Problem Statement:
Given X and Y sets of strings, Find pairs (x, y) such that both x and y refer to the same real world entity
为什么字符串不能完美匹配?
typos, OCR errors, formatting conventions, abbreviations, nick names, word order
评价字符串相似的指标
Sequence based Measures
- Edit Distance/Levenshtein Distance
lev(x, y) is the minimum cost to transform x to y
Online calculator: http://planetcalc.com/1721/
Set-Based Measures
方式: Generate set of tokens from the strings and measure similarity between the sets of tokens
tokenize a string => “david smith” 3-grams
- Jaccard Measure
- TF/IDF Measure: distinguishing term should carry more weight
Hybrid Measures
Do the set-based thing but use a similiarity metric for each element of the set
Phonetic Similarity Measures
本博客所有文章除特别声明外,均采用 CC BY-SA 4.0 协议 ,转载请注明出处!