知识图谱字符串匹配
字符串匹配问题
Problem Statement:
Given X and Y sets of strings, Find pairs (x, y) such that both x and y refer to the same real world entity.
Then use precision(fraction of pairs found that are correct) and recall(fraction of pairs found) to evaluate algorithms
Why can’t match perfectly
Typos “Joh” vs. “Jhon”
OCR errors “J0hn” vs “John”
Formatting conventions ”03/17” vs “March 17”
Abbreviations “J. S. Sargent” vs “John Singer Sargent”
Nick name ”John” vs “Jock”
Word order “Sargent, John S.” vs “John S. Sargent”
因此,转而计算字符串的相似度
Similiarity Measure
Similarity(x, y) in [0, 1] is better than distance(x, y) in [0, ∞)
Types of Similarity Metrics:
- Sequence based
- Set based
- Hybrid
- Phonetic
Sequence Based Metrics
Edit Distance/Levenshtein Distance
Online calculator: http://planetcalc.com/1721/
lev(x, y) is the minimum cost to transform x to y(insert,delete,substitute)
原理: 动态规划 d(i,j)
case1: xi = yj => d(i,j) = d(i-1,j-1)
case2: xi != yj => d(i,j) = min( d(i-1,j) + 1 , d(i,j-1) + 1 , d(i-1,j-1) + 1)
缺陷:
Too high a cost for deleting a sequence of characters, so it is not suitable for abbreviation cases (example: lev(John Singer Sargent,John S. Sargent ) = 5, lev(John Singer Sargent,John Klinger Sargent ) = 5 )
Needleman-Wunch Measure
Generalization of levenstein(x, y), 为了解决lev带来的缩写问题的一种新的度量方式
得到的结果是两个string相似的评分,而不是距离
缺陷:
Longer gaps are penalized more, bad for names (example:nw(John Singer Sargent,John S. Sargent) = 25, nw(John Stanislaus Sargent,John S. Sargent)
Affine Gap Measure
Generalization of needleman-wunch(x, y),为了解决Needleman-Wunch带来的长gap问题
Intuition: two small gaps are worse than one large one
Smith-Waterman
可以说是Needleman-Wunch的local版本
Needleman-Wunch: Fully align sequences, opening gaps as needed,也就是说两边空格都被考虑进去了的
Jaro Similarity Measure
Get points for having characters in common
• but only if they are “close by”
Get points for common characters in the same order
• lose points for transpositions
???
Jaro-Winkler Measure
????
Set-Based Metrics
Generate set of tokens from the strings and then measure similarity between the sets of tokens
Jaccard Measure
TF-IDF Measure
Hybrid Similarity Measures
Do the set-based thing but use a similiarity metric for each element of the set
Generalized Jaccard Measure
The Soft TF/IDF Measure
Monge-Elkan Measure
Phonetic Similarity
Soundex
本博客所有文章除特别声明外,均采用 CC BY-SA 4.0 协议 ,转载请注明出处!