JavaScript fuzzy search that makes sense

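I'm looking for a fuzzy search JavaScript library to filter an array. I've tried Fuzzyset.js and fuse.js, but the results are terrible (you can try some of the demos on the linked pages).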

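After doing some reading on Levenshtein distance, it strikes me as a poor approximation of what users are looking for when they type. For those who don't know, it counts how many insertions, deletions, and substitutions are needed to make two strings match.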

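One obvious flaw, which the Damerau-Levenshtein model fixes, is that blub and boob are both considered equally similar to bulb (each requires two substitutions). It is clear, however, that bulb is more similar to blub than to boob, and the model I just mentioned recognizes that by allowing transpositions.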

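I want to use this in the context of text completion, so if I have an array ['international', 'splint', 'tinder'] and my query is int, I think international ought to rank higher than splint, even though the former has a score (higher = worse) of 10 versus the latter's 3.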
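To make the numbers above concrete, here is a minimal sketch of the plain Levenshtein distance with an optional adjacent-transposition step (the "optimal string alignment" variant of Damerau-Levenshtein). The function name `editDistance` and the `allowTranspose` flag are illustrative, not taken from any library.

```javascript
// Edit distance by dynamic programming. With allowTranspose = true this is the
// optimal string alignment variant, where swapping two adjacent characters costs 1.
function editDistance(a, b, allowTranspose = false) {
  // d[i][j] = distance between the first i chars of a and the first j chars of b
  const d = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) d[i][0] = i;
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution (or match)
      );
      if (allowTranspose && i > 1 && j > 1 &&
          a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // adjacent transposition
      }
    }
  }
  return d[a.length][b.length];
}

console.log(editDistance('bulb', 'blub'));         // 2
console.log(editDistance('bulb', 'boob'));         // 2
console.log(editDistance('bulb', 'blub', true));   // 1
console.log(editDistance('bulb', 'boob', true));   // 2
console.log(editDistance('int', 'international')); // 10
console.log(editDistance('int', 'splint'));        // 3
```

The last two lines show the completion problem: a pure edit distance charges international for every character the user has not typed yet.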

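So what I'm looking for (and will create if it doesn't exist) is a library that does the following:

Weights the different text manipulations
Weights each manipulation differently depending on where in the word it appears (early manipulations being more costly than late ones)
Returns a list of results sorted by relevance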
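One way to picture these three requirements is an edit distance whose operation costs shrink with the position in the candidate word. This is only a sketch: the weighting function `weight` (1 / (1 + position)) and the names `positionalDistance` and `rank` are hypothetical choices for illustration, not an established metric.

```javascript
// Hypothetical position weight: operations near the start of the candidate cost more.
const weight = (pos) => 1 / (1 + pos);

// Edit distance in which every operation is scaled by where it lands in the candidate.
function positionalDistance(query, candidate) {
  const n = query.length;
  const m = candidate.length;
  const d = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + weight(j - 1); // insertions only
  for (let i = 1; i <= n; i++) {
    d[i][0] = d[i - 1][0] + weight(0); // deletions before the candidate starts
    for (let j = 1; j <= m; j++) {
      const sub = query[i - 1] === candidate[j - 1] ? 0 : weight(j - 1);
      d[i][j] = Math.min(
        d[i - 1][j] + weight(j),      // delete a query character
        d[i][j - 1] + weight(j - 1),  // insert a candidate character
        d[i - 1][j - 1] + sub         // match or substitute
      );
    }
  }
  return d[n][m];
}

// Return candidates sorted by ascending distance (most relevant first).
function rank(query, candidates) {
  return candidates
    .map((c) => ({ candidate: c, distance: positionalDistance(query, c) }))
    .sort((a, b) => a.distance - b.distance);
}

console.log(rank('int', ['international', 'splint', 'tinder']).map((r) => r.candidate));
// [ 'international', 'tinder', 'splint' ]
```

With this weighting, the ten late insertions needed for international cost less than the three early insertions needed for splint, which is the ordering asked for above.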

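Has anyone come across anything like this? I realize that StackOverflow isn't the place to ask for software recommendations, but implicit (not any more!) in the above is: am I thinking about this the right way?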

Edit

I found a good paper (pdf) on the subject. Some notes and excerpts:

Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions

the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters

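For the Smith-Waterman distance (Wikipedia), "Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure." It's the n-gram approach.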

A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings.

A variant of this due to Winkler (1999) also uses the length P of the longest common prefix

(seem to be intended primarily for short strings)

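For text completion purposes, the Monge-Elkan and Jaro-Winkler approaches seem to make the most sense. Winkler's addition to the Jaro metric effectively weights the beginnings of words more heavily. And the affine aspect of Monge-Elkan means that the need to complete a word (which is simply a sequence of insertions) won't penalize it too heavily.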
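As a rough illustration of how the Winkler prefix bonus favors completions, here is a sketch of Jaro and Jaro-Winkler similarity. It follows the textbook definitions; the prefix scale of 0.1 and the prefix cap of 4 are the conventional defaults, assumed here rather than taken from the paper excerpts above.

```javascript
// Jaro similarity: based on the number and order of characters the strings share.
function jaro(s1, s2) {
  if (s1 === s2) return 1;
  if (s1.length === 0 || s2.length === 0) return 0;
  // Characters only count as matching if they are within this range of each other.
  const range = Math.max(0, Math.floor(Math.max(s1.length, s2.length) / 2) - 1);
  const matched1 = new Array(s1.length).fill(false);
  const matched2 = new Array(s2.length).fill(false);
  let matches = 0;
  for (let i = 0; i < s1.length; i++) {
    const lo = Math.max(0, i - range);
    const hi = Math.min(s2.length - 1, i + range);
    for (let j = lo; j <= hi; j++) {
      if (!matched2[j] && s1[i] === s2[j]) {
        matched1[i] = matched2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Count matched characters that appear in a different order in the two strings.
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < s1.length; i++) {
    if (!matched1[i]) continue;
    while (!matched2[k]) k++;
    if (s1[i] !== s2[k]) outOfOrder++;
    k++;
  }
  const t = outOfOrder / 2; // transpositions
  return (matches / s1.length + matches / s2.length + (matches - t) / matches) / 3;
}

// Winkler's extension: boost the score by the length of the common prefix (capped at 4).
function jaroWinkler(s1, s2, p = 0.1) {
  const j = jaro(s1, s2);
  let prefix = 0;
  while (prefix < 4 && prefix < s1.length && prefix < s2.length &&
         s1[prefix] === s2[prefix]) {
    prefix++;
  }
  return j + prefix * p * (1 - j);
}

console.log(jaroWinkler('MARTHA', 'MARHTA').toFixed(4)); // 0.9611
console.log(jaroWinkler('int', 'international') > jaroWinkler('int', 'splint')); // true
```

Note that this is a similarity (higher = better), unlike the edit distances discussed earlier, so results would be sorted in descending order.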

Conclusion:

the TFIDF ranking performed best among several token-based distance metrics, and a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro and later extended by Winkler. This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster. One simple way of combining the TFIDF method and the Jaro-Winkler is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro-Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper.

Related discussion

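I tried using existing fuzzy libraries like fuse.js and also found them to be terrible, so I wrote one that behaves basically like Sublime's search. https://github.com/farzher/fuzzysort

The only typo it allows is a transposition. It's pretty solid (1k stars, 0 issues), very fast, and handles your case easily: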

fuzzysort.go('int', ['international', 'splint', 'tinder']) // [{highlighted: 'international', score: 10}, {highlighted: 'splint', score: 3003}]

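Related discussion

Good question! But my thought is that, rather than trying to modify Damerau-Levenshtein, you might be better off trying a different algorithm, or combining/weighting the results from two algorithms.

It strikes me that Damerau-Levenshtein gives no particular weight to exact or close matches on the "starting prefix", whereas your users would apparently expect that.

I searched for "better than Levenshtein" and, among other things, found this: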

Comparison of String Distance Algorithms

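This mentions a number of "string distance" measures. Three that look particularly relevant to your requirements are:

Longest Common Substring distance: the minimum number of symbols that have to be removed from both strings until the resulting substrings are identical.

q-gram distance: the sum of absolute differences between the N-gram vectors of both strings.

Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.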
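The q-gram and Jaccard definitions above can be sketched directly from their one-line descriptions. The function names and the default of q = 2 (bigrams) are illustrative assumptions; the longest common substring distance is not covered here.

```javascript
// Count the q-grams (substrings of length q) of a string.
function qgramCounts(s, q = 2) {
  const counts = new Map();
  for (let i = 0; i + q <= s.length; i++) {
    const g = s.slice(i, i + q);
    counts.set(g, (counts.get(g) || 0) + 1);
  }
  return counts;
}

// q-gram distance: sum of absolute differences between the two count vectors.
function qgramDistance(a, b, q = 2) {
  const ca = qgramCounts(a, q);
  const cb = qgramCounts(b, q);
  let sum = 0;
  for (const g of new Set([...ca.keys(), ...cb.keys()])) {
    sum += Math.abs((ca.get(g) || 0) - (cb.get(g) || 0));
  }
  return sum;
}

// Jaccard distance: 1 - (shared q-grams / all observed q-grams), treating q-grams as sets.
function jaccardDistance(a, b, q = 2) {
  const sa = new Set(qgramCounts(a, q).keys());
  const sb = new Set(qgramCounts(b, q).keys());
  const all = new Set([...sa, ...sb]);
  if (all.size === 0) return 0; // both strings shorter than q: treated as identical here
  let shared = 0;
  for (const g of sa) if (sb.has(g)) shared++;
  return 1 - shared / all.size;
}

console.log(qgramDistance('abcd', 'abce'));   // 2   ("cd" and "ce" are unshared)
console.log(jaccardDistance('abcd', 'abce')); // 0.5 (2 shared out of 4 observed)
```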

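Maybe you could use a weighted combination (or minimum) of these metrics together with Levenshtein (common substring, common N-gram, and Jaccard will all strongly prefer similar strings), or perhaps try just using Jaccard?

Depending on the size of your list/database, these algorithms can be moderately expensive. For a fuzzy search I implemented, I used a configurable number of N-grams as "retrieval keys" from the DB, then ran the expensive string-distance measure to sort the retrieved candidates in order of preference.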
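The two-stage idea described above (cheap N-gram retrieval, then an expensive distance on the survivors) might look roughly like this in memory. This is a sketch under assumptions: the answer used database keys, whereas this uses a `Map`, and all function names as well as the choice of bigrams and plain Levenshtein are illustrative.

```javascript
// Set of lowercase n-grams of a string.
function ngrams(s, n = 2) {
  const out = new Set();
  const t = s.toLowerCase();
  for (let i = 0; i + n <= t.length; i++) out.add(t.slice(i, i + n));
  return out;
}

// Inverted index: n-gram -> set of positions of the items that contain it.
function buildIndex(items, n = 2) {
  const index = new Map();
  items.forEach((item, pos) => {
    for (const g of ngrams(item, n)) {
      if (!index.has(g)) index.set(g, new Set());
      index.get(g).add(pos);
    }
  });
  return index;
}

// Cheap step: keep items that share at least minShared n-grams with the query.
function candidates(index, items, query, n = 2, minShared = 1) {
  const hits = new Map();
  for (const g of ngrams(query, n)) {
    for (const pos of index.get(g) || []) hits.set(pos, (hits.get(pos) || 0) + 1);
  }
  return [...hits].filter(([, count]) => count >= minShared).map(([pos]) => items[pos]);
}

// Stand-in for the "expensive" measure: plain Levenshtein distance (two-row DP).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[b.length];
}

// Expensive step: score only the retrieved candidates and sort by distance.
function search(index, items, query, distance = levenshtein) {
  return candidates(index, items, query)
    .map((item) => ({ item, distance: distance(query, item) }))
    .sort((x, y) => x.distance - y.distance);
}

const items = ['international', 'splint', 'tinder', 'banana'];
const index = buildIndex(items);
console.log(search(index, items, 'int').map((r) => r.item));
// [ 'splint', 'tinder', 'international' ]
```

Here banana is never scored because it shares no bigram with the query. The ordering of the remaining three reflects plain Levenshtein, so a different `distance` function can be passed in to change the ranking.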

I wrote some notes on fuzzy string search in SQL. See:

Fuzzy String Search in SQL

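Here is a technique I have used a few times... It gives pretty good results. It does not do everything you asked for, though. Also, this can be expensive if the list is massive.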

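get_bigrams = (string) ->
  s = string.toLowerCase()
  # A string of length n has n - 1 bigrams (none if n < 2).
  v = new Array(Math.max(s.length - 1, 0))
  # Exclusive range, so no trailing single-character "bigram" is produced.
  for i in [0...v.length] by 1
    v[i] = s.slice(i, i + 2)
  return v

string_similarity = (str1, str2) ->
  if str1.length > 0 and str2.length > 0
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = pairs1.length + pairs2.length
    hit_count = 0
    for x in pairs1
      # Consume each matched bigram so repeated bigrams are not counted more
      # than once; this keeps the result between 0 and 1.
      j = pairs2.indexOf(x)
      if j >= 0
        hit_count++
        pairs2.splice(j, 1)
    if hit_count > 0
      return ((2.0 * hit_count) / union)
  return 0.0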

Pass two strings to string_similarity, which will return a number between 0 and 1.0 depending on how similar they are. This example uses Lo-Dash.

Usage example...

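# Assumes Lo-Dash is already loaded as `_`.
query = 'jenny Jackson'
names = ['John Jackson', 'Jack Johnson', 'Jerry Smith', 'Jenny Smith']

results = []
for name in names
  relevance = string_similarity(query, name)
  obj = {name: name, relevance: relevance}
  results.push(obj)

# Sort by descending relevance and keep the top 10.
# _.take is used because newer Lo-Dash versions dropped the count argument of _.first.
results = _.take(_.sortBy(results, 'relevance').reverse(), 10)

console.log results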

Also...

Make sure your console is open, or you won't see anything :)

Related discussion

You can take a look at Atom's https://github.com/atom/fuzzaldrin/ lib.

It is available on npm, has a simple API, and worked OK for me.

> fuzzaldrin.filter(['international', 'splint', 'tinder'], 'int');
< ["international","splint"]

Related discussion

Here is my short function for fuzzy matching:

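function fuzzyMatch(pattern, str) {
  // Escape regex metacharacters so characters such as '.' or '(' match literally.
  const chars = pattern.split('').map((c) => c.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  // Allow any run of characters before, between, and after the pattern's characters.
  const re = new RegExp('.*' + chars.join('.*') + '.*');
  return re.test(str);
}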

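Related discussion

Update, November 2019. I found that Fuse has had some pretty decent upgrades. However, I could not get it to use boolean operators (i.e. OR, AND, etc.), nor could I use the API search interface to filter results.

I discovered nextapps-de/flexsearch: https://github.com/nextapps-de/flexsearch, and I believe it far surpasses many of the other JavaScript search libraries I've tried. It supports boolean operators, filtered searches, and pagination.

You can supply a list of JavaScript objects as your search data (i.e. storage), and the API is fairly well documented: https://github.com/nextapps-de/flexsearch#api-overview

So far I've indexed close to 10,000 records, and my searches are next to immediate; i.e. each search takes an unnoticeable amount of time.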

Check out my Google Sheets add-on called Flookup and use this function:

Flookup (lookupValue, tableArray, lookupCol, indexNum, threshold, [rank])

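The parameter details are:

lookupValue: the value you're looking up

tableArray: the table you want to search

lookupCol: the column you want to search

indexNum: the column you want data to be returned from

threshold: the percentage similarity below which data shouldn't be returned

rank: the nth best match (i.e. if the first match isn't to your liking)

This should satisfy your requirements... though I'm not sure about point number 2.

Find out more at the official website.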

jsfiddle http://jsfiddle.net/guest271314/QP7z5/