Checking fuzzy/approximate substring existing in a longer string, in Python?

Using algorithms like Levenshtein (e.g. the Levenshtein module or difflib), it is easy to find approximate matches:

>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571

A fuzzy match can be detected by deciding on a threshold as needed.

Current requirement: find a fuzzy substring, based on a threshold, in a larger string.

E.g.:

large_string ="thelargemanhatanproject is a great project in themanhattincity"
query_string ="manhattan"
#result ="manhatan","manhattin" and their indexes in large_string

A brute-force solution is to generate all substrings of length n-1 to n+1 (or other matching lengths), where n is the length of query_string, run Levenshtein on them one by one, and check the threshold.
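
For illustration, here is a minimal sketch of that brute-force idea (using difflib's ratio as the similarity measure; the window sizes and threshold are illustrative, not part of the original question):

import difflib

def brute_force_fuzzy(query, large, threshold):
    # Slide windows of length n-1 .. n+1 over `large` and keep the
    # ones whose similarity to `query` clears the threshold.
    n = len(query)
    hits = []
    for size in range(n - 1, n + 2):
        for start in range(len(large) - size + 1):
            window = large[start:start + size]
            if difflib.SequenceMatcher(None, window, query).ratio() >= threshold:
                hits.append((window, start))
    return hits

large_string = "thelargemanhatanproject is a great project in themanhattincity"
print(brute_force_fuzzy("manhattan", large_string, 0.85))

Note that overlapping windows produce several neighbouring hits for each occurrence, which is exactly the redundancy that motivates looking for a better solution.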

Is there a better solution available in Python, preferably a module included in Python 2.7, or an externally available one?

UPDATE: The Python regex module works quite well, although it is a little slower than the built-in re module for fuzzy-substring cases, an expected result given the extra operations. The desired output is good, and the amount of fuzziness to allow is easy to control.

>>> import regex
>>> input ="Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b',input,flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>


How about using difflib.SequenceMatcher.get_matching_blocks?

>>> import difflib
>>> large_string ="thelargemanhatanproject"
>>> query_string ="manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888

>>> query_string ="banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666

Update

import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string ="thelargemanhatanproject is a great project in themanhattincity"
query_string ="manhattan"
print list(matches(large_string, query_string, 0.8))

The above code prints: ['manhatan', 'manhattn']
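
The question also asks for the indexes in large_string. That is not part of the original answer, but a small, hypothetical extension of the same generator can track each word's offset and yield the position of the first matching block (it assumes words are separated by single spaces):

import difflib

def matches_with_indexes(large_string, query_string, threshold):
    offset = 0
    for word in large_string.split():
        s = difflib.SequenceMatcher(None, word, query_string)
        blocks = [b for b in s.get_matching_blocks() if b[2]]
        match = ''.join(word[i:i + n] for i, _, n in blocks)
        if blocks and len(match) / float(len(query_string)) >= threshold:
            # position of the first matching block within large_string
            yield match, offset + blocks[0][0]
        offset += len(word) + 1  # +1 for the single-space separator

large_string = "thelargemanhatanproject is a great project in themanhattincity"
print(list(matches_with_indexes(large_string, "manhattan", 0.8)))
# [('manhatan', 8), ('manhattn', 49)]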


I use fuzzywuzzy to fuzzy match on a threshold, and fuzzysearch to fuzzy extract the words from the match.

process.extractBests takes a query, a list of words, and a cutoff score, and returns a list of (match, score) tuples above the cutoff score.

find_near_matches takes the result of process.extractBests and returns the start and end indexes of the words. I use those indexes to build the matched words, and use the built words to find their indexes in the large string. max_l_dist of find_near_matches is the Levenshtein distance; it has to be adjusted to your needs.

from fuzzysearch import find_near_matches
from fuzzywuzzy import process

large_string ="thelargemanhatanproject is a great project in themanhattincity"
query_string ="manhattan"

def fuzzy_extract(qs, ls, threshold):
    '''fuzzy matches 'qs' in 'ls' and returns list of
    tuples of (word,index)
    '''

    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
        print('word {}'.format(word))
        for match in find_near_matches(qs, word, max_l_dist=1):
            match = word[match.start:match.end]
            print('match {}'.format(match))
            index = ls.find(match)
            yield (match, index)

Testing:

print('query: {}\nstring: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 70):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "citi"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "greet"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match, index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

Output:

query: manhattan
string: thelargemanhatanproject is a great project in themanhattincity
match: manhatan
index: 8
match: manhattin
index: 49

query: citi
string: thelargemanhatanproject is a great project in themanhattincity
match: city
index: 58

query: greet
string: thelargemanhatanproject is a great project in themanhattincity
match: great
index: 29


The new regex library, which is supposed to eventually replace re, includes fuzzy matching.

https://pypi.python.org/pypi/regex/

The fuzzy matching syntax looks fairly expressive, but this will get you a match with one or fewer insertions/additions/deletions:

import regex
regex.match('(amazing){e<=1}', 'amaging')
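
(An extra sketch, not from the original answer: the same fuzzy syntax works with regex.finditer to pull approximate substrings and their spans out of the question's large_string; the regex.BESTMATCH flag asks the engine for the match with the fewest errors rather than the first acceptable one.)

import regex

large_string = "thelargemanhatanproject is a great project in themanhattincity"

# Allow up to two errors (insertions/deletions/substitutions) in total.
for m in regex.finditer(r'(manhattan){e<=2}', large_string, regex.BESTMATCH):
    print(m.span(), m.group())  # expect hits around 'manhatan' and 'manhattin'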


Recently I wrote an alignment library for Python: https://github.com/eseraygun/python-alignment

Using it, you can perform both global and local alignments with arbitrary scoring strategies on any pair of sequences. Actually, in your case you need semi-local alignments, since you don't care about the substrings of query_string. I have simulated a semi-local algorithm using local alignment plus some heuristics in the following code, but it would be easy to extend the library for a proper implementation.

Here is the example code from the README, modified for your case.

from alignment.sequence import Sequence, GAP_ELEMENT
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, LocalSequenceAligner

large_string ="thelargemanhatanproject is a great project in themanhattincity"
query_string ="manhattan"

# Create sequences to be aligned.
a = Sequence(large_string)
b = Sequence(query_string)

# Create a vocabulary and encode the sequences.
v = Vocabulary()
aEncoded = v.encodeSequence(a)
bEncoded = v.encodeSequence(b)

# Create a scoring and align the sequences using local aligner.
scoring = SimpleScoring(1, -1)
aligner = LocalSequenceAligner(scoring, -1, minScore=5)
score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True)

# Iterate over optimal alignments and print them.
for encoded in encodeds:
    alignment = v.decodeSequenceAlignment(encoded)

    # Simulate a semi-local alignment.
    if len(filter(lambda e: e != GAP_ELEMENT, alignment.second)) != len(b):
        continue
    if alignment.first[0] == GAP_ELEMENT or alignment.first[-1] == GAP_ELEMENT:
        continue
    if alignment.second[0] == GAP_ELEMENT or alignment.second[-1] == GAP_ELEMENT:
        continue

    print alignment
    print 'Alignment score:', alignment.score
    print 'Percent identity:', alignment.percentIdentity()
    print

The output with minScore=5 is as follows:

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t - i
m a n h a t t a n
Alignment score: 5
Percent identity: 77.7777777778

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

If you remove the minScore argument, you get only the best-scoring matches:

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

Note that all algorithms in the library have O(n * m) time complexity, where n and m are the lengths of the sequences.


The approaches above are good, but I needed to find a small needle in a lot of hay, and ended up approaching it like this:

from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs

needle ="this is the string we want to find"
hay    ="text text lots of text and more and more this string is the one we wanted to find and here is some more and even more still"

needle_length  = len(needle.split())
max_sim_val    = 0
max_sim_string = u""

for ngram in ngrams(hay.split(), needle_length + int(.2*needle_length)):
    hay_ngram = u"".join(ngram)
    similarity = SM(None, hay_ngram, needle).ratio()
    if similarity > max_sim_val:
        max_sim_val = similarity
        max_sim_string = hay_ngram

print max_sim_val, max_sim_string

Yields:

0.72972972973 this string is the one we wanted to find