Regular expressions: is it worth using Python's re.compile?

Is there any benefit in using compile for regular expressions in Python?

    h = re.compile('hello')
    h.match('hello world')

vs

    re.match('hello', 'hello world')

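Related discussion

Apart from that, re.sub in 2.6 won't take a flags argument...
I just ran into a case where using re.compile gave a 10-50x improvement. The moral is that if you have a lot of regexes (more than MAXCACHE = 100) and you use each of them many times (with more than MAXCACHE other regexes in between, so that each one gets flushed from the cache: using the same one many times and then moving on to the next does not count), then compiling them definitely helps. Otherwise, it makes no difference.
One small thing to note is that for strings that do not need a regex, the `in` substring test is MUCH faster: >python -m timeit -s "import re" "re.match('hello', 'hello world')" 1000000 loops, best of 3: 1.41 usec per loop >python -m timeit "x = 'hello' in 'hello world'" 10000000 loops, best of 3: 0.0513 usec per loop
@Gamrix Note: do not use "in". Checking with "in" is a problem because it checks for exact characters, not space-separated words. For example, 'wo' in 'hello world' returns True, and so does 'world' in 'hello world'. Better to use a regex.
@MANU, dude, seriously? Did you even look at his regex? re.match('hello', 'hello world') is exactly equivalent to "in". That behavior is not inherently bad. It is bad for your specific use case, which is far from universal.
@arjunyg DUDE... the idea is to let people know what can go wrong in case they only look at a simple example (using "in" for hello world)... Leaning on a known concept (with incomplete knowledge, as in this case) for UNIVERSAL or non-TRIVIAL use cases is the root cause of many problems!!
@MANU I find it unlikely that anyone would fail to realize that 'wo' in 'hello world' returns True. People probably understand that concept before they learn for loops and while loops, let alone an advanced concept like regular expressions.
@NicholasPipitone The idea is not to rely on 'in' for checking words but to use a regex, the right REGEX though, as the requirement dictates. If 'in' is fine for your use case, go ahead, as long as you know what you are doing. Because re.match('hello', 'hello world') and re.match('hello', 'helloworld') both return the same result, what the user wants is ambiguous, and people who are new to Python tend to make this mistake. So for clarity (mainly for newcomers): re.match(r'\bhello\b', 'hello world') vs re.match(r'\bhello\b', 'helloworld').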
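
The difference argued over in these comments can be checked directly. The following is only a small sketch; the variable names are illustrative and not taken from the thread. Note that re.match anchors only at the start of the string, so it behaves more like str.startswith than like `in`.

```python
import re

text = "hello world"

# Plain substring test: true for any run of characters, whole word or not.
print("wo" in text)                           # True
print("world" in text)                        # True

# re.match only looks at the start of the string; re.search scans the string.
print(bool(re.match("hello", text)))          # True
print(bool(re.match("world", text)))          # False
print(bool(re.search("world", text)))         # True

# Word boundaries distinguish a whole word from a mere prefix.
whole_word = re.compile(r"\bhello\b")
print(bool(whole_word.match("hello world")))  # True
print(bool(whole_word.match("helloworld")))   # False
```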

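I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference. Obviously, this is anecdotal, and certainly not a great argument against compiling, but I've found the difference to be negligible.

EDIT:
After a quick glance at the actual Python 2.5 library code, I see that Python internally compiles AND CACHES regexes whenever you use them anyway (including calls to re.match()), so you're really only changing WHEN the regex gets compiled, and shouldn't be saving much time at all - only the time it takes to check the cache (a key lookup on an internal dict type).

From module re.py (comments are mine):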

    def match(pattern, string, flags=0):
        return _compile(pattern, flags).match(string)

    def _compile(*key):

        # Does cache check at top of function
        cachekey = (type(key[0]),) + key
        p = _cache.get(cachekey)
        if p is not None: return p

        # ...
        # Does actual compilation on cache miss
        # ...

        # Caches compiled regex
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        _cache[cachekey] = p
        return p

I still often pre-compile regular expressions, but only to bind them to a nice, reusable name, not for any expected performance gain.
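
The caching behaviour described above can be observed from the outside. This is a minimal sketch that assumes CPython, where the cache is an implementation detail and not a documented guarantee; re.purge() is the public call that empties it.

```python
import re

re.purge()  # start from an empty module-level cache

first = re.compile(r"\d{3}-\d{4}")
second = re.compile(r"\d{3}-\d{4}")
# On CPython the second call is a cache hit and hands back the same object.
same_object = first is second

re.purge()  # simulate the cache being cleared by other code
third = re.compile(r"\d{3}-\d{4}")
# `first` is still alive and usable, but a fresh object had to be compiled.
recompiled = third is not first

print(same_object, recompiled)
```

Holding on to `first` is what keeps a pattern usable without recompilation, regardless of what happens to the cache.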

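Related discussion

Your conclusion is inconsistent with your answer. If regexes are compiled and stored automatically, there is no need in most cases to do it by hand.
J. F. Sebastian, it serves as a signal to the programmer that the regex in question will be used a lot and is not meant to be a throwaway.
More than that, I'd say that if you don't want to suffer the compile and cache hit at some performance-critical part of your application, you're best off compiling the patterns beforehand in a non-critical part of your application.
I can just add that _MAXCACHE = 100 in 2.5+ and 3.0.
I see the main advantage of using a compiled regex when you reuse the same regex multiple times, which reduces the possibility of typos. If you are just calling it once, then uncompiled is more readable.
So the main difference shows up when you are using lots of different regexes (more than _MAXCACHE), some of them just once and others many times... then it is important to keep your compiled expressions for the ones used more, so they are not flushed out of the cache when it is full.
@J.F. - Also, if you rely on compiling and caching, who knows when the cache may be cleared, and then your regex will have to be recompiled.
If you are using Python < 2.7 or 3.1, then "re.sub" lacks the 'flags' parameter. So if you want to do a case-insensitive re.sub, you are stuck with re.compile("...", re.I).sub(...).
I think everyone is missing the overall point. Even if you rule out typos and the unknown timing of the GC, the fact is that if you need to run the same regex 100,000 times in a row, not having to do the cache lookup 100,000 times is faster. Think about parsing large log files with regexes; either way, the language ought to handle this better.
Another reason to avoid the compile step is to keep regexes closer to their point of use. I have a loop with several dozen substitutions. Creating names for the compiled regexes, and having to look back at the start of the loop to count the parentheses in an RE, makes the program less readable.
The entire cache is cleared if it becomes saturated?!?! I would have gone for an LFU or LRU cache. Even more reason to compile patterns I intend to use more than once. You never know whether some other module is also filling and clearing the cache.
@WojonsTech I thought of that too, but does it have any significant impact?
"I've had a lot of experience running a compiled regex versus compiling on-the-fly, and have not noticed any perceivable difference..." This is too vague and misleading. Using a pre-compiled regex is 3x faster on second use, and even 2x faster on first use. Period. The question is whether regex speed is critical for a given task. Otherwise, using the pattern directly ad hoc is in most cases easier to write, read and debug.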

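For me, the biggest benefit of re.compile is being able to separate the definition of the regex from its use.

Even a simple expression such as 0|[1-9][0-9]* (an integer in base 10 without leading zeros) can be complex enough that you'd rather not have to retype it, check whether you made any typos, and later recheck for typos again when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.

It's certainly possible to store strings and pass them to re.match; however, that's less readable:

    num = "..."
    # then, much later:
    m = re.match(num, input)

Versus compiling:

    num = re.compile("...")
    # then, much later:
    m = num.match(input)

Though it is quite close, the last line of the second version feels more natural and simpler when used repeatedly.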
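
As an illustration of the naming argument, here is a small sketch built on the pattern quoted above. The names NUM_B10 and is_base10_integer are hypothetical, chosen only for this example.

```python
import re

# Hypothetical name; the pattern is the one quoted in the answer above.
NUM_B10 = re.compile(r"0|[1-9][0-9]*")

def is_base10_integer(text):
    """Return True if text is a base-10 integer with no leading zeros."""
    return NUM_B10.fullmatch(text) is not None

print(is_base10_integer("0"))     # True
print(is_base10_integer("1024"))  # True
print(is_base10_integer("007"))   # False
```

The pattern is written once, next to a name that says what it means, and every call site reads as a method call on that name.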

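Related discussion

FWIW:

    $ python -m timeit -s "import re" "re.match('hello', 'hello world')"
    100000 loops, best of 3: 3.82 usec per loop

    $ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
    1000000 loops, best of 3: 1.26 usec per loop

So, if you're going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don't think you really lose much clarity or straightforwardness by using re.compile if you suspect that your regexes may become a performance bottleneck.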

Update:

With Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

    % python -m timeit -s "import re" "re.match('hello', 'hello world')"
    1000000 loops, best of 3: 0.661 usec per loop

    % python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
    1000000 loops, best of 3: 0.285 usec per loop

    % python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
    1000000 loops, best of 3: 0.65 usec per loop

    % python --version
    Python 3.6.5 :: Anaconda, Inc.

I also added a case (notice the quotation mark differences between the last two runs) that shows that re.match(x, ...) is literally [roughly] equivalent to re.compile(x).match(...), i.e. no behind-the-scenes caching of the compiled representation seems to happen.

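Related discussion

The main problem with your methodology here is that the setup argument is NOT included in the timing. Thus, you've removed the compilation time from the second example and just averaged it out in the first example. This doesn't mean the first example compiles every time.
Yes, I agree that this is not a fair comparison of the two cases.
I see what you mean, but isn't that exactly what would happen in an actual application where the regex is used many times?
@dF: You're right if you're only concerned with the performance of one particular section of the code and you're able to precompile the regex in another section. Otherwise, you need to time the re.compile call and include it in the second number for a fair comparison.
@Triptych, @Kiv: The whole point of compiling regexes separately from their use is to minimize compilation; removing it from the timing is exactly what dF should have done, because it represents real-world use most accurately. Compilation time is especially irrelevant with the way timeit.py does its timings here; it does several runs and only reports the shortest one, at which point the compiled regex is cached. The extra cost you're seeing here is not the cost of compiling the regex, but the cost of looking it up in the compiled regex cache (a dictionary).
This test is misleading. The total execution time of both tests will be equivalent in real code. Compiling first lets you decide when to eat those CPU cycles, as opposed to not deciding.
@Triptych Should the import re be moved out of the setup? It's all about where you want to measure. If I run a Python script numerous times, it pays the import re time hit each time. When comparing the two, it's important to separate the two lines for timing. Yes, as you say, there is a point where you take the time hit. The comparison shows that either you take the time hit once and then repeat the smaller cost by compiling, or you take the hit every time, assuming the cache gets cleared between calls, which as has been pointed out could happen. Adding a timing of h=re.compile('hello') would help clarify.
If you're on a Linux-like OS, just do time python -m... instead of python -m..., and if your results are like mine you'll see that pre-compiling really is a significant performance win (for me, 3.87 seconds total CPU for the first and 1.64 seconds for the second).
The regex is a simple string. The timing results are the first regex search time plus the cache lookup time. Setting aside the initial compile time and the other flaws of this methodology, if we were to substantially increase the regex time, would the difference between the two timings still justify precompiling?
Running python -m timeit -s "import re; n=1000" "h=re.compile('hello'); [ h.match('hello world') for i in range(n) ]" vs python -m timeit -s "import re; n=1000" "[ re.match('hello', 'hello world') for i in range(n) ]" still gives 2x faster run times for the precompiled regex. This counters the previous comments claiming that this is an unfair comparison. I suggest adding these benchmarks to the answer.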
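
The three variants argued about in these comments can also be timed from inside Python with the timeit module. This is only a rough sketch: the function names and the iteration count are arbitrary choices for illustration, and the numbers will vary by machine and Python version.

```python
import re
import timeit

PATTERN = "hello"
TEXT = "hello world"
compiled = re.compile(PATTERN)

def module_level():
    return re.match(PATTERN, TEXT)          # cache lookup on every call

def precompiled():
    return compiled.match(TEXT)             # direct method call, no lookup

def compile_each_time():
    return re.compile(PATTERN).match(TEXT)  # lookup through re.compile each call

for func in (module_level, precompiled, compile_each_time):
    # Taking min() over several repeats is the usual way to reduce noise.
    best = min(timeit.repeat(func, number=50_000, repeat=3))
    print(f"{func.__name__:>18}: {best:.4f} s per 50,000 calls")
```

Any gap between module_level and precompiled here is the per-call overhead of the module-level wrapper and its cache lookup, not the cost of compiling the pattern.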

Here's a simple test case:

    ~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 're.match("[0-9]{3}-[0-9]{3}-[0-9]{4}","123-123-1234")'; done
    1 loops, best of 3: 3.1 usec per loop
    10 loops, best of 3: 2.41 usec per loop
    100 loops, best of 3: 2.24 usec per loop
    1000 loops, best of 3: 2.21 usec per loop
    10000 loops, best of 3: 2.23 usec per loop
    100000 loops, best of 3: 2.24 usec per loop
    1000000 loops, best of 3: 2.31 usec per loop

With re.compile:

    ~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 'r = re.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}")' 'r.match("123-123-1234")'; done
    1 loops, best of 3: 1.91 usec per loop
    10 loops, best of 3: 0.691 usec per loop
    100 loops, best of 3: 0.701 usec per loop
    1000 loops, best of 3: 0.684 usec per loop
    10000 loops, best of 3: 0.682 usec per loop
    100000 loops, best of 3: 0.694 usec per loop
    1000000 loops, best of 3: 0.702 usec per loop

So it would seem that compiling is faster in this simple case, even if you only match once.

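Related discussion

I just tried this myself. For the simple case of parsing a number out of a string and summing it, using a compiled regular expression object is about twice as fast as using the re methods.

As others have pointed out, the re methods (including re.compile) look up the regular expression string in a cache of previously compiled expressions. Therefore, in the normal case, the extra cost of using the re methods is simply the cost of the cache lookup.

However, examination of the code shows that the cache is limited to 100 expressions. This raises the question: how painful is it to overflow the cache? The code contains an internal interface to the regular expression compiler, re.sre_compile.compile. If we call it, we bypass the cache. It turns out to be about two orders of magnitude slower for a basic regular expression such as r'\w+\s+([0-9_]+)\s+\w*'.

Here's my test: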

    #!/usr/bin/env python
    import re
    import time

    def timed(func):
        def wrapper(*args):
            t = time.time()
            result = func(*args)
            t = time.time() - t
            print '%s took %.3f seconds.' % (func.func_name, t)
            return result
        return wrapper

    regularExpression = r'\w+\s+([0-9_]+)\s+\w*'
    testString = "average 2 never"

    @timed
    def noncompiled():
        a = 0
        for x in xrange(1000000):
            m = re.match(regularExpression, testString)
            a += int(m.group(1))
        return a

    @timed
    def compiled():
        a = 0
        rgx = re.compile(regularExpression)
        for x in xrange(1000000):
            m = rgx.match(testString)
            a += int(m.group(1))
        return a

    @timed
    def reallyCompiled():
        a = 0
        rgx = re.sre_compile.compile(regularExpression)
        for x in xrange(1000000):
            m = rgx.match(testString)
            a += int(m.group(1))
        return a

    @timed
    def compiledInLoop():
        a = 0
        for x in xrange(1000000):
            rgx = re.compile(regularExpression)
            m = rgx.match(testString)
            a += int(m.group(1))
        return a

    @timed
    def reallyCompiledInLoop():
        a = 0
        for x in xrange(10000):
            rgx = re.sre_compile.compile(regularExpression)
            m = rgx.match(testString)
            a += int(m.group(1))
        return a

    r1 = noncompiled()
    r2 = compiled()
    r3 = reallyCompiled()
    r4 = compiledInLoop()
    r5 = reallyCompiledInLoop()
    print "r1 =", r1
    print "r2 =", r2
    print "r3 =", r3
    print "r4 =", r4
    print "r5 =", r5

And here is the output on my machine:

    $ regexTest.py
    noncompiled took 4.555 seconds.
    compiled took 2.323 seconds.
    reallyCompiled took 2.325 seconds.
    compiledInLoop took 4.620 seconds.
    reallyCompiledInLoop took 4.074 seconds.
    r1 = 2000000
    r2 = 2000000
    r3 = 2000000
    r4 = 2000000
    r5 = 20000

The 'reallyCompiled' methods use the internal interface, which bypasses the cache. Note that the one that compiles on each loop iteration is only iterated 10,000 times, not one million.
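
The script above targets Python 2 and an internal interface. A rough Python 3 sketch of the same experiment can avoid internals by calling the public re.purge() before every match, which forces a real recompilation. This is a substitute technique, not the original author's method, and the loop count N is an arbitrary small value chosen so the slow variant finishes quickly.

```python
import re
import time

PATTERN = r"\w+\s+([0-9_]+)\s+\w*"
TEXT = "average 2 never"
N = 2_000  # kept small because one variant recompiles on every iteration

def timed(label, func):
    start = time.perf_counter()
    total = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s (sum={total})")
    return total

def cached_lookup():
    # re.match finds the compiled pattern in the module cache each time.
    return sum(int(re.match(PATTERN, TEXT).group(1)) for _ in range(N))

def forced_recompile():
    # re.purge() empties the cache, so every re.match has to compile again.
    total = 0
    for _ in range(N):
        re.purge()
        total += int(re.match(PATTERN, TEXT).group(1))
    return total

r_cached = timed("cached lookup   ", cached_lookup)
r_forced = timed("forced recompile", forced_recompile)
```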

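I agree with Honest Abe that the match(...) calls in the given examples are different. They are not one-to-one comparisons, and thus the outcomes vary. To simplify my reply, I use A, B, C, D for the functions in question. Oh yes, we are dealing with 4 functions in re.py instead of 3.

Running this piece of code:

    h = re.compile('hello')                   # (A)
    h.match('hello world')                    # (B)

is the same as running this code:

    re.match('hello', 'hello world')          # (C)

Because, when looking into the source re.py, (A + B) means:

    h = re._compile('hello')                  # (D)
    h.match('hello world')

and (C) is actually:

    re._compile('hello').match('hello world')

So, (C) is not the same as (B). In fact, (C) calls (B) after calling (D), which is also called by (A). In other words, (C) = (A) + (B). Therefore, comparing (A + B) inside a loop gives the same result as (C) inside a loop.

George's regexTest.py proved this for us.

    noncompiled took 4.555 seconds.           # (C) in a loop
    compiledInLoop took 4.620 seconds.        # (A + B) in a loop
    compiled took 2.323 seconds.              # (A) once + (B) in a loop

What everyone is interested in is how to get the 2.323-second result. To make sure compile(...) only gets called once, we need to keep the compiled regex object in memory. If we are using a class, we can store the object and reuse it every time our function gets called.

    class Foo:
        regex = re.compile('hello')

        def my_function(self, text):
            return self.regex.match(text)

If we are not using a class (which is my requirement today), then I have no comment. I'm still learning to use global variables in Python, and I know a global variable is a bad thing.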
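
For the case without a class, one common approach is a module-level constant. This is a minimal sketch with hypothetical names (HELLO_RE, starts_with_hello); a read-only module-level pattern is generally not considered a problematic global, since nothing reassigns it.

```python
import re

# Compiled once, when the module is first imported.
HELLO_RE = re.compile("hello")

def starts_with_hello(text):
    """Reuse the module-level pattern; nothing is recompiled per call."""
    return HELLO_RE.match(text) is not None

print(starts_with_hello("hello world"))  # True
print(starts_with_hello("goodbye"))      # False
```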

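One more point: I believe that using the (A) + (B) approach has an advantage. Here are some facts as I observed them (please correct me if I'm wrong):

Calling A once does one search in the _cache followed by one sre_compile.compile() to create a regex object. Calling A twice does two searches and one compile (because the regex object is cached).

If the _cache gets flushed in between, the regex object is released from memory and Python needs to compile again. (Someone suggested that Python won't recompile.)

If we keep the regex object by using (A), the regex object will still go into _cache and get flushed at some point. But our code keeps a reference to it, so the regex object will not be released from memory. Therefore, Python does not need to compile again.

The 2-second difference in George's test between compiledInLoop and compiled is mainly the time required to build the key and search the _cache. It does not reflect the compile time of the regex.

George's reallyCompiledInLoop test shows what happens if the compile is really redone every time: it is about 100x slower (he reduced the loop from 1,000,000 to 10,000).

Here are the only cases where (A + B) is better than (C):

If we can cache a reference to the regex object inside a class.

If we need to call (B) repeatedly (inside a loop or multiple times), we must cache the reference to the regex object outside the loop.

Cases where (C) is good enough:

We cannot cache a reference.

We only use it once in a while.

Overall, we don't have too many regexes (assuming the compiled one never gets flushed).

Just to recap, here are A, B and C:

    h = re.compile('hello')                   # (A)
    h.match('hello world')                    # (B)
    re.match('hello', 'hello world')          # (C)

Thanks for reading.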

Mostly, there is little difference whether you use re.compile or not. Internally, all of the functions are implemented in terms of a compile step:

    def match(pattern, string, flags=0):
        return _compile(pattern, flags).match(string)

    def fullmatch(pattern, string, flags=0):
        return _compile(pattern, flags).fullmatch(string)

    def search(pattern, string, flags=0):
        return _compile(pattern, flags).search(string)

    def sub(pattern, repl, string, count=0, flags=0):
        return _compile(pattern, flags).sub(repl, string, count)

    def subn(pattern, repl, string, count=0, flags=0):
        return _compile(pattern, flags).subn(repl, string, count)

    def split(pattern, string, maxsplit=0, flags=0):
        return _compile(pattern, flags).split(string, maxsplit)

    def findall(pattern, string, flags=0):
        return _compile(pattern, flags).findall(string)

    def finditer(pattern, string, flags=0):
        return _compile(pattern, flags).finditer(string)

In addition, re.compile() bypasses the extra indirection and caching logic:

    _cache = {}

    _pattern_type = type(sre_compile.compile("", 0))

    _MAXCACHE = 512
    def _compile(pattern, flags):
        # internal: compile pattern
        try:
            p, loc = _cache[type(pattern), pattern, flags]
            if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
                return p
        except KeyError:
            pass
        if isinstance(pattern, _pattern_type):
            if flags:
                raise ValueError(
                    "cannot process flags argument with a compiled pattern")
            return pattern
        if not sre_compile.isstring(pattern):
            raise TypeError("first argument must be string or compiled pattern")
        p = sre_compile.compile(pattern, flags)
        if not (flags & DEBUG):
            if len(_cache) >= _MAXCACHE:
                _cache.clear()
            if p.flags & LOCALE:
                if not _locale:
                    return p
                loc = _locale.setlocale(_locale.LC_CTYPE)
            else:
                loc = None
            _cache[type(pattern), pattern, flags] = p, loc
        return p
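
Two branches of the _compile excerpt above are easy to observe from user code: an already compiled pattern is passed straight through, and combining one with flags is rejected. The sketch below assumes current CPython behaves like the excerpt in these two respects; the variable names are illustrative.

```python
import re

pattern = re.compile("hello", re.IGNORECASE)

# A compiled pattern is accepted wherever a pattern string is,
# and re.compile() returns it unchanged.
same = re.compile(pattern) is pattern
found = re.match(pattern, "HELLO world") is not None

# Flags cannot be combined with an already compiled pattern.
try:
    re.match(pattern, "HELLO world", re.MULTILINE)
    flags_rejected = False
except ValueError:
    flags_rejected = True

print(same, found, flags_rejected)
```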

Beyond the small speed benefit from using re.compile, people also like the readability that comes from naming potentially complex pattern specifications and separating them from the business logic where they are applied:

    #### Patterns ############################################################
    number_pattern = re.compile(r'\d+(\.\d*)?')     # Integer or decimal number
    assign_pattern = re.compile(r':=')              # Assignment operator
    identifier_pattern = re.compile(r'[A-Za-z]+')   # Identifiers
    whitespace_pattern = re.compile(r'[\t ]+')      # Spaces and tabs

    #### Applications ########################################################

    if whitespace_pattern.match(s): business_logic_rule_1()
    if assign_pattern.match(s): business_logic_rule_2()

Note that one other respondent incorrectly believed that pyc files store compiled patterns directly; in reality, the patterns are rebuilt each time the PYC is loaded:

    >>> import marshal
    >>> from dis import dis
    >>> with open('tmp.pyc', 'rb') as f:
            f.read(8)
            dis(marshal.load(f))

      1           0 LOAD_CONST               0 (-1)
                  3 LOAD_CONST               1 (None)
                  6 IMPORT_NAME              0 (re)
                  9 STORE_NAME               0 (re)

      3          12 LOAD_NAME                0 (re)
                 15 LOAD_ATTR                1 (compile)
                 18 LOAD_CONST               2 ('[aeiou]{2,5}')
                 21 CALL_FUNCTION            1
                 24 STORE_NAME               2 (lc_vowels)
                 27 LOAD_CONST               1 (None)
                 30 RETURN_VALUE

The above disassembly comes from the PYC file for a tmp.py containing:

    import re
    lc_vowels = re.compile(r'[aeiou]{2,5}')
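
The same point can be checked on Python 3 without reading a .pyc file by hand, since a .pyc stores the marshalled code object that compile() produces. This is a sketch assuming Python 3.8 or later (for re.Pattern); the exact bytecode printed by dis differs between versions.

```python
import dis
import re

SOURCE = "import re\nlc_vowels = re.compile(r'[aeiou]{2,5}')\n"

# Compile the source to a code object, which is what a .pyc file stores.
code = compile(SOURCE, "tmp.py", "exec")

# The constants hold the pattern text, not a compiled pattern object.
has_pattern_text = "[aeiou]{2,5}" in code.co_consts
has_compiled_pattern = any(isinstance(c, re.Pattern) for c in code.co_consts)

print(has_pattern_text, has_compiled_pattern)
dis.dis(code)  # exact bytecode varies between Python versions
```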

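Related discussion

In general, I find it easier to use flags (or at least easier to remember how), such as re.I when compiling patterns, than to use inline flags.

    >>> foo_pat = re.compile('foo',re.I)
    >>> foo_pat.findall('some string FoO bar')
    ['FoO']

vs

    >>> re.findall('(?i)foo','some string FoO bar')
    ['FoO']

Using re.compile() has an added benefit: it lets me add comments to my regex patterns by using re.VERBOSE.

    pattern = '''
    hello[ ]world    # Some info on my pattern logic. [ ] to recognize space
    '''

    re.search(pattern, 'hello world', re.VERBOSE)

Although this does not affect the speed of the code, I like to do it this way because it is part of my commenting habit. I thoroughly dislike spending time trying to remember the logic behind my code two months down the line when I want to make modifications.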
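
Both points combine naturally when the flags are passed to re.compile. A short sketch, with illustrative variable names:

```python
import re

# Compile-time flag versus inline flag: both match case-insensitively.
foo_pat = re.compile("foo", re.I)
by_flag = foo_pat.findall("some string FoO bar")
by_inline = re.findall("(?i)foo", "some string FoO bar")

# re.VERBOSE ignores unescaped whitespace and treats # as a comment start,
# so a literal space is written as [ ].
greeting = re.compile(r"""
    hello   # first word
    [ ]     # a single literal space
    world   # second word
""", re.VERBOSE)

print(by_flag, by_inline)                      # ['FoO'] ['FoO']
print(greeting.search("hello world").group())  # hello world
```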

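Related discussion

Using the given examples:

    h = re.compile('hello')
    h.match('hello world')

The match method in the example above is not the same as the one used below:

    re.match('hello', 'hello world')

re.compile() returns a regular expression object, which means h is a regex object.

The regex object has its own match method with the optional pos and endpos parameters:

regex.match(string[, pos[, endpos]])

pos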

The optional second parameter pos gives an index in the string where
the search is to start; it defaults to 0. This is not completely
equivalent to slicing the string; the '^' pattern character matches at
the real beginning of the string and at positions just after a
newline, but not necessarily at the index where the search is to
start.

endpos

The optional parameter endpos limits how far the string will be
searched; it will be as if the string is endpos characters long, so
only the characters from pos to endpos - 1 will be searched for a
match. If endpos is less than pos, no match will be found; otherwise,
if rx is a compiled regular expression object, rx.search(string, 0,
50) is equivalent to rx.search(string[:50], 0).

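The regex object's search, findall, and finditer methods also support these parameters.

As you can see, re.match(pattern, string, flags=0) does not support them,
nor do its search, findall, and finditer counterparts.

A match object has attributes that complement these parameters: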

match.pos

The value of pos which was passed to the search() or match() method of
a regex object. This is the index into the string at which the RE
engine started looking for a match.

match.endpos

The value of endpos which was passed to the search() or match() method
of a regex object. This is the index into the string beyond which the
RE engine will not go.

A regex object has two unique, possibly useful, attributes:

regex.groups

The number of capturing groups in the pattern.

regex.groupindex

A dictionary mapping any symbolic group names defined by (?P<id>) to
group numbers. The dictionary is empty if no symbolic groups were used
in the pattern.

And finally, a match object has this attribute:

match.re

The regular expression object whose match() or search() method
produced this match instance.
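
The quoted documentation can be exercised with a few lines. This is a small sketch; the patterns and variable names are made up for illustration.

```python
import re

anchored = re.compile("^b")
# pos is not the same as slicing: '^' still refers to the real start.
with_pos = anchored.search("ab", 1)       # no match
with_slice = anchored.search("ab"[1:])    # matches 'b'

plain = re.compile("b")
m = plain.match("ab", 1)                  # match() starts looking at index 1

record = re.compile(r"(?P<word>\w+) (\d+)")

print(with_pos, with_slice.group())            # None b
print(m.pos, m.endpos, m.re is plain)          # 1 2 True
print(record.groups, dict(record.groupindex))  # 2 {'word': 1}
```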

I ran across this test before stumbling upon the discussion here. Having run it, though, I thought I'd at least post my results.

I stole and bastardized the example in Jeff Friedl's "Mastering Regular Expressions". This is on a MacBook running OS X 10.6 (2 GHz Intel Core 2 Duo, 4 GB RAM). The Python version is 2.6.1.

Run 1 - using re.compile

    import re
    import time
    import fpformat
    Regex1 = re.compile('^(a|b|c|d|e|f|g)+$')
    Regex2 = re.compile('^[a-g]+$')
    TimesToDo = 1000
    TestString = ""
    for i in range(1000):
        TestString += "abababdedfg"
    StartTime = time.time()
    for i in range(TimesToDo):
        Regex1.search(TestString)
    Seconds = time.time() - StartTime
    print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

    StartTime = time.time()
    for i in range(TimesToDo):
        Regex2.search(TestString)
    Seconds = time.time() - StartTime
    print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

    Alternation takes 2.299 seconds
    Character Class takes 0.107 seconds

Run 2 - not using re.compile

    import re
    import time
    import fpformat

    TimesToDo = 1000
    TestString = ""
    for i in range(1000):
        TestString += "abababdedfg"
    StartTime = time.time()
    for i in range(TimesToDo):
        re.search('^(a|b|c|d|e|f|g)+$',TestString)
    Seconds = time.time() - StartTime
    print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

    StartTime = time.time()
    for i in range(TimesToDo):
        re.search('^[a-g]+$',TestString)
    Seconds = time.time() - StartTime
    print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

    Alternation takes 2.508 seconds
    Character Class takes 0.109 seconds

This answer might be arriving late, but it is an interesting find. Using compile can really save you time if you are planning to use the regex multiple times (this is also mentioned in the docs). Below you can see that using a compiled regex is fastest when the match method is called directly on it. Passing a compiled regex to re.match makes it even slower, and passing the pattern string to re.match lands somewhere in the middle.

    >>> ipr = r'\D+((([0-2][0-5]?[0-5]?)\.){3}([0-2][0-5]?[0-5]?))\D+'
    >>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
    1.5077415757028423
    >>> ipr = re.compile(ipr)
    >>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
    1.8324008992184038
    >>> average(*timeit.repeat("ipr.match('abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
    0.9187896518778871

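Besides the performance difference, using re.compile and then matching with the compiled regular expression object (for whatever regex-related operation) makes the semantics clearer to the Python run-time.

I had a painful experience debugging some simple code:

    compare = lambda s, p: re.match(p, s)

and later I used compare in

    [x for x in data if compare(patternPhrases, x[columnIndex])]

where patternPhrases is supposed to be a variable containing a regular expression string, and x[columnIndex] is a variable containing a string.

I had trouble: patternPhrases did not match some of the expected strings!

But if I had used the re.compile form:

    compare = lambda s, p: p.match(s)

then in

    [x for x in data if compare(patternPhrases, x[columnIndex])]

Python would have complained that "string does not have attribute of match", because through the positional argument mapping in compare, x[columnIndex] is used as the regular expression, when I actually meant

    compare = lambda p, s: p.match(s)

In my case, using re.compile makes the purpose of the regular expression more explicit when its value is hidden from the naked eye, so I could get more help from Python's run-time checking.

So the moral of my lesson is that when the regular expression is not just a literal string, I should use re.compile to let Python help me assert my assumption.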
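
The failure mode described in this answer can be reproduced in a few lines. This is a sketch with made-up data; compare_strings and compare_compiled stand in for the two versions of compare above.

```python
import re

compare_strings = lambda s, p: re.match(p, s)   # expects (string, pattern)
compare_compiled = lambda s, p: p.match(s)      # same order, compiled pattern

pattern_text = r"\d+ items"
pattern_obj = re.compile(pattern_text)
value = "42 items"

# Arguments swapped by mistake: with plain strings this just fails to match.
silent = compare_strings(pattern_text, value)

# The same mistake with a compiled pattern raises immediately.
try:
    compare_compiled(pattern_obj, value)
    error = None
except AttributeError as exc:
    error = str(exc)

print(silent)  # None
print(error)
```

With two plain strings, either argument order is valid input to re.match, so the mistake only shows up as missing matches; with a compiled pattern, the wrong order fails loudly.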

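According to the Python documentation:

The sequence

    prog = re.compile(pattern)
    result = prog.match(string)

is equivalent to

    result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

So my conclusion is: if you are going to match the same pattern against many different texts, you had better precompile it.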

Besides performance.

Using compile helped me distinguish the concepts of
1. the module (re),
2. the regex object,
3. the match object
when I started learning regex.

    # regex object
    regex_object = re.compile(r'[a-zA-Z]+')
    # match object
    match_object = regex_object.search('1.Hello')
    # matching content
    match_object.group()
    output:
    Out[60]: 'Hello'
    V.S.
    re.search(r'[a-zA-Z]+','1.Hello').group()
    Out[61]: 'Hello'

As a complement, I made an exhaustive cheat sheet of the re module for your reference.

    regex = {
    'brackets':{'single_character': ['[]', '.', {'negate':'^'}],
                'capturing_group' : ['()', '(?:)', '(?!)', '|', '\\', 'backreferences and named group'],
                'repetition'      : ['{}', '*?', '+?', '??', 'greedy v.s. lazy ?']},
    'lookaround' :{'lookahead'  : ['(?=...)', '(?!...)'],
                'lookbehind' : ['(?<=...)','(?<!...)'],
                'capturing'  : ['(?P<name>...)', '(?P=name)', '(?:)'],},
    'escapes':{'anchor'          : ['^', r'\b', '$'],
               'non_printable'   : [r'\n', r'\t', r'\r', r'\f', r'\v'],
               'shorthand'       : [r'\d', r'\w', r'\s']},
    'methods': [['search', 'match', 'findall', 'finditer'],
                ['split', 'sub']],
    'match_object': ['group','groups', 'groupdict','start', 'end', 'span',]
    }

Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):

    import re
    import time

    rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
    str = "average 2 never"
    a = 0

    t = time.time()

    for i in xrange(1000000):
        if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
        #~ if rgx.match(str):
            a += 1

    print time.time() - t

Running the above code once as is, and once with the two if lines commented the other way around, the compiled regex is twice as fast.

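Related discussion

This is a good question. You often see people use re.compile without reason. It lessens readability. But sure, there are lots of times when pre-compiling the expression is called for, such as when you use it repeatedly in a loop.

It's like everything about programming (everything in life, actually). Apply common sense.

Related discussion

I really respect all the above answers. In my opinion:
Yes! It is definitely worth using re.compile instead of compiling the regex again and again, every time.

Using re.compile makes your code more dynamic, as you can call the already compiled regex instead of compiling again and again. This benefits you in several ways:

Processor effort

Time complexity.

It makes the regex universal (it can be used in findall, search, match).

And it makes your program look cool.

Example:

    example_string = "The room number of her room is 26A7B."
    find_alpha_numeric_string = re.compile(r"\b\w+\b")

Using it in findall

    find_alpha_numeric_string.findall(example_string)

Using it in search

    find_alpha_numeric_string.search(example_string)

Similarly, you can use it for match and substitute.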
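
For completeness, here is a sketch of the same compiled pattern used across all four methods. The shorter variable name `word` and the replacement string are chosen for this example only.

```python
import re

example_string = "The room number of her room is 26A7B."
word = re.compile(r"\b\w+\b")

print(word.findall(example_string))
# ['The', 'room', 'number', 'of', 'her', 'room', 'is', '26A7B']
print(word.search(example_string).group())     # The
print(word.match(example_string).group())      # The
print(word.sub("_", example_string, count=2))  # _ _ number of her room is 26A7B.
```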

(months later) it's easy to add your own cache around re.match,
or around anything else for that matter -

    """ Re.py: Re.match = re.match + cache
        efficiency: re.py does this already (but what's _MAXCACHE ?)
        readability, inline / separate: matter of taste
    """

    import re

    cache = {}
    _re_type = type( re.compile( "" ))

    def match( pattern, str, *opt ):
        """ Re.match = re.match + cache re.compile( pattern )
        """
        if type(pattern) == _re_type:
            cpat = pattern
        elif pattern in cache:
            cpat = cache[pattern]
        else:
            cpat = cache[pattern] = re.compile( pattern, *opt )
        return cpat.match( str )

    # def search ...

A wibni (wouldn't it be nice if): cachehint( size= ), cacheinfo() -> size, hits, nclear ...
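
Something close to that wish list is available by wrapping re.compile in functools.lru_cache, which evicts the least recently used entry instead of clearing everything and reports hit and miss counts. This is a sketch, not part of the re module; cached_compile and the maxsize value are assumptions made for the example.

```python
import functools
import re

@functools.lru_cache(maxsize=256)  # size chosen arbitrarily for this sketch
def cached_compile(pattern, flags=0):
    """Compile once per (pattern, flags) pair; evict least recently used."""
    return re.compile(pattern, flags)

def match(pattern, string, flags=0):
    return cached_compile(pattern, flags).match(string)

match("hello", "hello world")
match("hello", "hello again")
info = cached_compile.cache_info()
print(info.hits, info.misses, info.currsize)
```

cache_info() also exposes maxsize, and cached_compile.cache_clear() empties the cache on demand.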

I've had a lot of experience running a compiled regex 1000s
of times versus compiling on-the-fly, and have not noticed
any perceivable difference

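The votes on the accepted answer lead to the assumption that what @Triptych says is true for all cases. This is not necessarily true. One big difference arises when you have to decide whether a function should accept a regex string or a compiled regex object as a parameter:

    >>> timeit.timeit(setup="""
    ... import re
    ... f=lambda x, y: x.match(y)       # accepts compiled regex as parameter
    ... h=re.compile('hello')
    ... """, stmt="f(h, 'hello world')")
    0.32881879806518555
    >>> timeit.timeit(setup="""
    ... import re
    ... f=lambda x, y: re.compile(x).match(y)   # compiles when called
    ... """, stmt="f('hello', 'hello world')")
    0.809190034866333

It is always better to compile your regexes in case you need to reuse them.

Note that the example in the timeit above simulates creating a compiled regex object once at import time versus "on-the-fly" when it is required for a match.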

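As an alternative answer, since I see it hasn't been mentioned before, I'll go ahead and quote the Python 3 docs:

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you're accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there's not much difference thanks to the internal cache.

With the second version, the regular expression is compiled before it is used. If you are going to execute it many times, it is definitely better to compile it first. If not, compiling every time you match is fine for one-offs.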

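I'd like to argue that pre-compiling is advantageous both conceptually and 'literately' (as in 'literate programming'). Have a look at this code snippet:

    from re import compile as _Re

    class TYPO:

        def text_has_foobar( self, text ):
            return self._text_has_foobar_re_search( text ) is not None
        _text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search

    TYPO = TYPO()

In your application, you'd write:

    from TYPO import TYPO
    print( TYPO.text_has_foobar( 'FOObar' ) )

This is about as simple in terms of functionality as it can get. Because this example is so short, I conflated the way to get _text_has_foobar_re_search into a single line. The disadvantage of this code is that it occupies a little memory for the whole lifetime of the TYPO library object; the advantage is that when doing a foobar search, you get away with two function calls and two class dictionary lookups. How many regexes are cached by re, and the overhead of that cache, are irrelevant here.

Compare this with the more usual style, below:

    import re

    class Typo:

        def text_has_foobar( self, text ):
            return re.compile( r"""(?i)foobar""" ).search( text ) is not None

And in the application:

    typo = Typo()
    print( typo.text_has_foobar( 'FOObar' ) )

I readily admit that my style is highly unusual for Python, maybe even debatable. However, in the example that more closely matches how Python is mostly used, in order to do a single match we must instantiate an object, do three instance dictionary lookups, and perform three function calls; additionally, we might run into re caching trouble when using more than 100 regexes. Also, the regular expression is hidden inside the method body, which most of the time is not such a good idea.

Be it said that every subset of these measures---targeted, aliased import statements; aliased methods where applicable; reduction of function calls and object dictionary lookups---can help reduce computational and conceptual complexity.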

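Related discussion

WTF. Not only did you dig out an old, answered question. Your code is non-idiomatic as well and wrong on so many levels - (ab)using classes as namespaces where a module is enough, capitalizing class names, etc... See pastebin.com/iTAXAWen for better implementations. Not to mention that the regex you use is broken as well. Overall, -1

Guilty. This is an old question, but I do not mind being #100 in a slowed-down conversation. The question has not been closed. I did warn that my code could be adversarial to some tastes. I think you could view it as a mere demonstration of what is doable in Python, like: if we take everything, everything we believe in, as optional, and then tinker together in whatever way, what do the things look like that we can get? I am sure you can discern the merits and demerits of this solution and can complain more articulately. Otherwise I must conclude that your claim of wrongness relies on little more than PEP008

No, it is not about PEP8. Those are just naming conventions, and I'd never downvote for not following them. I downvoted you because the code you showed is simply poorly written. It defies conventions and idioms for no reason, and is an incarnation of premature optimization: you'd have to optimize the living daylights out of all the other code for this to become a bottleneck, and even then the third rewrite I offered is shorter, more idiomatic and just as fast by your reasoning (same number of attribute accesses).

"Poorly written" - like why exactly? "Defies conventions and idioms" - I warned you. "For no reason" - yes, I do have a reason: simplify where complexity serves no purpose; "incarnation of premature optimization" - I am very much for a programming style that chooses a balance of readability and efficiency; the OP asked for the "benefit in using re.compile", which I understand as a question about efficiency. "(ab)using classes as namespaces" - it is your words that are abusive. The class is there so you have a "self" point of reference. I tried using modules for this purpose; classes work better.

"Capitalizing class names", "No, it is not about PEP8" - you're apparently so angry that you can't even tell what to bicker about first. "WTF", "wrong" --- see how emotional you are? More objectivity and less froth, please.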

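My understanding is that those two examples are effectively equivalent. The only difference is that in the first, you can reuse the compiled regular expression elsewhere without causing it to be compiled again.

Here's a reference for you: http://diveintopython3.ep.io/refactoring.html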

Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)

Related discussion

I didn't downvote you, but technically this is wrong: Python won't recompile anyway.