Bash: check if all of multiple strings or regexes exist in a file

I want to check if all of my strings exist in a text file. They could exist on the same line or on different lines. Partial matches should be OK. Like this:

...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on

In the above example, we could be using regexes in place of strings.

For example, the following code checks if any of my strings exists in the file:

if grep -Eq "string1|string2|string3" file; then
    # there is at least one match
fi

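How do I check if all of them exist? Since we are only interested in the presence of all matches, we should stop reading the file as soon as all strings are matched.

Is it possible to do this without having to invoke grep multiple times (which won't scale when the input file is large or when there is a large number of strings to match) or use a tool like awk or python?

Also, is there a solution for strings that can easily be extended to regexes?

Related discussion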

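awk is the tool that the people who invented grep, shell, etc. invented to do general text manipulation jobs like this, so I'm not sure why you'd want to avoid it.

In case brevity is what you're looking for, here's a GNU awk one-liner that does just what you asked for:

awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file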

And here's a bunch of other information and options:

Assuming you're really looking for strings, it'd be:

awk -v strings='string1 string2 string3' '
BEGIN {
    numStrings = split(strings,tmp)
    for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
    for (str in strs) {
        if ( index($0,str) ) {
            delete strs[str]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file

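The above will stop reading the file as soon as all strings have matched.

If you were looking for regexps instead of strings then, with GNU awk for multi-char RS and retention of $0 in the END section, you could do:

awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file

Actually, even if it was strings you could do:

awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file

The main issue with the above 2 GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time, whereas the first awk script above will work in any awk in any shell on any UNIX box and only stores one line of input at a time.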

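I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings", then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:

awk '
NR==FNR { strings[$0]; next }
{
    for (string in strings)
        if ( !index($0,string) )
            exit 1
}
' file_of_strings RS='^$' file_to_be_searched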

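and for regexps it'd be:

awk '
NR==FNR { regexps[$0]; next }
{
    for (regexp in regexps)
        if ( $0 !~ regexp )
            exit 1
}
' file_of_regexps RS='^$' file_to_be_searched

If you don't have GNU awk and your input file does not contain NUL characters, then you can get the same effect as above by using RS='\0' instead of RS='^$', or by appending to a variable one line at a time as it's read and then processing that variable in the END section.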
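The "append to a variable, then check in END" fallback could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name all_strings_in_file is hypothetical, the search strings sit one per line in the first file, blank lines in that file are skipped, the strings file is non-empty, and the whole target file fits in memory.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the answer):
#   all_strings_in_file STRINGS_FILE FILE_TO_BE_SEARCHED
# Succeeds only if every non-empty line of STRINGS_FILE occurs somewhere
# in FILE_TO_BE_SEARCHED as a literal substring.
all_strings_in_file() {
    awk '
        # First file: remember each non-empty search string as an array index.
        NR == FNR { if ($0 != "") strings[$0]; next }
        # Second file: append every line to one buffer, using plain awk only.
        { buf = buf $0 "\n" }
        END {
            for (s in strings)
                if (!index(buf, s))
                    exit 1      # at least one string is missing
            exit 0
        }
    ' "$1" "$2"
}
```

Called as all_strings_in_file file_of_strings file_to_be_searched, it should behave like the RS='^$' version for strings, at the cost of building the buffer line by line.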

If your file_to_be_searched is too large to fit in memory, then it'd be this for strings:

awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
    for (string in strings) {
        if ( index($0,string) ) {
            delete strings[string]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched

and the equivalent for regexps:

awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
    for (regexp in regexps) {
        if ( $0 ~ regexp ) {
            delete regexps[regexp]
            numRegexps--
        }
    }
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched

Related discussion

git grep

Here is the syntax for using git grep with multiple patterns:

git grep --all-match --no-index -l -e string1 -e string2 -e string3 file

You may also combine patterns with Boolean expressions such as --and, --or and --not.

Check man git-grep for help.

--all-match When giving multiple pattern expressions, this flag is specified to limit the match to files that have lines to match all of them.

--no-index Search files in the current directory that is not managed by Git.

-l/--files-with-matches/--name-only Show only the names of files.

-e The next parameter is the pattern. Default is to use basic regexp.

Other parameters to consider:

--threads Number of grep worker threads to use.

-q/--quiet/--silent Do not output matched lines; exit with status 0 when there is a match.

To change the pattern type, you may also use -G/--basic-regexp (default), -F/--fixed-strings, -E/--extended-regexp, -P/--perl-regexp, -f file, and others.

Related discussion

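First of all, you probably want to use awk. Since you eliminated that option in the question statement, yes, it is possible to do, and this provides a way to do it. It is likely MUCH slower than using awk, but if you want to do it anyway...

This is based on the following assumptions:

Invoking awk is unacceptable.
Invoking grep multiple times is unacceptable.
The use of any other external tools is unacceptable.
Invoking grep less than once is acceptable.
It must return success if everything is found, failure when not.
Using bash instead of external tools is acceptable.
The bash version is >= 3 for the regular expression version.

This might meet all of your requirements: (the regex version lacks some comments; look at the string version instead)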

#!/bin/bash

multimatch() {
    local filename="$1"     # Filename is first parameter
    shift                   # move it out of the way so that "$@" is useful
    local strings=( "$@" )  # search strings into an array
    local line string i foundmatch somethingleft

    local -a matches  # Array to keep track which strings already match

    # Initiate array tracking what we have matches for
    for ((i=0;i<${#strings[@]};i++)); do
        matches[$i]=0
    done

    while IFS= read -r line; do  # Read file linewise
        foundmatch=0  # Flag to indicate whether this line matched anything
        for ((i=0;i<${#strings[@]};i++)); do  # Loop through strings indexes
            if [ "${matches[$i]}" -eq 0 ]; then  # If no previous line matched this string yet
                string="${strings[$i]}"  # fetch the string
                if [[ $line = *"$string"* ]]; then  # check if it matches (quoted, so it is a literal, not a glob)
                    matches[$i]=1  # mark that we have found this
                    foundmatch=1   # set the flag, we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0  # Flag to see if we still have unmatched strings
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1  # Something is still outstanding
                    break  # no need to check whether more strings are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"

    # If we get here, we didn't have everything in the file
    return 1
}

multimatch_regex() {
    local filename="$1"     # Filename is first parameter
    shift                   # move it out of the way so that "$@" is useful
    local regexes=( "$@" )  # Regexes into an array
    local line regex i foundmatch somethingleft

    local -a matches  # Array to keep track which regexes already match

    # Initiate array tracking what we have matches for
    for ((i=0;i<${#regexes[@]};i++)); do
        matches[$i]=0
    done

    while IFS= read -r line; do  # Read file linewise
        foundmatch=0  # Flag to indicate whether this line matched anything
        for ((i=0;i<${#regexes[@]};i++)); do  # Loop through regex indexes
            if [ "${matches[$i]}" -eq 0 ]; then  # If no previous line matched this regex yet
                regex="${regexes[$i]}"  # Get regex from array
                if [[ $line =~ $regex ]]; then  # We use the bash regex operator here
                    matches[$i]=1  # mark that we have found this
                    foundmatch=1   # set the flag, we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0  # Flag to see if we still have unmatched regexes
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1  # Something is still outstanding
                    break  # no need to check whether more regexes are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"

    # If we get here, we didn't have everything in the file
    return 1
}

if multimatch "filename" string1 string2 string3; then
    echo "file has all strings"
else
    echo "file misses one or more strings"
fi

if multimatch_regex "filename" "regex1" "regex2" "regex3"; then
    echo "file matches all regular expressions"
else
    echo "file does not match all regular expressions"
fi

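Benchmark

I did some benchmarking, searching the .c, .h and .sh files in arch/arm/ of Linux 4.16.2 for the strings "void", "function", and "define". (Shell wrappers were added and the code was tuned so that everything can be called as testname [...] and the result can be checked with an if.)

Results: (measured with time, real time rounded to the closest half second)

multimatch: 49s
multimatch_regex: 55s
matchall: 105s
fileMatchesAllNames: 4s
awk (first version): 4s
agrep: 4.5s
Perl re (-r): 10.5s
Perl non-re: 9.5s
Perl non-re optimised: 5s (removed Getopt::Std and regex support for faster startup)
Perl re optimised: 7s (removed Getopt::Std and non-regex support for faster startup)
git grep: 3.5s
C version (no regex): 1.5s

(Invoking grep multiple times, especially with the recursive method, did better than I expected.)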

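Related discussion

Some benchmarking (say with the scala files example) would be interesting... It is likely MUCH slower than awk; this is mostly an exercise to show that the stated requirements can be met. (Speed was not mentioned as a requirement.) (It also uses no external processes, which is good for speed, but it does its text processing in bash, which is bad for speed, at least compared to the C code in grep.)
Yes, you're right, it will be orders of magnitude slower than an awk script. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice. You use the word "pattern" in several places in your script. It is a highly ambiguous term and should generally be avoided, so please clarify each time you use it whether you are talking about a string, a regexp, a globbing pattern, or something else.
@EdMorton: This seems to be the only way to avoid standard tools (like awk) and non-standard ones (like perl). (I assume the only acceptable non-shell tool is grep, invoked only once.) (The question was whether it is "possible", not whether it is a good idea ;-)) "Pattern" was used for the search string: it can be either a regex or a plain string with glob matching, depending on the variant. (The regex version was done after I realised that bash has built-in regexes, and it is a simple modification of the first version.) (Detailed comments were added after the fork.)
@EdMorton: A lot (but not all) of the criticism of the read approach is about invoking external tools many times, which is avoided here. (It does need an empty IFS to prevent leading whitespace from being stripped. I know that with a single variable, whitespace between words is not affected by $IFS.)
Thanks for doing the benchmarking. Since the OP is searching for thousands of strings, could you retry with a large number of strings (e.g. at least 1000, all of which appear in the target file, some of which are substrings of each other, and some of which contain regexp metacharacters)? There are huge differences in how the various solutions perform when the number of strings being searched for (and matched) gets large. In addition, some of the solutions will fail if a given string is a substring of another string or contains RE characters, and those differences won't show up with just these 3 strings.
@EdMorton: That gets tricky: many of the solutions have different interfaces that are hard (and probably slow) to map. The CLI-based ones would probably need a non-shell method to call them in order to build an argv close to ARG_MAX (although, if they have proper exit codes, they can be combined arbitrarily with the && operator, with the downside of multiple scans of the file if the first part matches).
(The output would also need to be checked.) (The strings and extensions in the test set were mainly chosen so that some files match all strings relatively quickly, which disadvantages the approaches that scan the whole file multiple times whether it matches or not, and so that some files are unlikely to match all strings (the .sh ones), which means the whole file needs to be scanned. I started with the whole source tree but ran out of patience waiting on the shell runs.)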

This GNU awk script may work:

cat fileSearch.awk
re == "" {
    exit
}
{
    split($0, null, "\\<(" re "\\>)", b)
    for (i=1; i<=length(b); i++)
        gsub("\\<" b[i] "([|]|$)", "", re)
}
END {
    exit (re != "")
}

Then use it as:

if awk -v re='string1|string2|string3' -f fileSearch.awk file; then
    echo "all strings were found"
else
    echo "all strings were not found"
fi

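Alternatively, you can use this GNU grep solution with the PCRE option:

grep -qzP '(?s)(?=.*\bstring1\b)(?=.*\bstring2\b)(?=.*\bstring3\b)' file

Using -z we make grep read the complete file into a single string.
We use multiple lookahead assertions to assert that all the strings are present in the file.
The regex must use (?s), the DOTALL modifier, to make .* match across lines.

As per man grep: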

-z, --null-data
Treat input and output data as sequences of lines, each terminated by a
zero byte (the ASCII NUL character) instead of a newline.

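Related discussion

Recursive solution. Iterate over the files one by one. For each file, check whether it matches the first pattern and break early (-m1: stop on the first match). Only if it matched the first pattern, search for the second pattern, and so on: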

#!/bin/bash

# Keep the patterns in an array so that patterns containing spaces survive
patterns=( "$@" )

fileMatchesAllNames () {
    local file=$1
    if [[ $# -eq 1 ]]
    then
        echo "$file"
    else
        shift
        local pattern=$1
        shift
        grep -m1 -q -e "$pattern" "$file" && fileMatchesAllNames "$file" "$@"
    fi
}

for file in *
do
    test -f "$file" && fileMatchesAllNames "$file" "${patterns[@]}"
done

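Usage:

./allfilter.sh cat filter java
test.sh

Searches the current directory for the tokens "cat", "filter" and "java". It found them only in "test.sh".

So grep is invoked often in the worst case (the first N-1 patterns are found only in the last line of each file, and the N-th pattern is not found).

But with an informed ordering (rare matches first, early matches first) where possible, the solution should be reasonably fast, since many files are abandoned early because they don't match the first keyword, or accepted early because they match a keyword close to the top.

Example: you search a scala source file which contains tailrec (somewhat rarely used), mutable (rarely used, but if so, close to the top in import statements), main (rarely used, often not close to the top) and println (often used, unpredictable position). You would order them:

./allfilter.sh mutable tailrec main println

Performance:

ls *.scala | wc
89 89 2030

In 89 scala files, I have the keyword distribution: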

for keyword in mutable tailrec main println; do grep -m 1 $keyword *.scala | wc -l ; done
16
34
41
71

Searching them with a slightly modified version of the script, which allows a filepattern as the first argument, takes about 0.2s:

time ./allfilter.sh "*.scala" mutable tailrec main println
Filepattern: *.scala Patterns: mutable tailrec main println
aoc21-2017-12-22_00:16:21.scala
aoc25.scala
CondenseString.scala
Partition.scala
StringCondense.scala

real 0m0.216s
user 0m0.024s
sys 0m0.028s

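in close to 15 000 lines of code:

cat *.scala | wc
14913 81614 610893

Update:

After reading in the comments to the question that we might be talking about thousands of patterns, handing them over as arguments doesn't seem to be a clever idea. It is better to read them from a file and pass the filename as an argument, and maybe do the same for the list of files to filter: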

#!/bin/bash

filelist="$1"
patternfile="$2"
# One pattern per line (mapfile needs bash >= 4)
mapfile -t patterns < "$patternfile"

fileMatchesAllNames () {
    local file=$1
    if [[ $# -eq 1 ]]
    then
        echo "$file"
    else
        shift
        local pattern=$1
        shift
        grep -m1 -q -e "$pattern" "$file" && fileMatchesAllNames "$file" "$@"
    fi
}

# Diagnostics go to stderr so that stdout only carries the matching file names
echo -e "Filelist: $filelist\tPatterns: ${patterns[*]}" >&2
while IFS= read -r file
do
    test -f "$file" && fileMatchesAllNames "$file" "${patterns[@]}"
done < "$filelist"

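If the number and length of patterns/files exceeds what can be passed as arguments, the list of patterns could be split into several pattern files and processed in a loop (for example with 20 pattern files):

for i in {1..20}
do
    ./allfilter2.sh file.$i.lst pattern.$i.lst > file.$((i+1)).lst
done

Related discussion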

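You can

make use of the -o|--only-matching option of grep (which forces it to output only the matched parts of a matching line, with each such part on a separate output line),
then eliminate duplicate occurrences of matched strings with sort -u,
and finally check that the count of remaining lines equals the count of the input strings.

Demonstration: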

$ cat input
...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on

$ grep -o -F $'string1\nstring2\nstring3' input|sort -u|wc -l
3

$ grep -o -F $'string1\nstring3' input|sort -u|wc -l
2

$ grep -o -F $'string1\nstring2\nfoo' input|sort -u|wc -l
2

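One shortcoming of this solution (it fails the "partial matches should be OK" requirement) is that grep doesn't detect overlapping matches. For example, although the text abcd matches both abc and bcd, grep finds only one of them:

$ grep -o -F $'abc\nbcd' <<< abcd
abc

$ grep -o -F $'bcd\nabc' <<< abcd
abc

Note that this approach only works for fixed strings. It cannot be extended to regexes, because a single regex can match multiple different strings and we cannot track which match corresponds to which regex. The best you can do is store the matches in a temporary file, and then run grep multiple times using one regex at a time.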
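The temporary-file idea for regexes could be sketched like this. It is a rough illustration rather than a tested recipe: the name all_regexes_match is hypothetical, it relies on grep -o (GNU and BSD grep, not POSIX), and it inherits the overlapping-match weakness described above. Anchored regexes such as ^foo can also be fooled, because every saved fragment starts its own line in the temporary file.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): all_regexes_match FILE REGEX...
# Runs in a subshell, the ( ... ) body, so its variables do not leak.
all_regexes_match() (
    file=$1
    shift
    tmpdir=$(mktemp -d) || exit 2

    # One regex per line, so that a single grep pass can try all of them.
    printf '%s\n' "$@" > "$tmpdir/patterns"

    # Single pass over the possibly large input: keep only the matched fragments.
    grep -oE -f "$tmpdir/patterns" -- "$file" > "$tmpdir/matches"

    # One cheap grep per regex, run against the small fragment file.
    status=0
    for re in "$@"; do
        grep -qE -e "$re" -- "$tmpdir/matches" || { status=1; break; }
    done

    rm -rf "$tmpdir"
    exit "$status"
)
```

The large input is read once, and the per-regex greps only touch the fragment file, which is usually much smaller.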

The solution implemented as a bash script:

matchall:

#!/usr/bin/env bash

if [ $# -lt 2 ]
then
    echo "Usage: $(basename "$0") input_file string1 [string2 ...]"
    exit 1
fi

function find_all_matches()
(
    infile="$1"
    shift

    IFS=$'\n'
    newline_separated_list_of_strings="$*"
    grep -o -F "$newline_separated_list_of_strings" "$infile"
)

string_count=$(($# - 1))
matched_string_count=$(find_all_matches "$@"|sort -u|wc -l)

if [ "$matched_string_count" -eq "$string_count" ]
then
    echo "ALL strings matched"
    exit 0
else
    echo "Some strings DID NOT match"
    exit 1
fi

Demonstration:

$ ./matchall
Usage: matchall input_file string1 [string2 ...]

$ ./matchall input string1 string2 string3
ALL strings matched

$ ./matchall input string1 string2
ALL strings matched

$ ./matchall input string1 string2 foo
Some strings DID NOT match

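The easiest way for me to check whether the file has all three patterns is to get only the matched patterns, keep only the unique parts, and count the lines. Then you can check the result with a simple test condition: test 3 -eq $grep_lines.

grep_lines=$(grep -Eo 'string1|string2|string3' file | sort -u | wc -l)

Regarding your second question, I don't think it is possible to stop reading the file as soon as more than one pattern has been found. I've read the man page for grep and there are no options that could help you with that. You can only stop reading after a given number of matching lines with the option grep -m [number], which happens regardless of which patterns matched.

Pretty sure that a custom function is needed for that purpose.

Related discussion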

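This is an interesting problem, and there's nothing obvious in the grep man page to suggest an easy answer. There might be an insane regex that would do it, but it may be clearer with a straightforward chain of greps, even though that ends up scanning the file n times. At least the -q option makes it bail at the first match each time, and the && will short-circuit evaluation if one of the strings is not found.

$ grep -Fq string1 t && grep -Fq string2 t && grep -Fq string3 t
$ echo $?
0

$ grep -Fq string1 t && grep -Fq blah t && grep -Fq string3 t
$ echo $?
1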
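The chain of greps generalizes to any number of strings with a loop. A minimal sketch, assuming a hypothetical helper name all_strings_present (not from the answer): it still scans the file once per string, while -q and the early return keep the short-circuit behavior of the && chain.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): all_strings_present FILE STRING...
# Returns 0 only if every STRING occurs in FILE as a fixed string.
all_strings_present() {
    file=$1
    shift
    for s in "$@"; do
        # -F: fixed string, -q: quiet and stop at the first match,
        # -e: protects strings that start with a dash.
        grep -Fq -e "$s" -- "$file" || return 1
    done
    return 0
}
```

For example, all_strings_present t string1 string2 string3 would mirror the first command above. Note that file and s are ordinary shell variables here and are visible to the caller.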

Related discussion

perl -lne '%m = (%m, map {$_ => 1} m!\b(string1|string2|string3)\b!g); END { print scalar keys %m == 3 ? "Match": "No Match"}' file

Related discussion

Ignoring the "Is it possible to do it without ... or use a tool like awk or python?" requirement, you can do it with a Perl script:

(Use an appropriate shebang for your system or something like /bin/env perl)

#!/usr/bin/perl

use strict;
use warnings;
use Getopt::Std; # option parsing

my %opts;
my $filename;
my @patterns;
getopts('rf:',\%opts); # Allowing -f <filename> and -r to enable regex processing

if ($opts{'f'}) { # if -f is given
    $filename = $opts{'f'};
    @patterns = @ARGV[0 .. $#ARGV]; # Use everything else as patterns
} else { # Otherwise
    $filename = $ARGV[0]; # First parameter is filename
    @patterns = @ARGV[1 .. $#ARGV]; # Rest is patterns
}
my $use_re = $opts{'r'}; # Flag on whether patterns are regex or not

open(INF,'<',$filename) or die("Can't open input file '$filename'");

while (my $line = <INF>) {
    my @removal_list = (); # List of stuff that matched that we don't want to check again
    for (my $i=0;$i <= $#patterns;$i++) {
        my $pattern = $patterns[$i];
        if (($use_re && $line =~ /$pattern/) || # regex match
            (!$use_re && index($line,$pattern) >= 0)) { # or string search
            push(@removal_list,$i); # Mark to be removed
        }
    }
    # Now remove everything we found this time
    # We need to work backwards to keep us from messing
    # with the list while we're busy
    for (my $i=$#removal_list;$i >= 0;$i--) {
        splice(@patterns,$removal_list[$i],1);
    }
    if (scalar(@patterns) == 0) { # If we don't need to match anything anymore
        close(INF) or warn("Error closing '$filename'");
        exit(0); # We found everything
    }
}
# End of file

close(INF) or die("Error closing '$filename'");
exit(1); # If we reach this, we haven't matched everything

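If it is saved as matcher.pl, this will search for plain text strings:

./matcher filename string1 string2 string3 'complex string'

This will search for regular expressions:

./matcher -r filename regex1 'regex2' 'regex4'

(The filename can be specified with -f instead):

./matcher -f filename -r string1 string2 string3 'complex string'

It is limited to single-line matching patterns (because it processes the file linewise).

The performance, when calling it for lots of files from a shell script, is slower than awk (but search patterns can contain spaces, unlike the space-separated ones passed to awk with -v). If converted to a function and called from Perl code (with a file containing a list of files to search), it should be much faster than most awk implementations. (When called on several smallish files, the Perl startup time, i.e. parsing the script and so on, dominates the timing.)

It can be sped up significantly by hardcoding whether regular expressions are used or not, at the cost of flexibility. (See my benchmarks to see what effect removing Getopt::Std has.)

Related discussion

Maybe with GNU sed

cat match_word.sh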

sed -z '
  /\b'"$2"'/!bA
  /\b'"$3"'/!bA
  /\b'"$4"'/!bA
  /\b'"$5"'/!bA
  s/.*/0\n/
  q
  :A
  s/.*/1\n/
' "$1"

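and you call it like this:

./match_word.sh infile string1 string2 string3

It prints 0 if all matches are found, otherwise 1.

Here you can look for 4 strings.

If you need more, you can add lines like

/\b'"$x"'/!bA

Related discussion

For the sake of "solution completeness", you can use a different tool and avoid multiple greps and awk/sed or big (and probably slow) shell loops; such a tool is agrep.

agrep is actually a kind of egrep that also supports an AND operation between patterns, using ; as the pattern separator.

Like egrep and most of the well-known tools, agrep operates on records/lines, so we still need a way to treat the whole file as a single record. For this, agrep provides a -d option to set a custom record delimiter.

Some tests: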

$ cat file6
str4
str1
str2
str3
str1 str2
str1 str2 str3
str3 str1 str2
str2 str3

$ agrep -d '$$\n' 'str3;str2;str1;str4' file6;echo $?
str4
str1
str2
str3
str1 str2
str1 str2 str3
str3 str1 str2
str2 str3
0

$ agrep -d '$$\n' 'str3;str2;str1;str4;str5' file6;echo $?
1

$ agrep -p 'str3;str2;str1' file6 #-p prints lines containing all three patterns in any position
str1 str2 str3
str3 str1 str2

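No tool is perfect, and agrep has some limitations too: you can't use a regex/pattern longer than 32 characters, and some options are not available when used with regexps. All of these are explained in the agrep man page.

One more Perl variant: as soon as all given strings have matched, even if the file has only been half read, processing stops and the result is printed.

> perl -lne ' /\b(string1|string2|string3)\b/ and $m{$1}++; last if keys %m == 3; END { print keys %m == 3 ? "Match" : "No Match"}' all_match.txt
Match
> perl -lne ' /\b(string1|string2|stringx)\b/ and $m{$1}++; last if keys %m == 3; END { print keys %m == 3 ? "Match" : "No Match"}' all_match.txt
No Match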

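Assuming all the strings you want to check are in a file strings.txt, and the file you want to check in is input.txt, the following one-liner will do:

Updated the answer based on comments:

$ diff <( sort -u strings.txt ) <( grep -o -f strings.txt input.txt | sort -u )

Explanation:

Use grep's -o option to output only the strings you are interested in. This gives all the strings that are present in the file input.txt. Then use diff to get the strings that were not found. If all the strings were found, diff prints nothing. Alternatively, just check the exit code of diff.

What it does not do:

Exit as soon as all matches are found.
Extend to regexes.
Handle overlapping matches.

What it does do:

Find all matches.
Make a single call to grep.
Avoid awk and python.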
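If the goal is a clean list of the strings that were not found, the same idea can be sketched with comm instead of diff. This is a swapped-in variation, not the answer's own command: the helper name missing_strings is hypothetical, -F is added so the strings are treated literally, and temporary files replace process substitution so that it also runs under plain sh.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): missing_strings STRINGS_FILE INPUT_FILE
# Prints every string from STRINGS_FILE that grep -o did not report for INPUT_FILE.
# Runs in a subshell, the ( ... ) body, so LC_ALL and the variables do not leak.
missing_strings() (
    tmpdir=$(mktemp -d) || exit 2
    LC_ALL=C
    export LC_ALL            # sort and comm must agree on the collation order

    sort -u -- "$1" > "$tmpdir/wanted"
    grep -oF -f "$1" -- "$2" | sort -u > "$tmpdir/found"

    # -23 suppresses the lines unique to "found" and the lines common to both,
    # leaving only the lines that appear in "wanted" alone.
    comm -23 "$tmpdir/wanted" "$tmpdir/found"

    rm -rf "$tmpdir"
)
```

An empty output means every string was found. The limitations listed above (no early exit, no overlapping matches) still apply.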

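Related discussion

In python, using the fileinput module allows the files to be specified on the command line, or the text to be read line by line from stdin. You could hard-code the strings into a python list.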

# Strings to match, must be valid regular expression patterns
# or be escaped when compiled into regex below.
strings = (
r'string1',
r'string2',
r'string3',
)

Or read the strings from another file

import re
from fileinput import input, filename, nextfile, isfirstline

for line in input():
    if isfirstline():
        regexs = [re.compile(s) for s in strings]  # new file, reload all strings

    # keep only strings that have not been seen in this file
    # (search, not match, so that the string may occur anywhere in the line)
    regexs = [rx for rx in regexs if not rx.search(line)]

    if not regexs:  # found all strings
        print(filename())
        nextfile()

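Many of these answers are fine as far as they go.

But if performance is an issue, which is certainly possible if the input is large and you have many thousands of patterns, then you'll get a big speedup from a tool like lex or flex, which generates a true deterministic finite automaton as a recognizer rather than calling a regex interpreter once per pattern.

The finite automaton will execute a few machine instructions per input character regardless of the number of patterns.

A no-frills flex solution: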

%{
void match(int);
%}
%option noyywrap

%%

"abc"       match(0);
"ABC"       match(1);
[0-9]+      match(2);
 /* Continue adding regex and exact string patterns... */

[ \t\n]     /* Do nothing with whitespace. */
.           /* Do nothing with unknown characters. */

%%

// Total number of patterns.
#define N_PATTERNS 3

int n_matches = 0;
int counts[10000];

void match(int n) {
  if (counts[n]++ == 0 && ++n_matches == N_PATTERNS) {
    printf("All matched!\n");
    exit(0);
  }
}

int main(void) {
  yyin = stdin;
  yylex();
  printf("Only matched %d patterns.\n", n_matches);
  return 1;
}

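A downside is that you'd have to build this for every given set of patterns. That's not too bad:

flex matcher.y
gcc -O lex.yy.c -o matcher

Now run it:

./matcher < input.txt

Related discussion

I'd bet you wouldn't get a performance improvement over an awk solution, since awk is highly optimized for text processing and often outperforms compiled languages like C for that task. Remember that we're looking for strings, by the way, not regexps.
@EdMorton: For non-regex, this C version based on an mmaped file runs my benchmark in about 1.5 seconds... (It uses argv for the search strings, not a file.) (Extending the simple C case to regex is not that easy...)
It's also important to search for 1000+ strings, as the OP said, to be realistic. Searching for, say, 3 strings isn't really useful for comparing timings.
@EdMorton awk is a fine tool, but it uses a regex interpreter. On 5000 patterns, it will keep trying to match each of the 5000 patterns in sequence for each input character. Flex will compile all 5000 patterns into a single DFA that executes a few instructions per input character. This is why compiler scanners, where performance affects every program compiled during the scanner's lifetime, are implemented with DFAs rather than regex engines.
In this question we aren't using regexps, we're using strings. Awk's regexp engine isn't used, and even if it were, a single regexp wouldn't be used, and the algorithm would reduce the number of regexps each time one was found.
@EdMorton The question says "we could be using regexes in place of strings". My point is the same whether the patterns are simple strings or use alternation and Kleene star. For each character, the interpreter will try each pattern in sequence until it finds a match. The DFA will execute the equivalent of a C switch branch on each character and move on to the next one.
I took that to mean the strings could contain regexp metacharacters but should be treated as literal strings regardless. Either way, that's not what a reasonable awk solution to the problem of finding multiple strings in a file would do: it would simply loop over an array of the strings and delete each one that is found in the input file, until the array is empty or it reaches the end of the file. If you think your approach would be faster, then try it and let us know the result.

For plain speed, with no external tool limitations and no regexes, this (crude) C version does a decent job. (Possibly Linux only, although it should work on all Unix-like systems with mmap.)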

#include <sys/mman.h>
#include <sys/stat.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* https://stackoverflow.com/a/8584708/1837991 */
/* static: a plain C99 "inline" definition may fail to link when the call is not inlined */
static inline char *sstrstr(char *haystack, char *needle, size_t length)
{
    size_t needle_length = strlen(needle);
    size_t i;
    for (i = 0; i < length; i++) {
        if (i + needle_length > length) {
            return NULL;
        }
        if (strncmp(&haystack[i], needle, needle_length) == 0) {
            return &haystack[i];
        }
    }
    return NULL;
}

int matcher(char * filename, char ** strings, unsigned int str_count)
{
    int fd;
    struct stat sb;
    char *addr;
    unsigned int i = 0; /* Used to keep us from running off the end of strings into SIGSEGV */

    fd = open(filename, O_RDONLY);
    if (fd == -1) {
        fprintf(stderr,"Error '%s' with open on '%s'\n",strerror(errno),filename);
        return 2;
    }

    if (fstat(fd, &sb) == -1) { /* To obtain file size */
        fprintf(stderr,"Error '%s' with fstat on '%s'\n",strerror(errno),filename);
        close(fd);
        return 2;
    }

    if (sb.st_size <= 0) { /* zero byte file */
        close(fd);
        return 1; /* 0 byte files don't match anything */
    }

    /* mmap the file. */
    addr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) {
        fprintf(stderr,"Error '%s' with mmap on '%s'\n",strerror(errno),filename);
        close(fd);
        return 2;
    }

    while (i++ < str_count) {
        char * found = sstrstr(addr,strings[0],sb.st_size);
        if (found == NULL) { /* If we haven't found this string, we can't find all of them */
            munmap(addr, sb.st_size);
            close(fd);
            return 1; /* so give the user an error */
        }
        strings++;
    }
    munmap(addr, sb.st_size);
    close(fd);
    return 0; /* if we get here, we found everything */
}

int main(int argc, char *argv[])
{
    char *filename;
    char **strings;
    unsigned int str_count;
    if (argc < 3) { /* Lets count parameters at least... */
        fprintf(stderr,"%i is not enough parameters!\n",argc);
        return 2;
    }
    filename = argv[1]; /* First parameter is filename */
    strings = argv + 2; /* Search strings start from 3rd parameter */
    str_count = argc - 2; /* strings are two ($0 and filename) less than argc */

    return matcher(filename,strings,str_count);
}

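Compile it with:

gcc matcher.c -o matcher

Run it with:

./matcher filename needle1 needle2 needle3

Credits:

Uses sstrstr
File handling mostly stolen from the mmap man page

Notes:

It will scan through the parts of the file preceding the matching strings multiple times, but it only opens the file once.
The entire file could end up loaded into memory, especially if a string does not match; that is for the OS to decide.
Regex support could probably be added by using the POSIX regex library. (Performance would likely be slightly better than grep: it should be based on the same library, and opening the file only once to search for multiple regexes reduces overhead.)
Files containing nulls should work, but search strings containing them will not...
All characters other than null should be searchable (\n, etc.)

The following python script should do the trick. It does, in a way, call the equivalent of grep (re.search) multiple times for each line, i.e. it searches for each pattern in each line, but since you are not forking a process each time, it should be much more efficient. It also removes the patterns which have already been found and stops once all of them have been found.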

#!/usr/bin/env python3

import re

# the file to search
filename = '/path/to/your/file.txt'

# list of patterns -- can be read from a file or command line
# depending on the count
patterns = [r'py.*$', r'\s+open\s+', r'^import\s+']
patterns = [re.compile(p) for p in patterns]

with open(filename) as f:
    for line in f:
        # keep only the patterns that did not match this line
        patterns = [p for p in patterns if p.search(line) is None]

        # stop if no more patterns are left
        if not patterns:
            break

# print the patterns which were not found
for p in patterns:
    print(p.pattern)

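If you are dealing with plain (non-regex) strings, you can add a separate check for them (string in line), which will be slightly more efficient.

Does that solve your problem?

$ cat allstringsfile | tr '\n' ' ' | awk -f awkpattern1

where allstringsfile is your text file, as in the original question. awkpattern1 contains the string patterns, combined with &&:

$ cat awkpattern1
/string1/ && /string2/ && /string3/

Related discussion

I didn't see a simple counter among the answers, so here is a counter-oriented solution using awk that stops as soon as all matches are satisfied: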

/string1/ { a = 1 }
/string2/ { b = 1 }
/string3/ { c = 1 }
{
    if (c + a + b == 3) {
        print "Found!";
        exit;
    }
}

A generic script

To extend the usage through shell arguments:

#! /bin/sh
awk -v vars="$*" '
BEGIN { argc = split(vars, args) }   # count the patterns after splitting
{
    for (arg in args) {
        if (!temp[arg] && $0 ~ args[arg]) {
            inc++
            temp[arg] = 1
        }
    }

    if (inc == argc) {
        found = 1
        print "Found!"
        exit
    }
}
END { exit !found }   # exit status 0 only if every pattern was seen
' filename

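Usage (in which you can pass regular expressions):

./script "str1?" "(wo)?men" str3

or to apply a string of patterns:

./script "str1? (wo)?men str3"

Related discussion

@GertvandenBerg, thank you, done.
I think the update now takes the right exit status into account.
The OP told us she has thousands of strings to search for, so hard-coding them as in your first script isn't practical (plus it's searching for regexps, not strings). Your second script is similar to mine in that it uses a counter, but unlike mine it doesn't reduce the array of strings on each loop; instead it creates a duplicate array of the strings, so it will be slower than mine and use twice as much memory. I wasn't the one who downvoted.
Where am I creating a duplicate array of strings? And I don't understand *it's searching for regexps, not strings*, since whatever the input is, a pattern or a literal string, it doesn't get confused. @EdMorton
In the first script, /string/ defines a regexp literal with the contents string and then does a regexp comparison of it against $0, i.e. it searches for a regexp, not a string. In the second script you have an array args[] that contains all of the strings, and every time you match a string in the input you add it to the array temp[], so when all strings have been found in the input you end up with temp[] being a copy of args[].
Oh wait, I misread your second script. I assumed you were using args[] with the arg strings as the indices, so that you could do a simple loop over the indices and a string comparison like I do, but you're not: you're storing the strings as the array contents, looping over the numeric indices, and then dereferencing the array each time and doing a regexp comparison! So arg in your second script isn't actually an arg (one of the strings passed in), it's an index, and the arg/string is actually at args[arg]. So you're not creating a duplicate in temp[], but you have other issues.
I guess you may be right about the condition... but let's talk about strings and regexps. I meant this as a single solution, whereas you provided two separate ones, and I'm not sure how well it does on strings versus regexps. Maybe a benchmark is needed? @EdMorton
Try both solutions when the string being searched for is .*. If the string .* is present in the input file, mine will find that exact string (i.e. a period followed by an asterisk), while yours will match on the first line regardless of the input file contents, since .* will be treated as the regexp "zero or more repetitions of any character". With respect to efficiency, I'm not talking about the efficiency of regexp versus string comparisons: I delete each string from the array when it matches, so the array gets smaller and the loop gets faster with each match. Feel free to do timing tests if you like.
Doesn't that depend on the OP? You're right about .*, but if the OP has no quantifiers or regex metacharacters in his/her arguments, then what's the problem? Also, since it reads one line at a time, even if .* prints a wrong substring, it is no different performance-wise from matching a literal string like s. Even if the whole input is a single line, \n is the default RS. @EdMorton
Yes, if the OP chooses not to use any regexp metacharacters then regexp metacharacters won't produce any false matches, but why impose that restriction when you can just as easily do a string comparison? The OP said she wants to find strings, so why say "OK, but you can't simply look for "foo.c" in a list of file names", etc.? Again, when I talk about performance, I'm not talking about the difference between string and regexp comparisons; I'm talking about reducing the number of values looped over each time one is found, rather than keeping the number of iterations constant.
To search without using regex, index can be used (see your local awk man page) instead of ~. Providing two variants is one option; an option to choose between index and ~ is another.
@GertvandenBerg Yes, you could use index() to do a string rather than a regexp comparison, and you could delete matched strings from args rather than adding flags to temp to improve performance, and then you'd end up with my answer.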