Performance: how to get the biggest number in a file?

I want to get the maximum number in a file, where numbers are integers that can occur anywhere in the file.

I thought about doing the following:

grep -o '[-0-9]*' myfile | sort -rn | head -1

This uses grep to get all the integers from the file, one per line. Then sort orders them and head prints the first one.

But then I thought sort -r might cause some overhead, so I went for:

grep -o '[-0-9]*' myfile | sort -n | tail -1

To see which is fastest, I created a huge file with some random data, like this:

$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
$ for i in {1..50000}; do cat a >> myfile ; done

The file contains 150K lines.

Now I compare the performance on GNU bash version 4.2, and sys is much smaller for sort -rn:

$ time grep -o '[-0-9]*' myfile | sort -n | tail -1
42342234

real 0m1.823s
user 0m1.865s
sys 0m0.045s

$ cp myfile myfile2 #to prevent using cached info
$ time grep -o '[-0-9]*' myfile2 | sort -rn | head -1
42342234

real 0m1.864s
user 0m1.926s
sys 0m0.027s

So I have two questions here:

Which is better, sort -n | tail -1 or sort -rn | head -1?
Is there a faster way to get the maximum integer in a given file?

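Testing the solutions

So I ran all the commands and compared the time each takes to get the value. To make the results more reliable, I created a bigger file, 10 times the size of the one I mentioned in the question: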

$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=500000;i++) print s}' > myfile
$ wc myfile
1500000 13000000 62000000 myfile

Benchmark, from which we can see that hek2mgl's solution is the fastest:

$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' myfile
42342234

real 0m3.979s
user 0m3.970s
sys 0m0.007s
$ time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' myfile
42342234

real 0m2.203s
user 0m2.196s
sys 0m0.006s
$ time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
42342234

real 0m0.926s
user 0m0.848s
sys 0m0.077s
$ time tr ' ' '\n' < myfile | sort -rn | head -1
42342234

real 0m11.089s
user 0m11.049s
sys 0m0.086s
$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F} END {print $max' myfile

real 0m6.166s
user 0m6.146s
sys 0m0.011s

Related discussion

I am surprised by awk's speed here. perl is usually pretty speedy, but:

$ for ((i=0; i<1000000; i++)); do echo $RANDOM; done > rand

$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' rand
32767

real 0m0.890s
user 0m0.887s
sys 0m0.003s

$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F} END {print $m' rand
32767

real 0m1.110s
user 0m1.107s
sys 0m0.002s

I think I've found a winner: with perl, slurp the file as a single string, find the (possibly negative) integers, and take the max:

$ time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' rand
32767

real 0m0.565s
user 0m0.539s
sys 0m0.025s

It takes a little more "sys" time, but less real time.

It also works with a file containing only negative numbers:

$ cat file
hello -42 world
$ perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
-42

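Related discussion

In awk you can say:

awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file

Explanation

In my experience awk is the fastest text-processing language for most tasks, and the only things I have seen of comparable speed (on Linux systems) are programs written in C/C++.

In the code above, using a minimal set of functions and commands speeds up the execution.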

for(i=1;i<=NF;i++) - Loops through fields on the line. Using the default FS/RS and looping
                     this way is usually faster than using custom ones as awk is optimised
                     to use the default

if(int($i))        - Checks if the field is not equal to zero and as strings are set to zero
                     by int, does not execute the next block if the field is a string. I
                     believe this is the quickest way to perform this check

{a[$i]=$i}         - Sets an array variable with the number as key and value. This means
                     there will only be as many array variables as there are numbers in
                     the file and will hopefully be quicker than a comparison of every
                     number

END{x=asort(a)     - At the end of the file, use asort on the array and store the
                     size of the array in x.

print a[x]         - Print the last element in the array.

Benchmarks

Mine:

time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file

Takes

real 0m0.434s
user 0m0.357s
sys 0m0.008s

hek2mgl's:

awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]' file

Takes

real 0m1.256s
user 0m1.134s
sys 0m0.019s

For those wondering why it is faster: it uses the default FS and RS, which awk is optimised for.

Changing

awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]'

to

awk '{for(i=1;i<=NF;i++)m=(m<$i && int($i))?$i:m}END{print m}'

gives the time

real 0m0.574s
user 0m0.497s
sys 0m0.011s

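Which is still a little slower than my command.

I believe the slight difference that remains is due to asort() only working on around 6 numbers, as each is saved only once in the array.

In comparison, the other command performs a comparison on every single number in the file, which is computationally more expensive.

I think they would be around the same speed if all the numbers in the file were unique.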
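One way to check that hypothesis, offered only as a sketch and not something that was benchmarked here, is to build an input in which every number is distinct and then time both commands on it. The helper name make_unique_file and the filler words are made up for illustration.

```shell
# Hypothetical helper: write N lines to a file, each holding one distinct
# integer between two words, so the array-based command cannot benefit
# from duplicate numbers.
make_unique_file() {
    count=$1
    out=$2
    awk -v n="$count" 'BEGIN { for (i = 1; i <= n; i++) print "hello", i, "world" }' > "$out"
}

# Example: one million distinct numbers written to ./unique
# make_unique_file 1000000 unique
```

Both awk commands can then be run under time against the generated file; no timings are claimed for that comparison.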

Tom Fenech's:

time awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' myfile

real 0m0.716s
user 0m0.612s
sys 0m0.013s

The drawback of this approach is that max will be blank if all the numbers are less than zero.
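A possible way around that, as a minimal sketch: track explicitly whether any number has been seen, instead of comparing against an unset max. This assumes an awk that accepts a regular expression as RS (gawk and mawk do); the function name max_rs and the seen flag are illustrative and not part of the original answer.

```shell
# Sketch: same record-splitting idea, but with a "seen" flag so that
# all-negative input and a maximum of 0 are both reported correctly.
max_rs() {
    awk -v RS='[^-0-9]+' '
        # Only accept records that are entirely an optionally signed integer;
        # this skips empty records and things like "-" or "4-5".
        /^-?[0-9]+$/ && (!seen || $0 + 0 > max) { max = $0 + 0; seen = 1 }
        END { if (seen) print max }
    ' "$@"
}
```

For example, printf 'hello -42 world\n' | max_rs prints -42, and nothing is printed when the input contains no integers at all.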

Glenn Jackman's:

time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' file

real 0m1.492s
user 0m1.258s
sys 0m0.022s

and

time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file

real 0m0.790s
user 0m0.686s
sys 0m0.034s

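The good thing about perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' is that it is the only answer here that works if the largest number in the file is 0, and it also works if all the numbers are negative.

Notes

All times are the average of 3 test runs.

Related discussion

I suspect this will be fastest: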

$ tr ' ' '\n' < file | sort -rn | head -1
42342234

Third run:

$ time tr ' ' '\n' < file | sort -rn | head -1
42342234
real 0m0.078s
user 0m0.000s
sys 0m0.076s

By the way, don't write shell loops to manipulate text, even just to create sample input files:

$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfile

real 0m0.109s
user 0m0.031s
sys 0m0.061s

$ wc -l myfile
150000 myfile

Compared with the shell loop suggested in the question:

$ time for i in {1..50000}; do cat a >> myfile2 ; done

real 26m38.771s
user 1m44.765s
sys 17m9.837s

$ wc -l myfile2
150000 myfile2

If we want something more robust that handles input files containing digits inside strings that are not integers, we need something like this:

$ cat b
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
73 starts a line
avoid these: 3.14 or 4-5 or $15 or 2:30 or 05/12/2015

$ grep -o -E '(^| )[-]?[0-9]+( |$)' b | sort -rn
42342234
3624
123
73
-23

$ time awk -v s="$(cat b)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfileB
real 0m0.109s
user 0m0.000s
sys 0m0.076s

$ wc -l myfileB
250000 myfileB

$ time grep -o -E '(^| )-?[0-9]+( |$)' myfileB | sort -rn | head -1 | tr -d ' '
42342234
real 0m2.480s
user 0m2.509s
sys 0m0.108s

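Note that the input file has more lines than the original one, and with this input the robust grep solution above is actually faster than the original command I posted at the start: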

$ time tr ' ' '\n' < myfileB | sort -rn | head -1
42342234
real 0m4.836s
user 0m4.445s
sys 0m0.277s
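One detail in the robust grep listing for file b: 234245 does not appear in its output. A likely reason is that grep -o consumes the space after -23 as part of that match, so the next number has no leading delimiter left to match against. A sketch that avoids this, assuming whole whitespace-separated tokens are what should count as integers, is to put each token on its own line first and then use grep -x. The function names whole_ints and max_whole_token are made up for illustration.

```shell
# Sketch: list every whitespace-separated token that is entirely an
# optionally signed integer, one per line.
whole_ints() {
    tr -s '[:space:]' '\n' < "$1" |   # one token per line
        grep -xE -e '-?[0-9]+'        # -x: the whole line must match
}

# Largest whole-token integer in the file.
max_whole_token() {
    whole_ints "$1" | sort -rn | head -1
}
```

With the sample file b this keeps 234245 as well as 73 and -23, while still skipping tokens such as 3.14, 4-5, $15, 2:30 and 05/12/2015. No timing is claimed for it.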

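Related discussion

I'm sure a C implementation optimized with assembler would be the fastest. I could also think of a program that splits the file into multiple chunks, maps each chunk onto a single processor core, and afterwards just takes the maximum of the nproc remaining numbers.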
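The chunking idea can be sketched with existing tools. This is only an illustration under assumptions: it relies on GNU split's -n l/N option and on mktemp, the function name parallel_max and the default of 4 chunks are made up here, and no timing is claimed for it.

```shell
# Sketch: split the input into line-aligned chunks, scan each chunk in a
# background job, then reduce the per-chunk maxima to a single value.
parallel_max() {
    file=$1
    nchunks=${2:-4}                      # assumed default of 4 chunks
    tmp=$(mktemp -d) || return 1

    # GNU split: "l/N" produces N chunks without breaking lines.
    split -n "l/$nchunks" "$file" "$tmp/chunk."

    for chunk in "$tmp"/chunk.*; do
        # One grep | sort | tail pipeline per chunk, all running concurrently.
        grep -oE -e '-?[0-9]+' "$chunk" | sort -n | tail -1 > "$chunk.max" &
    done
    wait                                 # block until every chunk is done

    # Final reduction over at most nchunks numbers.
    cat "$tmp"/chunk.*.max | sort -n | tail -1
    rm -rf "$tmp"
}
```

Usage would be something like parallel_max myfile "$(nproc)". Whether this beats a single awk pass depends on file size and on the cost of writing the chunks, so it would need measuring.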

Just using existing command-line tools, have you tried awk?

time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile

Compared with the perl command in the accepted answer, it does the job in about 50% of the time:

time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' myfile
cp myfile myfile2

time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile2

Gives me:

42342234

real 0m0.360s
user 0m0.340s
sys 0m0.020s
42342234

real 0m0.193s <-- Good job awk! You are the winner.
user 0m0.185s
sys 0m0.008s

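Related discussion

This is a great answer, and I'm surprised: I would not have thought awk would do this faster than standard commands such as sort and tail/head (maybe grep is taking most of the time).
I would say it is the sort command that consumes most of the time, at least about as much as grep. It simply takes more iterations than awk.
I just ran time grep -o '[-0-9]*' myfile &>/dev/null and it takes real 0m1.534s, user 0m1.530s, sys 0m0.001s!
@fedorqui Why are you surprised? What awk does is O(N), while what sort does is O(N log(N)).
@lcd047 But sort does not operate on the same number of lines. However, you are right, it is the number of iterations that makes the difference here.
I don't understand your record separator: spaces or asterisks? Did you want RS='[^[:digit:]-]+'?
@glennjackman That was basically a typo. It should be [[:space:]]+. But at the moment I wrote it, I still thought the file contained only numbers separated by whitespace. The error didn't pop up because int("foobar") evaluates to 0. Thanks for the hint!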
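Thanks a lot for your interesting approach! You are a master at handling RS, and I have a lot to learn from you :) In the end, Glenn's is the fastest solution.
You're welcome @fedorqui. It was fun! And.. (drum roll) we have a new champion!! awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='[0-9]+' myfile. I learned FPAT from EdMorten, the master of all things awk :)
Actually the FPAT is not correct, since it doesn't handle negative numbers. I will refine it after work. I will also add a benchmark comparing it with Glenn's answer. (I already did. It takes about 50% of the perl solution's time.)
@fedorqui Updated it. We have a new champion! :)
Heh, my timings: perl 0.657s real, gawk 0.981s real
I can reproduce that result on my desktop again and again. I'm at home now. Let me try it here.
@glennjackman At home: perl 0.945, awk 0.414. Did you use the same myfile as in the question?
No, I used a file of one million random positive integers.
The difference is interesting. But we should both use the same file for benchmarking. (Or give current timings for the different myfile.) Note that I will be back in about 2.5 hours.
@glennjackman Your results are really misleading, because you changed the input data and with it the problem itself. With your input data, i.e. the rand file, I would use awk '{m=(m<$0)?$0:m}END{print m}', which takes ~33% of the perl solution's time. (0.910 perl, 0.342 awk).. perl pays a higher price for this, which in the end is not surprising since perl does more. By the way, sort -n rand | tail -n1 took 0.275
awk 'm<$0{m=$0}END{print m}' rand is a bit faster than the awk command in my last comment. It took 0.313.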