awk: how to remove a pattern from many files

This is my file:

...


<script type="text/javascript"
src="../src/goog/ga_body.js">

</body>
</html>
...

How do I delete everything from the start marker through the end marker, including both marker lines? In effect, this:

<script type="text/javascript"
src="../src/goog/ga_body.js">

would be gone. Nothing would be left in its place; that is, those 4 lines would be replaced by nothing.

I'm thinking of doing this in bash, so sed and awk are probably my best options, although python might be better.

Edit 1

This is something I wrote a while ago. It may be very bad coding, but I'll adapt it for this task as find2PatternsAndDeleteTextInBetween.sh:

# Here I want to find 2 patterns and delete what's in between
# this example works

# these are the 2 patterns I want to find: Start and End
# have to use some escape characters here for this to show properly
# have to use
# for it to appear in this format
#
# text would go here
#>

#b=""

#b2=""

#p1="PATTERN-1"
#p2="PATTERN-2"
p1=""
p2=""
fname="*.html"
num_of_files_pattern1=ls #grep $p1 fname

echo "fname(s) to apply the sed to:"
echo $fname
echo "num_of_files_pattern1 is:"
echo $num_of_files_pattern1

echo "Pattern1 is equal to:"
echo $p1

echo "Pattern2 is equal to:"
echo $p2

#this is current dir where the script is
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
echo "DIR is equal to:"
echo $DIR

#cd to the dir where I want to copy the files to:
cd "$DIR"

# this will find the pattern <\head> in all the .html files and place "This should appear before the closing head tag" this before it
# it will also make a backup with .bak extension
#sed -i.bak '/<\\head>/i\This should appear before the closing head tag' *.html

echo "sed on the file"
# this does the head part
#sed '/PATTERN-1/,/PATTERN-2/d' *.txt # this works
#sed "/$p1/,/$p2/d" *.txt # this works
#sed "/$p1/,/$p2/d" $fname # this works
sed -i.bak "/$p1/,/$p2/d" $fname # this works

Edit 2

This is what I ended up with, but there is a more robust answer below:

# ------------------------------------------------------------------
# [author] find2PatternsAndDeleteTextInBetween.sh
# Description
# Here I want to find 2 patterns and delete what's in between
# this example works
#
# EXAMPLE:
# this is the 2 patterns I want to find Start and End
# 
# text would go here
# >
#
# ------------------------------------------------------------------
p1=""
p2=""
fname=".html"
echo "fname(s) to apply the sed to:"
echo *"$fname"
echo -e "\n"
echo "Pattern1 is equal to:"
echo -e "$p1\n"
echo "Pattern2 is equal to:"
echo -e "$p2\n"
echo -e "PWD is: $PWD\n"
echo "sed on the file"
#sed '/PATTERN-1/,/PATTERN-2/d' *.txt # this works
#sed "/$p1/,/$p2/d" *.txt # this works
#sed "/$p1/,/$p2/d" $fname # this works
sed -i.bak "/$p1/,/$p2/d" *"$fname" # this works

Something to consider:

$ awk '//{f=!f;next} !f' file
...

</body>
</html>
...
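
For the flag-toggle idiom to work, the regex has to match the two marker lines and nothing else. The following is a minimal sketch of that idea on a throwaway file. The marker text `<!--START-->` / `<!--END-->` and the sample contents are assumptions made for illustration; only the `<!--START` and `<!--END` prefixes appear in the sed commands in this thread.

```shell
# Build a throwaway sample file (contents are illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<p>keep me</p>
<!--START-->
<script type="text/javascript"
src="../src/goog/ga_body.js">
<!--END-->
</body>
</html>
EOF

# f flips on every marker line; the marker line itself is skipped by `next`.
# The bare pattern `!f` prints a line only while f is 0 (outside the block).
result=$(awk '/<!--(START|END)/{f=!f;next} !f' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Note that awk only prints the filtered text; unlike `sed -i`, it does not modify the file, so the output would have to be redirected to a new file to keep it.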

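sed for this task:

$ sed -i'.bak' '/<!--START/,/<!--END/d' file

If other lines contain similar tags, include more of the pattern so that only the intended lines match.

For multiple files, e.g. file1 .. file4:

$ for f in file{1..4}; do sed -i'.bak' '/<!--START/,/<!--END/d' "$f"; done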
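
One way to check the behaviour before touching real files is to run the same sed call against scratch copies. In this sketch the scratch directory, the file contents, and the full marker text are made up for illustration.

```shell
# Scratch directory and file contents are illustrative assumptions.
workdir=$(mktemp -d)
for n in 1 2 3 4; do
  printf '%s\n' 'top' '<!--START-->' 'junk' '<!--END-->' 'bottom' > "$workdir/file$n"
done

# Apply the range delete to each scratch file, keeping a .bak backup.
for f in "$workdir"/file[1-4]; do
  sed -i'.bak' '/<!--START/,/<!--END/d' "$f"
done

edited=$(cat "$workdir/file3")                   # what remains after the edit
backup_lines=$(grep -c '' "$workdir/file3.bak")  # line count of the untouched backup
printf 'edited:\n%s\nbackup lines: %s\n' "$edited" "$backup_lines"
rm -rf "$workdir"
```

The edited file keeps only the lines outside the range, while the `.bak` copy still holds all five original lines.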

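Judging from the script in your question, you already know how to use sed to delete the range of interest from a single file (sed -i.bak "/$p1/,/$p2/d" $fname), but are looking for a robust way to process multiple files in a script (assuming bash).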

#!/usr/bin/env bash

# cd to the dir. in which this script is located.
# CAVEAT: Assumes that the script wasn't invoked through a *symlink*
# located in a different dir.
cd -- "$(dirname -- "$BASH_SOURCE")" || exit

fpattern='*.html' # specify source-file globbing pattern
shopt -s nullglob # make sure that globbing expands to nothing if nothing matches
fnames=( $fpattern ) # expand to matching files and store in array
num_of_files_matching_pattern=${#fnames[@]} # count matching files
(( num_of_files_matching_pattern > 0 )) || exit # abort, if no files match

printf '%s\n%s\n' "Running from:" "$PWD"
printf '%s\n%s\n' "Pattern matching the files to process:" "$fpattern"
printf '%s\n%s\n' "# of matching files:" "$num_of_files_matching_pattern"

# Determine the range-endpoint-identifier-line regular expressions.
# CAVEAT: Make sure you escape any regular-expression metacharacters you want
# to be treated as *literals*.
p1='^$'
p2='^$'

# Remove the range identified by its endpoints from all matching input files
# and save the original files with extension '.bak'
sed -i'.bak' "/$p1/,/$p2/d" "${fnames[@]}" || exit
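
The caveat about escaping metacharacters can be handled mechanically. The sketch below is one possible approach, not part of the script above: `escape_bre` is a hypothetical helper name, and the marker strings are invented for the demonstration. It backslash-escapes the characters that are special in a basic regular expression, plus the `/` that sed uses as the address delimiter.

```shell
# Hypothetical helper: escape BRE metacharacters and the `/` delimiter
# so that a literal string can be used safely inside a sed address.
escape_bre() {
  printf '%s\n' "$1" | sed 's/[][\.*^$/]/\\&/g'
}

# Illustrative marker strings containing a `.` that must match literally.
p1=$(escape_bre '<!--START v1.0-->')
p2=$(escape_bre '<!--END v1.0-->')

# Delete the range from a small inline sample instead of real files.
result=$(printf '%s\n' 'keep' '<!--START v1.0-->' 'drop' '<!--END v1.0-->' 'keep too' |
  sed "/$p1/,/$p2/d")
printf '%s\n' "$result"
```

Escaping the endpoints this way means the patterns match the marker text exactly, rather than whatever the unescaped metacharacters would otherwise allow.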

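As an aside: I suggest not using the suffix .sh in script filenames:

The shebang line in the file is sufficient to tell the system which shell/interpreter to pass the script to.
If you don't specify a suffix, you remain free to change the implementation later (e.g., to python) without breaking existing programs that rely on the script.
In the case at hand, assuming that bash is actually used, .sh would be misleading, because it suggests a script that uses only sh features.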

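Determining the running script's true directory, even when the script is invoked through a symlink located in a different directory:

If you can assume a Linux platform (or at least GNU readlink), use:

dirname -- "$(readlink -e -- "$BASH_SOURCE")"

Otherwise, a more complex solution with a helper function is needed; see my answer.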
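
To give a rough idea of what such a helper can look like, here is a minimal sketch that follows a chain of symlinks using only plain `readlink` (without the GNU-specific `-e`). It is not the helper from the answer referred to above: the name `resolve_dir` is hypothetical, and the sketch assumes a `readlink` utility is available and does not guard against circular links.

```shell
# Hypothetical helper: print the physical directory of the file that the
# given path ultimately points to, following symlinks one hop at a time.
resolve_dir() {
  local target=$1 dir link
  while [ -L "$target" ]; do
    # Physical directory that contains the current symlink.
    dir=$(cd -P -- "$(dirname -- "$target")" && pwd -P) || return
    link=$(readlink -- "$target") || return
    case $link in
      /*) target=$link ;;        # absolute link target
      *)  target=$dir/$link ;;   # relative link: resolve against the link's dir
    esac
  done
  # Subshell, so the caller's working directory is left unchanged.
  (cd -P -- "$(dirname -- "$target")" && pwd -P)
}
```

In a script this could be called as `resolve_dir "${BASH_SOURCE[0]}"` and the result passed to `cd`.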