awk: How to remove a pattern from many files

Here is my file:

...


<!--START: Google Analytics --->
<script type="text/javascript"
src="../src/goog/ga_body.js">
<!--END: Google Analytics --->
</body>
</html>
...

How can I delete everything between the START and END comments, including the comments themselves? In effect, this:

<!--START: Google Analytics --->
<script type="text/javascript"
src="../src/goog/ga_body.js">
<!--END: Google Analytics --->

would be gone, and this would be left (that is, nothing: the 4 lines would be replaced by nothing):

    <nothing here 4 lines deleted>

    </body>
    </html>

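I'm thinking of doing this in bash, so sed and awk are probably my best options, although python might be better.

Edit 1

Here is something I wrote earlier. It is probably very bad coding, but I will work from this find2PatternsAndDeleteTextInBetween.sh: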

# Here I want to find 2 patterns and delete what's in between
# this example works


# these are the 2 patterns I want to find, Start and End
# (some escape characters were needed for this to show properly in this format)
#<!-- Start of StatCounter Code for DoYourOwnSite -->
#  text would go here
#<!-- End of StatCounter Code for DoYourOwnSite -->>

#b="<!-- Start of StatCounter Code for DoYourOwnSite -->"

#b2="<!-- End of StatCounter Code for DoYourOwnSite -->"

#p1="PATTERN-1"
#p2="PATTERN-2"
p1="<!-- Start of StatCounter Code for DoYourOwnSite -->"
p2="<!-- End of StatCounter Code for DoYourOwnSite -->"
fname="*.html"

#this is current dir where the script is
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
echo "DIR is equal to:"
echo "$DIR"

#cd to the dir where I want to copy the files to:
cd "$DIR" || exit

# count the files that contain pattern 1 ($fname is left unquoted so the glob expands)
num_of_files_pattern1=$(grep -l -- "$p1" $fname | wc -l)

echo "fname(s) to apply the sed to:"
echo $fname
echo "num_of_files_pattern1 is:"
echo "$num_of_files_pattern1"

echo "Pattern1 is equal to:"
echo "$p1"

echo "Pattern2 is equal to:"
echo "$p2"

# this will find the pattern <\head> in all the .html files and place "This should appear before the closing head tag" before it
# it will also make a backup with .bak extension
#sed -i.bak '/<\\head>/i\This should appear before the closing head tag' *.html

echo "sed on the file"
# this does the head part
#sed '/PATTERN-1/,/PATTERN-2/d' *.txt # this works
#sed "/$p1/,/$p2/d" *.txt # this works
#sed "/$p1/,/$p2/d" $fname # this works
sed -i.bak "/$p1/,/$p2/d" $fname # this works

Edit 2

This is what I ended up with, but there is a more robust answer below:

# ------------------------------------------------------------------
# [author] find2PatternsAndDeleteTextInBetween.sh
#           Description
#           Here I want to find 2 patterns and delete what's in between
#           this example works
#
# EXAMPLE:
# this is the 2 patterns I want to find Start and End
# <!-- Start of StatCounter Code for DoYourOwnSite -->
#   text would go here
# <!-- End of StatCounter Code for DoYourOwnSite -->>
#
# ------------------------------------------------------------------
p1="<!--START: Google Analytics --->"
p2="<!--END: Google Analytics --->"
fname=".html"
echo "fname(s) to apply the sed to:"
echo *"$fname"
echo -e "\n"
echo "Pattern1 is equal to:"
echo -e "$p1\n"
echo "Pattern2 is equal to:"
echo -e "$p2\n"
echo -e "PWD is: $PWD\n"
echo "sed on the file"
#sed '/PATTERN-1/,/PATTERN-2/d' *.txt # this works
#sed "/$p1/,/$p2/d" *.txt # this works
#sed "/$p1/,/$p2/d" $fname # this works
sed -i.bak "/$p1/,/$p2/d" *"$fname" # this works

Something to consider:

$ awk '/<!--(START|END): Google Analytics --->/{f=!f;next} !f' file
...


</body>
</html>
...
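The awk one-liner above only prints to standard output. It toggles the flag f on every START/END marker line, skips the marker lines themselves, and prints other lines only while f is 0. A minimal sketch of applying it to several files while keeping a backup, assuming a writable temp directory; the function name strip_ga_block is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): remove the Google Analytics block
# from each file given, keeping the original as FILE.bak.
strip_ga_block() {
    for file in "$@"; do
        tmp=$(mktemp) || return 1
        # f flips on each marker line; `next` drops the marker itself,
        # and `!f` prints a line only while we are outside the block.
        if awk '/<!--(START|END): Google Analytics --->/ { f = !f; next } !f' "$file" > "$tmp"; then
            # Back up first, then overwrite in place so permissions are kept.
            cp -p -- "$file" "$file.bak" && cat "$tmp" > "$file"
        fi
        rm -f -- "$tmp"
    done
}
```

It would be called as strip_ga_block *.html from the directory that holds the files.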

sed is well suited to this task:

$ sed -i'.bak' '/<!--START/,/<!--END/d' file

If you have other lines with similar tags, include more of the pattern to make the match specific.

For multiple files, e.g. file1 .. file4:

$ for f in file{1..4}; do sed -i'.bak' '/<!--START/,/<!--END/d' "$f"; done
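Since sed -i rewrites the files, it may be worth previewing what the range address would delete before running it. A small sketch using the same address with -n and p instead of d; preview_deletion is a hypothetical name, not part of the answer:

```shell
#!/bin/sh
# Hypothetical helper: print, per file, the lines the range address matches,
# i.e. exactly the lines that the `d` command would delete.
preview_deletion() {
    for file in "$@"; do
        printf '== %s ==\n' "$file"
        # -n suppresses normal output; p prints only the addressed range.
        sed -n '/<!--START/,/<!--END/p' "$file"
    done
}
```

If the preview shows more than the intended block, the patterns are too loose and should be lengthened before the in-place run.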


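Judging by the script in your question, you already know how to use sed to delete the range of interest from a single file (sed -i.bak "/$p1/,/$p2/d" $fname), but you are looking for a robust way to process multiple files from a script (assuming bash).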

#!/usr/bin/env bash

# cd to the dir. in which this script is located.
# CAVEAT: Assumes that the script wasn't invoked through a *symlink*
#         located in a different dir.
cd -- "$(dirname -- "$BASH_SOURCE")" || exit

fpattern='*.html'     # specify source-file globbing pattern
shopt -s nullglob     # make sure that globbing expands to nothing if nothing matches
fnames=( $fpattern )  # expand to matching files and store in array
num_of_files_matching_pattern=${#fnames[@]} # count matching files
(( num_of_files_matching_pattern > 0 )) || exit # abort, if no files match

printf '%s\n%s\n' "Running from:" "$PWD"
printf '%s\n%s\n' "Pattern matching the files to process:" "$fpattern"
printf '%s\n%s\n' "# of matching files:" "$num_of_files_matching_pattern"

# Determine the range-endpoint-identifier-line regular expressions.
# CAVEAT: Make sure you escape any regular-expression metacharacters you want
#         to be treated as *literals*.
p1='^<!--START: Google Analytics --->$'
p2='^<!--END: Google Analytics --->$'

# Remove the range identified by its endpoints from all matching input files
# and save the original files with extension '.bak'
sed -i'.bak' "/$p1/,/$p2/d" "${fnames[@]}" || exit
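The CAVEAT about metacharacters in the script above can be handled mechanically when the markers are literal strings. A minimal sketch, assuming the expression is used as a basic regular expression inside a /.../ address as in the script; escape_sed_bre is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: backslash-escape the characters that are special in a
# basic regular expression ([ \ . * ^ $) plus the / address delimiter.
escape_sed_bre() {
    printf '%s\n' "$1" | sed 's|[[\\.*^$/]|\\&|g'
}

# Build anchored endpoint expressions from literal marker text.
start_marker='<!--START: Google Analytics --->'
end_marker='<!--END: Google Analytics --->'
p1="^$(escape_sed_bre "$start_marker")\$"
p2="^$(escape_sed_bre "$end_marker")\$"
```

The two markers used here contain no metacharacters, so p1 and p2 come out the same as the hand-written values; the helper only matters for markers containing characters such as . or /.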

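As an aside: I suggest not using the suffix .sh in script filenames:

  • The shebang line in the file is enough to tell the system which shell/interpreter to pass the script to.

  • If you don't specify a suffix, you are free to change the implementation later (e.g., to python) without breaking existing programs that rely on the script.

  • In the case at hand, assuming that bash is actually what is used, .sh would be misleading, because it suggests a script that uses sh features only.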

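Determining the running script's true directory, even when the script is invoked through a symlink located in a different directory:

  • If you can assume a Linux platform (or at least GNU readlink), use:

    dirname -- "$(readlink -e -- "$BASH_SOURCE")"
  • Otherwise, a more complex solution with a helper function is needed; see my answer.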
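For illustration only, here is a simplified sketch of what such a helper function might look like. It is not the solution from the linked answer; it assumes a readlink utility that prints a symlink's target (without relying on GNU's -e option), and script_dir is a made-up name:

```shell
#!/bin/sh
# Hypothetical helper: print the physical directory of the file that the given
# path ultimately refers to, following a chain of symlinks one hop at a time.
script_dir() {
    target=$1
    while [ -L "$target" ]; do
        # Directory containing the symlink, with directory symlinks resolved.
        dir=$(cd -P -- "$(dirname -- "$target")" && pwd -P) || return 1
        link=$(readlink -- "$target") || return 1
        case $link in
            /*) target=$link ;;        # absolute link target
            *)  target=$dir/$link ;;   # relative to the symlink's own directory
        esac
    done
    # Subshell, so the caller's working directory is left unchanged.
    (cd -P -- "$(dirname -- "$target")" && pwd -P)
}
```

In a bash script it could be called as script_dir "$BASH_SOURCE". Edge cases such as symlink loops are not handled in this sketch.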