Unix: Remove consecutive duplicate words from a file using awk or sed

My input file looks like this:

"true true, rohith Rohith;
cold burn, and fact and fact good good?"

The output should look like this:

"true, rohith Rohith;
cold burn, and fact and fact good?"

I am trying to do the same with awk, but I can't get the desired result.

awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s",$i,FS)}{printf("\n")}' input.txt

Can someone help me here?

Regards,
罗伊斯

Related discussion

Simple sed:

echo "true true, rohith Rohith; cold burn, and fact and fact good good?" | sed -r 's/(\w+) (\1)/\1/g'

Just match the same word via a backreference in sed:

sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'

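How it works:

:l - creates a label l to jump to. See tl below.
s - substitute
- /
- \(^\|[^[:alpha:]]\) - matches the beginning of the line or a non-alphabetic character. This way the next part matches a whole word, not just a suffix.
- \([[:alpha:]]\{1,\}\) - matches a word: one or more alphabetic characters.
- [^[:alpha:]]\{1,\} - matches a non-word: one or more non-alphabetic characters.
- \2 - matches the same text as the second \(...\), i.e. the word again.
- \($\|[^[:alpha:]]\) - matches the end of the line or a non-alphabetic character. This way we match the whole second word, not just a prefix of it.
- /
- \1\2\3 - replaces the match with <beginning of the line or non-alphabetic prefix character><the word><end of the line or non-alphabetic suffix character found>
- /
- g - replace globally. But because the regex never goes back over text it has already consumed, one pass only collapses two words at a time.
tl - jumps to label l if the last s command succeeded. It is here so that three identical words, such as true true true, are correctly replaced by a single true.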
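To make the role of the tl loop concrete, here is a small sketch that runs the same substitution once and then in a loop. The function names dedupe_once and dedupe_line are hypothetical, and the command assumes GNU sed, since \| in a basic regular expression is a GNU extension.

```shell
# Hypothetical helpers, assuming GNU sed (\| is a GNU extension in BRE).

# One pass of the substitution, without the loop.
dedupe_once() {
    sed 's/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g'
}

# The same substitution repeated via label l and tl until nothing changes.
dedupe_line() {
    sed -e ':l' \
        -e 's/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g' \
        -e 'tl'
}

printf 'true true true\n' | dedupe_once   # prints: true true
printf 'true true true\n' | dedupe_line   # prints: true
```

The single pass stops at two words because the separator after the second word is consumed as the third group, so the third word no longer has a leading boundary to match against.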

Without \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\), input such as true rue would be replaced by true, because the suffix rue rue would match.
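The partial-word problem can be reproduced with a stripped-down expression. This is only a sketch: unanchored and anchored are hypothetical names, and the unanchored expression is an illustrative simplification, not a command from the answer. GNU sed is assumed for the anchored form.

```shell
# Illustrative only: the word group with no boundary checks around it.
unanchored() {
    sed 's/\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\1/\1/g'
}

# The substitution from the answer, with both boundary groups (GNU sed assumed).
anchored() {
    sed 's/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g'
}

printf 'true rue\n' | unanchored   # prints: true      ("rue rue" matched inside the text)
printf 'true rue\n' | anchored     # prints: true rue  (left alone, as intended)
```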

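Below is another solution of mine, which also removes repeated words across lines.

My first solution was to use uniq. First I transform the input into pairs of the form <non-alphabetical sequence separating words encoded in hex> <word>. Then I run them through uniq -f1, which ignores the first field, and convert back. This will be very slow: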

# recreate input
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
# insert zero byte after each word and non-word
# the -z option is from GNU sed
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
# for each pair (non-word, word)
xargs -0 -n2 sh -c '
    # output hexadecimal representation of non-word
    printf "%s" "$1" | xxd -p | tr -d "\n"
    # and output space with the word
    printf " %s\n" "$2"
' -- |
# uniq ignores empty fields - so make sure field1 always has something
sed 's/^/-/' |
# uniq while ignoring first field
uniq -f1 |
# for each pair (non-word in hex, word)
xargs -n2 bash -c '
    # for a posix shell, use: printf "%s" "$1" | sed "s/^-//" | xxd -r -p
    # change non-word from hex to characters
    printf "%s" "${1:1}" | xxd -r -p
    # output word
    printf "%s" "$2"
' --
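The step that does the actual de-duplication in the pipeline above is uniq -f1. The sketch below feeds it a few hand-written intermediate lines (the hypothetical pairs helper stands in for the real xargs and xxd stages), so the effect of skipping the first field can be seen in isolation. The hex values are simply the encodings of " " (20) and ", " (2c20).

```shell
# Hand-written sample of the "<hex non-word> <word>" lines; the leading "-"
# mirrors the sed 's/^/-/' step that keeps field 1 non-empty.
pairs() {
    printf '%s\n' '- true' '-20 true' '-2c20 rohith' '-20 Rohith'
}

# uniq -f1 skips the first field when comparing adjacent lines,
# so only the word decides whether a line is a repeat.
pairs | uniq -f1
# prints:
# - true
# -2c20 rohith
# -20 Rohith
```

Because uniq keeps the first line of each run, the separator that survives is the one that preceded the first occurrence of the repeated word.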

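But then I noticed that sed does a good job of tokenizing the input: it places a zero byte between each word and non-word token, so the stream is easy to read. I can skip repeated words in awk by reading the zero-separated stream in GNU awk and comparing each word with the last one read: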

cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
gawk -v RS='\0' '
NR%2==1{
    nonword=$0
}
NR%2==0{
    if (length(lastword) && lastword != $0) {
        printf "%s%s", lastword, nonword
    }
    lastword=$0
}
END{
    printf "%s%s", lastword, nonword
}'
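To see the token stream that sed hands to gawk in the script above, the zero bytes can be made visible by translating them to a printable character. This is a sketch assuming GNU sed (for -z and \x00 in the replacement); tokens is a hypothetical helper name and the | marker is an arbitrary choice.

```shell
# Hypothetical helper: tokenize as in the script above, then show each
# zero byte as "|" so the alternation of non-words and words is visible.
tokens() {
    sed -E -z 's/[[:alpha:]]+/\x00&\x00/g' | tr '\0' '|'
}

printf 'true true, rohith\n' | tokens
# prints: |true| |true|, |rohith|
```

The stream starts with an empty non-word token, which is why odd-numbered records are non-words and even-numbered records are words in the awk program.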

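Instead of a zero byte, some other unique character can be used as the record separator, for example the ^ character. That way the script also works with non-GNU awk versions; it was tested with the mawk available on repl. The script is also shortened here by using shorter variable names: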

cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r 's/[[:alpha:]]+/^&^/g' |
awk -v RS='^' '
    NR%2 { n=$0 }
    NR%2-1 && length(l) && l != $0 { printf "%s%s", l, n }
    NR%2-1 { l=$0 }
    END { printf "%s%s", l, n }
'

Tested on repl. The snippet outputs:

true, rohith Rohith;
cold burn, and fact and fact good?
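For reuse, the separator-based pipeline can be wrapped in a shell function. The sketch below is a variation, not the answer's exact script, and rests on several assumptions of mine: the name dedupe_words is hypothetical; the separator is the control byte 0x01 rather than ^, on the assumption that the input never contains that byte; and the NR == 1 rule is an addition that prints any text before the first word (such as the opening quote in the question), which the scripts above appear to drop.

```shell
# Hypothetical wrapper; assumes the input contains no 0x01 bytes.
dedupe_words() {
    sep=$(printf '\001')
    # Surround every word with the separator, then let awk read
    # alternating non-word (odd) and word (even) records.
    sed -E "s/[[:alpha:]]+/${sep}&${sep}/g" |
    awk -v RS="$sep" '
        NR == 1 { printf "%s", $0; next }             # text before the first word
        NR % 2  { n = $0; next }                      # non-word after the last word
        length(l) && l != $0 { printf "%s%s", l, n }  # previous word was not repeated
        { l = $0 }
        END { printf "%s%s", l, n }
    '
}

printf '"true true, rohith Rohith;\ncold burn, and fact and fact good good?"\n' | dedupe_words
```

On the question's input, including the surrounding quotes, this should print the two lines the question asks for.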