Why does Java allow control characters in its identifiers?

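The Mystery

While investigating precisely which characters are permitted in Java identifiers, I stumbled upon something so curious that it seems nearly certain to be a bug.

I had expected Java identifiers to conform to the requirement that they start with characters that have the Unicode property ID_Start, followed by characters with the property ID_Continue, with an exception granted for leading underscores and dollar signs. That did not prove to be the case, and what I found is at extreme variance with that or with any other idea of a normal identifier that I have heard of.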

Short Demo

Consider the following demonstration, which proves that the ASCII ESC character (octal 033) is permitted in Java identifiers:

$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 ="i am escape: \033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[

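It is worse than that, though. Almost infinitely worse, in fact. Even NULs are permitted! So are thousands of other code points that are not even identifier characters. I have tested this on Solaris, Linux, and a Mac running Darwin, and all give the same results.

Long Demo

Here is a test program that will show all these unexpected code points that Java quite outrageously allows as part of a legal identifier name.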

#!/usr/bin/env perl
#
# test-java-idchars - find which bogus code points Java allows in its identifiers
#
#   usage: test-java-idchars [low high]
#   e.g.:  test-java-idchars 0 255
#
# Without arguments, tests Unicode code points
# from 0 .. 0x1000.  You may go further with a
# higher explicit argument.
#
# Produces a report at the end.
#
# You can ^C it prematurely to end the program then
# and get a report of its progress up to that point.
#
# Tom Christiansen
# tchrist@perl.com
# Sat Jan 29 10:41:09 MST 2011

use strict;
use warnings;

use encoding "Latin1";
use open IO => ":utf8";

use charnames ();

$| = 1;

my @legal;

my ($start, $stop) = (0, 0x1000);

if (@ARGV != 0) {
    if (@ARGV == 1) {
        for (($stop) = @ARGV) {
            $_ = oct if /^0/;   # support 0OCTAL, 0xHEX, 0bBINARY
        }
    }
    elsif (@ARGV == 2) {
        for (($start, $stop) = @ARGV) {
            $_ = oct if /^0/;
        }
    }
    else {
        die "usage: $0 [ [start] stop ]\n";
    }
}

for my $cp ( $start .. $stop ) {
    my $char = chr($cp);

    next if $char =~ /[\s\w]/;

    my $type = "?";
    for ($char) {
        $type = "Letter"      if /\pL/;
        $type = "Mark"        if /\pM/;
        $type = "Number"      if /\pN/;
        $type = "Punctuation" if /\pP/;
        $type = "Symbol"      if /\pS/;
        $type = "Separator"   if /\pZ/;
        $type = "Control"     if /\pC/;
    }
    my $name = $cp ? (charnames::viacode($cp) || "<missing>") : "NULL";
    next if $name eq "<missing>" && $cp > 0xFF;
    my $msg = sprintf("U+%04X %s", $cp, $name);
    print "testing \\p{$type} $msg...";
    open(TESTPROGRAM, ">:utf8", "testchar.java") || die $!;

print TESTPROGRAM <<"End_of_Java_Program";

public class testchar {
    public static void main(String argv[]) {
        String var_$char = "variable name ends in $msg";
        System.out.println(var_$char);
    }
}

End_of_Java_Program

    close(TESTPROGRAM) || die $!;

    system q{
        ( javac -encoding UTF-8 testchar.java \
            && \
          java -Dfile.encoding=UTF-8 testchar | grep variable \
        ) >/dev/null 2>&1
    };

    push @legal, sprintf("U+%04X", $cp) if $? == 0;

    if ($? && $? < 128) {
        print "<interrupted>\n";
        exit;  # from a ^C
    }

    printf "is %s in Java identifiers.\n",
        ($? == 0) ? uc "legal" : "forbidden";

}

END {
    print "Legal but evil code points: @legal\n";
}

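Here is a sample run of that program over the range U+0000 to U+0020; code points that Perl considers whitespace or word characters are skipped: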

$ perl test-java-idchars 0 0x20
testing \p{Control} U+0000 NULL...is LEGAL in Java identifiers.
testing \p{Control} U+0001 START OF HEADING...is LEGAL in Java identifiers.
testing \p{Control} U+0002 START OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0003 END OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0004 END OF TRANSMISSION...is LEGAL in Java identifiers.
testing \p{Control} U+0005 ENQUIRY...is LEGAL in Java identifiers.
testing \p{Control} U+0006 ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0007 BELL...is LEGAL in Java identifiers.
testing \p{Control} U+0008 BACKSPACE...is LEGAL in Java identifiers.
testing \p{Control} U+000B LINE TABULATION...is forbidden in Java identifiers.
testing \p{Control} U+000E SHIFT OUT...is LEGAL in Java identifiers.
testing \p{Control} U+000F SHIFT IN...is LEGAL in Java identifiers.
testing \p{Control} U+0010 DATA LINK ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+0011 DEVICE CONTROL ONE...is LEGAL in Java identifiers.
testing \p{Control} U+0012 DEVICE CONTROL TWO...is LEGAL in Java identifiers.
testing \p{Control} U+0013 DEVICE CONTROL THREE...is LEGAL in Java identifiers.
testing \p{Control} U+0014 DEVICE CONTROL FOUR...is LEGAL in Java identifiers.
testing \p{Control} U+0015 NEGATIVE ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0016 SYNCHRONOUS IDLE...is LEGAL in Java identifiers.
testing \p{Control} U+0017 END OF TRANSMISSION BLOCK...is LEGAL in Java identifiers.
testing \p{Control} U+0018 CANCEL...is LEGAL in Java identifiers.
testing \p{Control} U+0019 END OF MEDIUM...is LEGAL in Java identifiers.
testing \p{Control} U+001A SUBSTITUTE...is LEGAL in Java identifiers.
testing \p{Control} U+001B ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+001C INFORMATION SEPARATOR FOUR...is forbidden in Java identifiers.
testing \p{Control} U+001D INFORMATION SEPARATOR THREE...is forbidden in Java identifiers.
testing \p{Control} U+001E INFORMATION SEPARATOR TWO...is forbidden in Java identifiers.
testing \p{Control} U+001F INFORMATION SEPARATOR ONE...is forbidden in Java identifiers.
Legal but evil code points: U+0000 U+0001 U+0002 U+0003 U+0004 U+0005 U+0006 U+0007 U+0008 U+000E U+000F U+0010 U+0011 U+0012 U+0013 U+0014 U+0015 U+0016 U+0017 U+0018 U+0019 U+001A U+001B

And here is another demo:

$ perl test-java-idchars 0x600 0x700 | grep -i legal
testing \p{Control} U+0600 ARABIC NUMBER SIGN...is LEGAL in Java identifiers.
testing \p{Control} U+0601 ARABIC SIGN SANAH...is LEGAL in Java identifiers.
testing \p{Control} U+0602 ARABIC FOOTNOTE MARKER...is LEGAL in Java identifiers.
testing \p{Control} U+0603 ARABIC SIGN SAFHA...is LEGAL in Java identifiers.
testing \p{Control} U+06DD ARABIC END OF AYAH...is LEGAL in Java identifiers.
Legal but evil code points: U+0600 U+0601 U+0602 U+0603 U+06DD

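The Question

Can anyone please explain this seemingly insane behavior? There are many, many, many other inexplicably allowed code points all over the place, starting right off with U+0000, which is perhaps the strangest of all. If you run the program on the first 0x1000 code points, you do see certain patterns appear, such as allowing any and all code points with the property Currency_Symbol. But too much else is wholly inexplicable, at least to me.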
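
The Currency_Symbol pattern can also be checked from inside Java, without going through javac. The sketch below assumes that Character.getType and Character.isJavaIdentifierStart reflect what the compiler accepts; the class name CurrencyProbe and both helper methods are hypothetical names chosen for illustration.

```java
public class CurrencyProbe {
    // True when the code point has general category Sc (Currency_Symbol).
    static boolean isCurrency(int cp) {
        return Character.getType(cp) == Character.CURRENCY_SYMBOL;
    }

    // Counts currency symbols in [low, high] that Java would NOT accept
    // as the first character of an identifier.
    static int rejectedCurrencySymbols(int low, int high) {
        int rejected = 0;
        for (int cp = low; cp <= high; cp++) {
            if (isCurrency(cp) && !Character.isJavaIdentifierStart(cp)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Same range as the default range of the Perl test program.
        System.out.println("currency symbols rejected as identifier starts: "
                + rejectedCurrencySymbols(0, 0x1000));
    }
}
```

If the pattern holds, the count should be zero, since the documentation of isJavaIdentifierStart lists currency symbols among the accepted characters.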


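The Java Language Specification, section 3.8, defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter, among other conditions, accepts anything for which Character.isIdentifierIgnorable() is true, and that includes the non-whitespace control characters (including the whole C1 range; see that method's documentation for the full list).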
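
To make this concrete, here is a minimal sketch that asks those Character methods directly about a few of the code points from the test runs in the question. The class name IdentifierProbe and its describe helper are made up for illustration; only the Character calls come from the JDK.

```java
public class IdentifierProbe {
    // Hypothetical helper: reports how the JDK classifies one code point.
    static String describe(int cp) {
        return String.format("U+%04X start=%b part=%b ignorable=%b", cp,
                Character.isJavaIdentifierStart(cp),
                Character.isJavaIdentifierPart(cp),
                Character.isIdentifierIgnorable(cp));
    }

    public static void main(String[] args) {
        // NUL, ESC, VT, FS, NEL (a C1 control), dollar sign, underscore
        int[] samples = {0x0000, 0x001B, 0x000B, 0x001C, 0x0085, '$', '_'};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

Going by the Javadoc, ESC and NUL should report part=true only because they are ignorable, while VT and the information separators, which Java treats as whitespace, should report false throughout. That would match the LEGAL and forbidden lines in the question's output.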


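Another question could be: why should Java not allow control characters in identifiers?

A good principle when designing a language or other system is not to forbid anything without good reason, because you never know how it might be used, and the fewer rules implementors and users have to contend with, the better.

It is true that you certainly should not take advantage of this by actually embedding escapes in your variable names, and you will not see any popular libraries exposing classes with null characters in their names.

Certainly this could be abused, but it is not the job of a language designer to protect programmers from themselves in this way, any more than by enforcing proper indentation or well-chosen variable names.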


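I don't see what the big deal is. How does it affect you?

If a developer wants to obfuscate his code, he can do it with ASCII.

If a developer wants to make his code understandable, he will use the lingua franca of the industry: English. Not only are identifiers ASCII-only, they are also drawn from common English words. Otherwise, nobody is going to use or read his code, and he can use whatever crazy characters he likes.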